
Spatially subsample a dataset to produce samples of standard area and extent.

Usage

cookies(
  dat,
  xy,
  iter,
  nSite,
  r,
  weight = FALSE,
  crs = "epsg:4326",
  output = "locs"
)

Arguments

dat

A data.frame or matrix containing the coordinate columns xy and any associated variables, e.g. taxon names.

xy

A vector of two elements, specifying the name or numeric position of columns in dat containing coordinates, e.g. longitude and latitude. Coordinates for any shared sampling sites should be identical, and where sites are raster cells, coordinates are usually expected to be cell centroids.

iter

The number of spatial subsamples to return.

nSite

The quota of unique locations to include in each subsample.

r

Numeric value for the radius (km) defining the circular extent of each spatial subsample.

weight

Whether sites within the subsample radius should be drawn at random (weight = FALSE, default) or with probability inversely proportional to the square of their distance from the centre of the subsample region (weight = TRUE).

crs

Coordinate reference system as a GDAL text string, EPSG code, or object of class crs. Default is latitude-longitude (EPSG:4326).

output

Whether the returned data should be two columns of subsample site coordinates (output = 'locs') or the subset of rows from dat associated with those coordinates (output = 'full').

Value

A list of length iter. Each list element is a data.frame or matrix (matching the class of dat) with nSite observations. If output = 'locs' (default), only the coordinates of subsampling locations are returned. If output = 'full', all dat columns are returned for the rows associated with the subsampled locations.

If weight = TRUE, the first observation in each returned subsample data.frame corresponds to the seed point. If weight = FALSE, observations are listed in the random order in which they were drawn.

Details

The function takes a single location as a starting (seed) point and circumscribes a buffer of r km around it. Buffer circles that span the antemeridian (180 degrees longitude) are wrapped as a multipolygon to prevent artificial truncation. After standardising radial extent, sites are drawn within the circular extent until a quota of nSite is met. Sites are sampled without replacement, so a location is used as a seed point only if it is within r km distance of at least nSite-1 locations. The method is introduced in Antell et al. (2020) and described in detail in Methods S1 therein.
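
The seed-and-quota logic described above can be sketched in base R. This is a minimal illustration under stated assumptions, not the package's implementation: it assumes planar coordinates already expressed in km (so plain Euclidean distance applies), ignores antemeridian wrapping, and draws sites with equal probability. The names `pts_km` and `sketch_subsample` are hypothetical.

```r
# Hypothetical planar coordinates in km (not lat-long): 6 sites on a line
pts_km <- data.frame(x = c(0, 50, 100, 150, 600, 1000),
                     y = c(0,  0,   0,   0,   0,    0))

sketch_subsample <- function(coords, nSite, r) {
  d <- as.matrix(dist(coords))   # pairwise Euclidean distances (km)
  # a site is a viable seed if at least nSite sites (itself included)
  # lie within r km, i.e. it has at least nSite - 1 neighbours
  nInRange <- rowSums(d <= r)
  seeds <- which(nInRange >= nSite)
  if (length(seeds) == 0) stop("no seed has enough sites within r km")
  seed <- seeds[sample.int(length(seeds), 1)]
  pool <- which(d[seed, ] <= r)  # all sites inside the circular buffer
  # draw the quota without replacement, with equal probability
  picked <- pool[sample.int(length(pool), nSite)]
  coords[picked, , drop = FALSE]
}

set.seed(1)
sub <- sketch_subsample(pts_km, nSite = 3, r = 120)
```

In this toy dataset the two isolated sites (x = 600 and x = 1000) can never act as seeds, because neither has two other sites within 120 km.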

The probability of drawing each site within the standardised extent is either equal (weight = FALSE) or proportional to the inverse-square of its distance from the seed point (weight = TRUE), which clusters subsample locations more tightly.
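
To make the inverse-square weighting concrete, the following base R sketch converts a set of distances into draw probabilities. It is an illustration only: the distances are hypothetical, and the seed itself (distance zero) is left out of the weights on the assumption that it is included automatically, which is consistent with the seed being listed first in the output when weight = TRUE.

```r
# Hypothetical distances (km) from the seed to four other sites in the buffer
dist_km <- c(10, 20, 50, 100)

w <- 1 / dist_km^2   # inverse-square weights
p <- w / sum(w)      # normalise to probabilities for the first draw
round(p, 3)
#> [1] 0.769 0.192 0.031 0.008

set.seed(1)
# weighted draw of 2 of the 4 sites, without replacement
drawn <- sample(seq_along(dist_km), size = 2, prob = p)
```

A site at 10 km is four times as likely to be drawn first as a site at 20 km, which is why weighted subsamples cluster more tightly around the seed.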

For geodetic coordinates (latitude-longitude), distances are calculated along great circle arcs. For Cartesian coordinates, distances are calculated in Euclidean space, in units associated with the projection CRS (e.g. metres).
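
The difference between the two distance types can be illustrated in base R. This sketch uses the haversine formula on a sphere with an assumed mean Earth radius of 6371 km; the package may compute geodesic distances differently, and the helper names `haversine_km` and `euclid` are hypothetical.

```r
# Great-circle distance (km) via the haversine formula on a sphere
haversine_km <- function(lon1, lat1, lon2, lat2, R = 6371) {
  rad <- pi / 180
  dlat <- (lat2 - lat1) * rad
  dlon <- (lon2 - lon1) * rad
  a <- sin(dlat / 2)^2 + cos(lat1 * rad) * cos(lat2 * rad) * sin(dlon / 2)^2
  2 * R * asin(pmin(1, sqrt(a)))
}

# one degree of longitude spans less ground at higher latitude
d_equator <- haversine_km(140, 0, 141, 0)      # roughly 111 km
d_south   <- haversine_km(140, -60, 141, -60)  # roughly half of that

# Euclidean distance on planar coordinates, in the coordinates' own units
euclid <- function(x1, y1, x2, y2) sqrt((x2 - x1)^2 + (y2 - y1)^2)
d_planar <- euclid(0, 0, 3000, 4000)           # 5000, e.g. metres
```

Because a degree of longitude shrinks towards the poles, treating latitude-longitude values as planar coordinates would distort the subsample extent, which is why the two cases are handled separately.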

References

Antell GT, Kiessling W, Aberhan M, Saupe EE (2020). “Marine biodiversity and geographic distributions are independent on large scales.” Current Biology, 30(1), 115-121. doi:10.1016/j.cub.2019.10.065.

Examples

# generate occurrences: 10 lat-long points in modern Australia
n <- 10
x <- seq(from = 140, to = 145, length.out = n)
y <- seq(from = -20, to = -25, length.out = n)
pts <- data.frame(x, y)

# sample 5 sets of 3 occurrences within 200km radius
cookies(dat = pts, xy = 1:2, iter = 5,
        nSite = 3, r = 200)
#> [[1]]
#>          x         y
#> 5 142.2222 -22.22222
#> 3 141.1111 -21.11111
#> 4 141.6667 -21.66667
#> 
#> [[2]]
#>           x         y
#> 8  143.8889 -23.88889
#> 9  144.4444 -24.44444
#> 10 145.0000 -25.00000
#> 
#> [[3]]
#>           x         y
#> 8  143.8889 -23.88889
#> 10 145.0000 -25.00000
#> 9  144.4444 -24.44444
#> 
#> [[4]]
#>          x         y
#> 4 141.6667 -21.66667
#> 5 142.2222 -22.22222
#> 6 142.7778 -22.77778
#> 
#> [[5]]
#>          x         y
#> 1 140.0000 -20.00000
#> 5 142.2222 -22.22222
#> 3 141.1111 -21.11111
#>