Spatially subsample a dataset to produce samples of standard area and extent.
Arguments
- dat
A data.frame or matrix containing the coordinate columns xy and any associated variables, e.g. taxon names.
- xy
A vector of two elements, specifying the name or numeric position of columns in dat containing coordinates, e.g. longitude and latitude. Coordinates for any shared sampling sites should be identical, and where sites are raster cells, coordinates are usually expected to be cell centroids.
- iter
The number of spatial subsamples to return.
- nSite
The quota of unique locations to include in each subsample.
- r
Numeric value for the radius (km) defining the circular extent of each spatial subsample.
- weight
Whether sites within the subsample radius should be drawn at random (weight = FALSE, default) or with probability inversely proportional to the square of their distance from the centre of the subsample region (weight = TRUE).
- crs
Coordinate reference system as a GDAL text string, EPSG code, or object of class crs. Default is latitude-longitude (EPSG:4326).
- output
Whether the returned data should be two columns of subsample site coordinates (output = 'locs') or the subset of rows from dat associated with those coordinates (output = 'full').
Value
A list of length iter. Each list element is a data.frame or matrix (matching the class of dat) with nSite observations. If output = 'locs' (default), only the coordinates of subsampling locations are returned. If output = 'full', all dat columns are returned for the rows associated with the subsampled locations.
If weight = TRUE, the first observation in each returned subsample data.frame corresponds to the seed point. If weight = FALSE, observations are listed in the random order in which they were drawn.
Details
The function takes a single location as a starting (seed) point and circumscribes a buffer of r km around it. Buffer circles that span the antemeridian (180 degrees longitude) are wrapped as a multipolygon to prevent artificial truncation. After standardising radial extent, sites are drawn within the circular extent until a quota of nSite is met. Sites are sampled without replacement, so a location is used as a seed point only if it is within r km distance of at least nSite - 1 locations.
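The seed-eligibility rule can be illustrated with a minimal base-R sketch. This is not the package's internal code: haversine_km() and eligible_seeds() are hypothetical helpers written for illustration, the Earth is assumed to be a sphere of radius 6371 km, and each row of the input is assumed to be a unique site.

```r
# Great-circle (haversine) distance in km from one point to many points.
# Assumption: spherical Earth with a radius of 6371 km.
haversine_km <- function(lon1, lat1, lon2, lat2, radius = 6371) {
  to_rad <- pi / 180
  dlat <- (lat2 - lat1) * to_rad
  dlon <- (lon2 - lon1) * to_rad
  a <- sin(dlat / 2)^2 +
    cos(lat1 * to_rad) * cos(lat2 * to_rad) * sin(dlon / 2)^2
  2 * radius * asin(pmin(1, sqrt(a)))
}

# Hypothetical helper: TRUE for each site that has at least nSite - 1
# other sites within r km, i.e. each site that could serve as a seed.
eligible_seeds <- function(coords, nSite, r) {
  n_within <- vapply(seq_len(nrow(coords)), function(i) {
    d <- haversine_km(coords[i, 1], coords[i, 2], coords[, 1], coords[, 2])
    sum(d <= r) - 1  # subtract 1 to exclude the site itself
  }, numeric(1))
  n_within >= nSite - 1
}

# Same 10 points as in the Examples section
x <- seq(from = 140, to = 145, length.out = 10)
y <- seq(from = -20, to = -25, length.out = 10)
pts <- data.frame(x, y)

seeds_ok <- eligible_seeds(pts, nSite = 3, r = 200)
```

Raising nSite or lowering r shrinks the pool of eligible seeds, because fewer sites have enough neighbours inside the buffer.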
The method is introduced in Antell et al. (2020) and described in
detail in Methods S1 therein.
The probability of drawing each site within the standardised extent is either equal (weight = FALSE) or proportional to the inverse-square of its distance from the seed point (weight = TRUE), which clusters subsample locations more tightly.
For geodetic coordinates (latitude-longitude), distances are calculated along great-circle arcs. For Cartesian coordinates, distances are calculated in Euclidean space, in units associated with the projection CRS (e.g. metres).
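To make the two sampling schemes concrete, the following base-R sketch computes draw probabilities for a handful of sites around a seed. It is an illustration under stated assumptions rather than the function's actual implementation: haversine_km() is a hypothetical helper that assumes a spherical Earth of radius 6371 km, the seed and pool coordinates are made up, and the seed is kept out of the pool because its distance to itself is zero.

```r
# Great-circle (haversine) distance in km from one point to many points.
# Assumption: spherical Earth with a radius of 6371 km.
haversine_km <- function(lon1, lat1, lon2, lat2, radius = 6371) {
  to_rad <- pi / 180
  dlat <- (lat2 - lat1) * to_rad
  dlon <- (lon2 - lon1) * to_rad
  a <- sin(dlat / 2)^2 +
    cos(lat1 * to_rad) * cos(lat2 * to_rad) * sin(dlon / 2)^2
  2 * radius * asin(pmin(1, sqrt(a)))
}

# Illustrative seed point and candidate sites inside its buffer
seed_lon <- 142.2222
seed_lat <- -22.2222
pool <- data.frame(x = c(141.1111, 141.6667, 142.7778, 143.3333),
                   y = c(-21.1111, -21.6667, -22.7778, -23.3333))

# Distance (km) from the seed to each candidate site
d <- haversine_km(seed_lon, seed_lat, pool$x, pool$y)

# Analogue of weight = FALSE: every site is equally likely
p_equal <- rep(1 / nrow(pool), nrow(pool))

# Analogue of weight = TRUE: probability proportional to 1 / distance^2
w <- 1 / d^2
p_weighted <- w / sum(w)

# Draw two further sites without replacement, favouring nearby ones
set.seed(1)
drawn <- sample(nrow(pool), size = 2, replace = FALSE, prob = p_weighted)
```

Here the two nearest candidates receive roughly four times the probability of the two farthest, since they lie at about half the distance from the seed.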
References
Antell GT, Kiessling W, Aberhan M, Saupe EE (2020). “Marine biodiversity and geographic distributions are independent on large scales.” Current Biology, 30(1), 115-121. doi:10.1016/j.cub.2019.10.065.
Examples
# generate occurrences: 10 lat-long points in modern Australia
n <- 10
x <- seq(from = 140, to = 145, length.out = n)
y <- seq(from = -20, to = -25, length.out = n)
pts <- data.frame(x, y)
# sample 5 sets of 3 occurrences within a 200 km radius
cookies(dat = pts, xy = 1:2, iter = 5,
        nSite = 3, r = 200)
#> [[1]]
#> x y
#> 5 142.2222 -22.22222
#> 3 141.1111 -21.11111
#> 4 141.6667 -21.66667
#>
#> [[2]]
#> x y
#> 8 143.8889 -23.88889
#> 9 144.4444 -24.44444
#> 10 145.0000 -25.00000
#>
#> [[3]]
#> x y
#> 8 143.8889 -23.88889
#> 10 145.0000 -25.00000
#> 9 144.4444 -24.44444
#>
#> [[4]]
#> x y
#> 4 141.6667 -21.66667
#> 5 142.2222 -22.22222
#> 6 142.7778 -22.77778
#>
#> [[5]]
#> x y
#> 1 140.0000 -20.00000
#> 5 142.2222 -22.22222
#> 3 141.1111 -21.11111
#>