Spatially subsample a dataset based on minimum spanning trees connecting points within regions of set extent, with optional rarefaction to a site quota.
Arguments
- dat
A data.frame or matrix containing the coordinate columns xy and any associated variables, e.g. taxon names.
- xy
A vector of two elements, specifying the name or numeric position of columns in dat containing coordinates, e.g. longitude and latitude. Coordinates for any shared sampling sites should be identical, and where sites are raster cells, coordinates are usually expected to be cell centroids.
- iter
The number of spatial subsamples to return.
- nSite
The quota of unique locations to include in each subsample.
- distMax
Numeric value for the maximum diameter (km) allowed across locations in a subsample.
- nMin
Numeric value for the minimum number of sites to be included in every returned subsample. If nSite is supplied, nMin is ignored.
- crs
Coordinate reference system as a GDAL text string, EPSG code, or object of class crs. Default is latitude-longitude (EPSG:4326).
- output
Whether the returned data should be two columns of subsample site coordinates (output = 'locs') or the subset of rows from dat associated with those coordinates (output = 'full').
Value
A list of length iter. Each element is a data.frame (or matrix, if dat is a matrix and output = 'full').
If nSite is supplied, each element contains nSite observations.
If output = 'locs' (default), only the coordinates of subsampling locations are returned.
If output = 'full', all dat columns are returned for the rows associated with the subsampled locations.
Details
Lagomarcino and Miller (2012) developed an iterative approach of aggregating localities to build clusters based on convex hulls, inspired by species-area curve analysis (Scheiner 2003). Close et al. (2017, 2020) refined the approach and changed the proximity metric from minimum convex hull area to minimum spanning tree length. The present implementation adapts code from Close et al. (2020) to add an option for site rarefaction after cluster construction and to grow trees from random starting points iter number of times (instead of a deterministic, exhaustive iteration at every unique location).
The function takes a single location as a starting (seed) point; the seed and its nearest neighbour initiate a spatial cluster. The distance between the two points is the first branch in a minimum spanning tree for the cluster. The location with the shortest distance to any point already within the cluster is added next, and its distance (branch) is added to the total tree length. This iterative process continues until the largest distance between any two points in the cluster would exceed distMax km.
In the rare case that multiple candidate points are tied for minimum distance from the cluster, one point is selected at random as the next to include.
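The growth procedure described above can be sketched in base R. This is an illustrative sketch only, not the package's internal code: the helper names haversineKm() and growCluster() are hypothetical, and distances are computed with the haversine formula on a sphere of radius 6371 km, which is an assumption that may differ from how clustr() measures distance.

```r
# Great-circle distance in km between longitude/latitude points.
# Assumption: spherical Earth with radius 6371 km (haversine formula).
haversineKm <- function(lon1, lat1, lon2, lat2) {
  toRad <- pi / 180
  dLat <- (lat2 - lat1) * toRad
  dLon <- (lon2 - lon1) * toRad
  a <- sin(dLat / 2)^2 +
    cos(lat1 * toRad) * cos(lat2 * toRad) * sin(dLon / 2)^2
  2 * 6371 * asin(pmin(1, sqrt(a)))
}

# Grow one cluster from the row index `seed` of `coords`
# (two columns: longitude, latitude). Hypothetical helper.
growCluster <- function(coords, seed, distMax) {
  n <- nrow(coords)
  # pairwise distance matrix between all sites
  d <- outer(seq_len(n), seq_len(n), function(i, j) {
    haversineKm(coords[i, 1], coords[i, 2], coords[j, 1], coords[j, 2])
  })
  inClust <- seed
  treeLen <- 0
  repeat {
    out <- setdiff(seq_len(n), inClust)
    if (length(out) == 0) break
    # shortest branch from each outside site to any cluster member
    link <- apply(d[out, inClust, drop = FALSE], 1, min)
    cand <- out[link == min(link)]
    # ties for minimum distance are broken at random
    nxt <- cand[sample.int(length(cand), 1)]
    # stop if adding the site would push the cluster diameter past distMax
    if (max(d[nxt, inClust]) > distMax) break
    inClust <- c(inClust, nxt)
    treeLen <- treeLen + min(link)
  }
  list(sites = inClust, treeLength = treeLen)
}

# ten points along a diagonal transect, as in the Examples section
pts <- data.frame(x = seq(140, 145, length.out = 10),
                  y = seq(-20, -25, length.out = 10))
res <- growCluster(pts, seed = 1, distMax = 400)
res$sites       # row indices of the sites in the cluster, in order added
res$treeLength  # summed branch length (km)
```

Checking only the new site's distances to current members is enough to enforce the diameter limit, because every pair already in the cluster passed the same check when it was added.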
Any tree with fewer than nMin points is prohibited.
If nSite is supplied, the nMin argument is ignored, and any tree with fewer than nSite points is prohibited. After building a tree as described above, a random set of nSite points within the cluster is taken (without replacement).
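The size rule and the rarefaction step can be illustrated as follows. This is a minimal sketch under stated assumptions: rarefyCluster() is a hypothetical helper, its input is a vector of site indices for one finished cluster, and the default nMin = 3 is an arbitrary choice for the illustration rather than a documented default.

```r
# Apply the minimum-size rule to one cluster and optionally rarefy it.
# Returns NULL when the cluster is too small to be used.
rarefyCluster <- function(clusterSites, nSite = NULL, nMin = 3) {
  # when nSite is supplied it replaces nMin as the size threshold
  quota <- if (is.null(nSite)) nMin else nSite
  if (length(clusterSites) < quota) return(NULL)
  # without a site quota, the whole cluster is kept
  if (is.null(nSite)) return(clusterSites)
  # draw nSite sites without replacement; indexing by position avoids
  # sample()'s special treatment of a length-one numeric input
  clusterSites[sample.int(length(clusterSites), nSite)]
}

cluster <- c(6, 7, 8, 9, 10)          # made-up site indices
rarefyCluster(cluster, nSite = 4)     # four of the five sites, random order
rarefyCluster(cluster)                # all five sites (5 >= nMin)
rarefyCluster(cluster[1:3], nSite = 4) # NULL: fewer sites than the quota
```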
The nSite argument makes clustr() comparable with cookies() in that it spatially standardises both extent and area/locality number.
clustr() is designed for the case where iter is much larger than the number of unique localities. Internal code first calculates the full minimum spanning tree at every viable starting point and only then samples those trees (i.e. resamples and optionally rarefies) for the specified number of iterations. Because of this sequence, total run-time increases only marginally even as iter increases greatly. However, if there are a large number of sites, particularly densely spaced sites, the calculations will be slow even for a small number of iterations.
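The two-stage design (build once per starting point, then sample many times) might look roughly like the sketch below. Everything here is hypothetical: buildCluster() is a cheap stand-in for the real tree-building step that simply takes all sites within distMax / 2 of the seed along a made-up one-dimensional transect, so it shows only the order of operations, not the actual algorithm.

```r
set.seed(1)

# Hypothetical stand-in for the costly tree-building step: all sites
# within distMax / 2 of the seed, so no pair is more than distMax apart.
buildCluster <- function(seed, position, distMax) {
  which(abs(position - position[seed]) <= distMax / 2)
}

position <- c(0, 50, 120, 130, 300, 310, 900)  # made-up site positions (km)
distMax <- 200
nSite <- 2
iter <- 1000

# Stage 1 (costly, runs once per unique site): a cluster for every seed
clusters <- lapply(seq_along(position), buildCluster,
                   position = position, distMax = distMax)
# discard starting points whose cluster cannot meet the site quota
viable <- Filter(function(cl) length(cl) >= nSite, clusters)

# Stage 2 (cheap, runs iter times): pick a precomputed cluster and rarefy
subsamples <- replicate(iter, {
  cl <- viable[[sample.int(length(viable), 1)]]
  cl[sample.int(length(cl), nSite)]
}, simplify = FALSE)

length(viable)      # number of usable starting points
length(subsamples)  # equals iter
```

Because stage 1 does not depend on iter, raising iter only adds inexpensive sampling calls, whereas adding sites enlarges the costly stage.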
References
Antell GT, Kiessling W, Aberhan M, Saupe EE (2020). “Marine biodiversity and geographic distributions are independent on large scales.” Current Biology, 30(1), 115-121. doi:10.1016/j.cub.2019.10.065.
Close RA, Benson RB, Upchurch P, Butler RJ (2017). “Controlling for the species-area effect supports constrained long-term Mesozoic terrestrial vertebrate diversification.” Nature Communications, 8(1), 1-11. doi:10.1038/ncomms15381.
Close RA, Benson RB, Saupe EE, Clapham ME, Butler RJ (2020). “The spatial structure of Phanerozoic marine animal diversity.” Science, 368(6489), 420-424. doi:10.1126/science.aay8309.
Lagomarcino AJ, Miller AI (2012). “The relationship between genus richness and geographic area in Late Cretaceous marine biotas: Epicontinental sea versus open-ocean-facing settings.” PLoS ONE, 7(8), e40472. doi:10.1371/journal.pone.0040472.
Scheiner SM (2003). “Six types of species-area curves.” Global Ecology and Biogeography, 12(6), 441-447. doi:10.1046/j.1466-822X.2003.00061.x.
Examples
# generate occurrences: 10 lat-long points in modern Australia
n <- 10
x <- seq(from = 140, to = 145, length.out = n)
y <- seq(from = -20, to = -25, length.out = n)
pts <- data.frame(x, y)
# sample 5 sets of 4 locations no more than 400 km across
clustr(dat = pts, xy = 1:2, iter = 5,
       nSite = 4, distMax = 400)
#> [[1]]
#> x y
#> 10 145.0000 -25.00000
#> 8 143.8889 -23.88889
#> 9 144.4444 -24.44444
#> 6 142.7778 -22.77778
#>
#> [[2]]
#> x y
#> 3 141.1111 -21.11111
#> 5 142.2222 -22.22222
#> 4 141.6667 -21.66667
#> 2 140.5556 -20.55556
#>
#> [[3]]
#> x y
#> 6 142.7778 -22.77778
#> 10 145.0000 -25.00000
#> 7 143.3333 -23.33333
#> 8 143.8889 -23.88889
#>
#> [[4]]
#> x y
#> 8 143.8889 -23.88889
#> 4 141.6667 -21.66667
#> 7 143.3333 -23.33333
#> 6 142.7778 -22.77778
#>
#> [[5]]
#> x y
#> 7 143.3333 -23.33333
#> 10 145.0000 -25.00000
#> 9 144.4444 -24.44444
#> 8 143.8889 -23.88889
#>