Cluster localities within regions of nearest neighbours

Spatially subsample a dataset based on minimum spanning trees connecting points within regions of set extent, with optional rarefaction to a site quota.

Usage

clustr(
  dat,
  xy,
  iter,
  nSite = NULL,
  distMax,
  nMin = 3,
  crs = "epsg:4326",
  output = "locs"
)

Arguments

dat: A data.frame or matrix containing the coordinate columns xy and any associated variables, e.g. taxon names.
xy: A vector of two elements, specifying the name or numeric position of columns in dat containing coordinates, e.g. longitude and latitude. Coordinates for any shared sampling sites should be identical, and where sites are raster cells, coordinates are usually expected to be cell centroids.
iter: The number of spatial subsamples to return.
nSite: The quota of unique locations to include in each subsample.
distMax: Numeric value for the maximum diameter (km) allowed across locations in a subsample.
nMin: Numeric value for the minimum number of sites to be included in every returned subsample. If nSite is supplied, nMin is ignored.
crs: Coordinate reference system as a GDAL text string, EPSG code, or object of class crs. Default is latitude-longitude (EPSG:4326).
output: Whether the returned data should be two columns of subsample site coordinates (output = 'locs') or the subset of rows from dat associated with those coordinates (output = 'full').

Value

A list of length iter. Each element is a data.frame (or matrix, if dat is a matrix and output = 'full'). If nSite is supplied, each element contains nSite unique locations. If output = 'locs' (default), only the coordinates of subsampling locations are returned. If output = 'full', all dat columns are returned for the rows associated with the subsampled locations.

Details

Lagomarcino and Miller (2012) developed an iterative approach of aggregating localities to build clusters based on convex hulls, inspired by species-area curve analysis (Scheiner 2003). Close et al. (2017, 2020) refined the approach and changed the proximity metric from minimum convex hull area to minimum spanning tree length. The present implementation adapts code from Close et al. (2020) in two ways: it adds an option for site rarefaction after cluster construction, and it grows trees from random starting points iter times (instead of iterating deterministically and exhaustively over every unique location).

The function takes a single location as a starting (seed) point; the seed and its nearest neighbour initiate a spatial cluster. The distance between the two points is the first branch in a minimum spanning tree for the cluster. The location with the shortest distance to any point already within the cluster is added next, and its distance (branch) is added to the total tree length. This iterative process continues until adding another point would make the largest distance between any two points in the cluster exceed distMax km. In the rare case that multiple candidate points are tied for minimum distance from the cluster, one point is selected at random as the next to include. Any tree with fewer than nMin points is prohibited.
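
The growth procedure described above can be sketched in base R. This is a minimal illustration only, not the package's internal code: growCluster() is a hypothetical helper, and it assumes planar coordinates already expressed in km so that Euclidean distance applies, whereas clustr() measures distances according to crs.

```r
# Hypothetical helper (not part of the package): grow one cluster from a seed.
# Assumes planar coordinates in km, so Euclidean distance is appropriate.
growCluster <- function(coords, seed, distMax) {
  d <- as.matrix(dist(coords))   # pairwise distances between all sites
  inClust <- seed                # row indices of sites already in the cluster
  treeLength <- 0                # running sum of branch lengths
  repeat {
    out <- setdiff(seq_len(nrow(coords)), inClust)
    if (length(out) == 0) break
    # shortest distance from each outside site to any site in the cluster
    dToClust <- apply(d[out, inClust, drop = FALSE], 1, min)
    cand <- out[dToClust == min(dToClust)]
    # break ties between equally close candidates at random
    nxt <- if (length(cand) > 1) sample(cand, 1) else cand
    # stop if adding the site would push the cluster diameter past distMax
    grown <- c(inClust, nxt)
    if (max(d[grown, grown]) > distMax) break
    inClust <- grown
    treeLength <- treeLength + min(dToClust)
  }
  list(sites = inClust, treeLength = treeLength)
}

# five sites spaced 100 km apart along a line, seeded at the first site
line5 <- data.frame(x = c(0, 100, 200, 300, 400), y = 0)
growCluster(line5, seed = 1, distMax = 250)
```

With distMax = 250, the cluster stops at three sites, because adding the fourth would stretch the diameter to 300 km.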

If nSite is supplied, the nMin argument is ignored, and any tree with fewer than nSite points is prohibited. After building a tree as described above, a random set of nSite points within the cluster is taken (without replacement). The nSite argument makes clustr() comparable with cookies() in that it spatially standardises both extent and area/locality number.
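
The rarefaction step amounts to a draw without replacement from the sites in a finished cluster. A minimal sketch, assuming the cluster is held as a vector of site indices; rarefySites() is a hypothetical name used for illustration and is not a function in the package.

```r
# Hypothetical helper: rarefy the sites of one cluster to a fixed quota.
rarefySites <- function(sites, nSite) {
  # a cluster with fewer sites than the quota is not allowed
  if (length(sites) < nSite) return(NULL)
  # draw nSite positions without replacement (sample.int's default)
  sites[sample.int(length(sites), nSite)]
}

# a cluster of 6 sites rarefied to a quota of 4
rarefySites(sites = c(2, 3, 4, 5, 6, 7), nSite = 4)
```

Indexing through sample.int() rather than calling sample() on the site vector avoids the base R behaviour where sample(x, ...) treats a length-one numeric x as the range 1:x.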

clustr() is optimised on the assumption that iter is much larger than the number of unique localities. Internal code first calculates the full minimum spanning tree at every viable starting point, then samples those trees (i.e. resamples and optionally rarefies) for the specified number of iterations. Because the trees are built only once, total run-time increases only marginally even as iter increases greatly. However, if there are a large number of sites, particularly a large number of densely-spaced sites, the calculations will be slow even for a small number of iterations.
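
The build-once, resample-many design might be sketched as follows. This is an illustration of the general pattern under stated assumptions, not the package's internals: resampleClusters() and buildFun are hypothetical names, and buildFun stands in for whatever routine grows a cluster from a given seed.

```r
# Hypothetical sketch: build one cluster per seed, then resample the results.
resampleClusters <- function(seeds, buildFun, iter, nMin = 3) {
  # expensive step, performed once per seed regardless of iter
  trees <- lapply(seeds, buildFun)
  # keep only clusters that meet the minimum size
  viable <- Filter(function(s) length(s) >= nMin, trees)
  if (length(viable) == 0) stop("no viable clusters")
  # cheap step, performed iter times: pick stored clusters with replacement
  picks <- sample.int(length(viable), iter, replace = TRUE)
  viable[picks]
}

# stand-in builder for demonstration: seed s yields a 'cluster' of sites 1..s
toyBuild <- function(s) seq_len(s)
subsamples <- resampleClusters(seeds = 1:5, buildFun = toyBuild, iter = 20)
length(subsamples)
```

In this arrangement the cost of the first step depends on the number and spacing of sites, while the second step scales with iter but is only a lookup, which is consistent with the run-time behaviour described above.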

References

Antell2020divvy

Close2017divvy

Close2020divvy

Lagomarcino2012divvy

Scheiner2003divvy

Examples

# generate occurrences: 10 lat-long points in modern Australia
n <- 10
x <- seq(from = 140, to = 145, length.out = n)
y <- seq(from = -20, to = -25, length.out = n)
pts <- data.frame(x, y)

# sample 5 sets of 4 locations no more than 400 km across
clustr(dat = pts, xy = 1:2, iter = 5,
       nSite = 4, distMax = 400)
#> [[1]]
#>           x         y
#> 10 145.0000 -25.00000
#> 8  143.8889 -23.88889
#> 9  144.4444 -24.44444
#> 6  142.7778 -22.77778
#> 
#> [[2]]
#>          x         y
#> 3 141.1111 -21.11111
#> 5 142.2222 -22.22222
#> 4 141.6667 -21.66667
#> 2 140.5556 -20.55556
#> 
#> [[3]]
#>           x         y
#> 6  142.7778 -22.77778
#> 10 145.0000 -25.00000
#> 7  143.3333 -23.33333
#> 8  143.8889 -23.88889
#> 
#> [[4]]
#>          x         y
#> 8 143.8889 -23.88889
#> 4 141.6667 -21.66667
#> 7 143.3333 -23.33333
#> 6 142.7778 -22.77778
#> 
#> [[5]]
#>           x         y
#> 7  143.3333 -23.33333
#> 10 145.0000 -25.00000
#> 9  144.4444 -24.44444
#> 8  143.8889 -23.88889
#>