How subsampling works
G. T. Antell
Source:vignettes/articles/subsampling-concept-walkthrough.Rmd
subsampling-concept-walkthrough.Rmd
Circular subsampling
The divvy
package implements circular subsampling with
the cookies()
function, so-named because the routine
iteratively cuts equal-area sections out of the sampled world, just as a
cookie cutter turns an irregular sheet of dough into a batch of
equivalent shapes. The method was developed in Antell et al. (2020).
Basic procedure
The subsampling function operates on rasterised data, i.e. taxon occurrences allocated to spatial grid cells. Within a cell, a taxon is present or absent; the subsampling routine ignores abundance counts or duplicate occurrences. Cells could be divided by latitude-longitude, e.g. at 1-degree resolution, but subsampling would be fairer if the cells have equal area regardless of latitudinal position. The Equal Earth projection is an example of an equal-area raster system.
The procedure runs iteratively, returning a specified number of subsamples. The process of collecting a single subsample is demonstrated below. First, a single occupied cell is selected at random. This starting cell is the ‘seed point’.
To control the spatial extent of sampling, the seed cell is buffered to a circular area of given radius. This constraint is necessary for any type of diversity analysis because it helps standardise beta diversity, the amount of taxonomic turnover from increasingly distant sites.
From this step onward, cells are considered for the subsample only if their centroid coordinates fall within the buffer area. Included cells make up the pool for site rarefaction. The encompassing circle is laid down indiscriminately with regard to geography, so the pool size will vary with the amount of sea/land coverage in the region as well as the amount of scientific effort to discover and collect from sites. It is crucial to standardise the total area and number of sites because diversity is tied to sample area through the species–area effect: regions containing more sites will tend to return more taxa regardless of true taxonomic richness.
The final step of subsampling is selection of a given number of cells from the available pool. If cells are equal in area, this step standardises the total spatial area covered by the subsample. Cells are drawn without replacement, so it is possible to obtain a subsample only if there are at least as many occupied cells in the region as the desired site quota. Cells are considered viable as seed points if they lead to site pools of sufficient size; the subsampling function finds and excludes inviable cells a priori to draw subsamples efficiently from feasible regions.
Depending on the argument specified for function output, the
procedure returns the coordinates of the subsampled cells
(output = 'locs'
) or the data rows for occurrences
contained by those cells (output = 'full'
). The entire
subsampling routine is repeated for a specified number of iterations. A
given cell can be selected as seed point multiple times, so it is
possible for two or more subsamples to contain identical subsets of
cells. In general, occasional duplication is inherent to random
subsampling and will be inconsequential for later analysis. However, if
the original observations are very limited in number and extent, it may
be prudent to expand the radius and/or reduce the required cell quota to
reduce the incidence of identical subsamples. To check there are a
healthy number of unique site combinations relative to the number of
replications, the analyst could compare lists of included sites between
subsamples.
Weighted cell selection
The baseline method of circular subsampling described above selects cells at random within a radial constraint. A modification to the method selects cells probabilistically, with weight inversely proportional to squared distance from the central seed cell. This probabilistic approach clusters cells in the subsample more tightly while avoiding being completely deterministic; effectively, the modification provides an intermediate subsampling strategy between baseline circular and nearest-neighbor subsampling. If occupied cells are sparse, cells at the periphery of the buffer will be drawn to fill the cell quota despite their lower selection probability. The weighted cell subsampling routine was used to generate main-text results in Antell et al. (2020), although ultimately the results from unweighted and weighted subsampling were equivalent.
Nearest-neighbor subsampling
The clustr()
function in divvy
subsamples
sites according to proximity, a procedure adapted from those introduced
by Close et al. (2017, 2020).
Basic procedure
The clustr()
function groups sites together, starting
with the closest coordinates to the starting point. As with the cookies
function, the clustering routine ignores abundance counts or duplicate
occurrences, and the process of generating a subsample begins with
random selection of an occupied site as a seed point.
The function calculates the distance between the seed point and all other occupied sites. The closest site is linked to the seed (the first branch in a minimum spanning tree connecting subsample sites). The function again (1) calculates the pairwise distances between each point in the included set and all other occupied sites and then (2) adds the nearest neighbor to the set. The site cluster grows deterministically until the maximum diameter across the point set reaches a specified limit. Below is an illustration of the minimum spanning tree connecting all occupied sites in the example region.
Like the radial constraint in circular subsampling, the cap on cluster diameter attempts to standardise beta diversity (spatial turnover in taxa). Applying a maximum distance of 4 cell-width units (twice the radius used in the above example for cookies), some sites are excluded from the cluster built on this seed point.
As with circular subsampling, the analyst should experiment with diameter threshold values to choose a spatial bounding extent small enough to capture regional diversity differences but large enough to allow most sites around the world/study area to be gathered into clusters, rather than sit outside them.
The function returns the final cluster coordinates (or the occurrences located thereat, if requested), as one element in the list output. As many subsamples are returned as specified for the number of iterations. Due to the stochastic nature of drawing seed points and the finite number of sites available to aggregate, it is possible for multiple subsamples to contain identical site combinations. Site rarefaction (described below) reduces the incidence of duplicate clusters by further subsampling, but it is possible for the same sites to be drawn by chance, as with circular subsampling.
The original versions of nearest-neighbor subsampling grouped
clusters by continental region and nested analysis within these regions
before making global comparisons Close et al.
(2020). The exact method of cutting trees into continental
regions differed between the two papers, but regional assignments were
somewhat automated by looking up modern-day country codes associated
with occurrence coordinates. These approaches leave decisions for an
analyst about how to track bioregions across time and drifting tectonic
plates. The divvy
implementation of nearest-neighbor
subsampling is more like bootstrapping, where all subsamples are treated
equivalently, irrespective of position; the full set of iterations may
be analysed in an automated and simultaneous way.
Site rarefaction
Original versions of nearest-neighbor subsampling return all sites
within a cluster that meets size and completeness specifications Close et al. (2020). As noted for circular
subsampling above, however, the number of sites can vary widely between
clusters of similar extent. On average, more taxa will be observed in
clusters that include more sites/area because of the species–area
effect. Rarefaction of references or collections (defined in
Paleobiology Database data structures) is one strategy to attempt to
mitigate differences in spatial coverage within regional subsamples, but
the more direct approach is standardising the number of sites within
clusters. Site/cell rarefaction is always included under the
cookies()
function, but it is an optional choice with the
clustr()
function, since it was absent in the original
published uses. Forgoing site rarefaction increases compatibility with
how results were generated in the original published code. Implementing
site rarefaction is recommended and improves comparability between
circular and nearest-neighbor subsampling routines. In the example
below, occupied cells are drawn randomly without replacement from a full
cluster to reach a quota of 5 sites in the final subsample.