library(gstudio)
data(arapat)
<- data.frame( Allele=LETTERS[1:4], Frequency=0.25)
f
f#> Allele Frequency
#> 1 A 0.25
#> 2 B 0.25
#> 3 C 0.25
#> 4 D 0.25
19 Rarefaction
The primary reason for looking at diversity is to perform some comparison, which provides some insights into the biological and/or demographic processes influencing your data. Without a basis for comparison, diversity estimates are just numbers. However, deriving an estimate of diversity is a statistical sampling process and as such we must be aware of the consequences our sampling regime has on the interpretation of the data. This is where rarefaction comes in, a technique commonly used in ecology when comparing species richness among groups.
Here is an example of the problem sampling may interject into your analyses. Consider a single locus with four alleles.
Selected as a random sample from an infinite population. The first sample has 5 individuals.
<- make_population( f,N=5)
pop1 <- genetic_diversity( pop1, mode="Ae" )
ae1
ae1#> Locus Ae
#> 1 Locus 3.846154
And the second one has 100 individuals.
<- make_population( f, N=100 )
pop2 <- genetic_diversity( pop2 )
ae2
ae2#> Locus Ae
#> 1 Locus 3.9992
The difference in estimated diversity among these groups are abs( ae1 - ae2 ) = 0.15
. Is this statistically different or are they the same? Is it just because we sampled more individuals in the second set that we get higher values of \(A_e\)? Consider the MP20
locus in the beetle data set, it has a total of 19 alleles present. If we subsample this data and estimate the number of observed alleles, we see that there is an asymptotic relationship between sampling effort and estimates of allelic diversity.
Redo this simulation
Here is the code for estimating frequency independent diversity, \(A\), using these data.
library( ggplot2 )
<- arapat$MP20
loci <- c(2,5,10,15,20,50,100)
sz <- rep( sz, each=20 )
sample_sizes <- rep(NA,length(sample_sizes))
Ae
for( i in 1:length(Ae)){
<- sample( loci, size=length(loci), replace=FALSE)
loci <- genetic_diversity( loci[1:sample_sizes[i]], mode="A" )$A[1]
Ae[i]
}
data.frame( N = factor(sample_sizes, ordered=TRUE),
|>
Ae) ggplot( aes(N,Ae) ) +
geom_boxplot()
The ‘curvy’ nature of this relationship shows a few things.
It takes a moderate sample size to capture the main set of alleles in the data set. If we are looking at allocating sampling using only 5 individuals per locale, then we are not going to get the majority of the alleles present.
For the rare alleles, you really need to grab large at-site samples if estimates of diversity are the main component of what you are doing. Do rare alleles aid in uncovering the biological processes you are interested in studying? They may or may not.
For most purposes, we will use all the samples we have collected. In many cases though, some locales may not have as many samples as other ones. So, even with these data, if I have one locale with 10 samples and another with 50, how can I determine if the differences observed are due to true differences in the underlying diversity and which are from my sampling? Just as in testing for HWE, we can use our new friend permutation to address the differences.
Rarefaction is the process of subsampling a larger dataset in smaller chunks such that we can estimate diversity among groups using the same number of individuals. Here is an example in the beetle dataset where I am going to look at differences in diversity among samples (\(N = 75\)) collected in the cape regions of Baja California
<- genetic_diversity( arapat[ arapat$Species=="Cape", "WNT"] )
ae.cape
ae.cape#> Locus Ae
#> 1 Locus 1.305052
and compare those to the genetic diversity observed from a smaller collection of individuals sampled from mainland Mexico (\(N=36\)).
<- genetic_diversity( arapat[ arapat$Species=="Mainland", "WNT"] )
ae.mainland
ae.mainland#> Locus Ae
#> 1 Locus 1.033889
The observed difference in effective allelic diversity, \(A_{e,mainland} == A_{e,cape}\), could be because the Cape region of Baja California is more diverse or it could be because there are twice as many individuals in that sample.
To perform a rarefaction on these data, we do the following:
Use the size of the smallest population (\(N\)) as the sample size for all estimates. Randomly sample individuals, without replacement, from the larger dataset in allocations of size \(N\).
Estimate diversity parameters on these subsamples and repeat to create a ‘null distribution’ of estimated diversity values.
Compare your observed value in the smallest population to that distribution created by subsampling the larger population.
From the data set, this is done by
<- arapat[ arapat$Species=="Cape","WNT"]
cape.pop <- rarefaction( cape.pop, mode="Ae",size=36)
null.ae mean(null.ae)
#> [1] 1.306928
So even if the samples sizes are the same, the mean level of diversity remains relatively constant. The range in diversity
range(null.ae)
#> [1] 1.000000 1.769283
is quite large. Since this estimate is frequency based, random samples of alleles change the underlying estimate of Ae during each permutation.
The observed estimate of diversity in the Mainland populations does fall within this range. However, the null hypothesis states that \(A_{e,mainland} = A_{e,cape}\) and if this is true, once we standardize sample size, we can take the distribution of permuted Ae values as a statement about what we should see if the null hypothesis were true. As such, we can treat it probabilistically and estimate the probability that \(A_e\), Mainland is drawn from this distribution.
<- c( null.ae, ae.mainland[1,1])
null.ae <- sum( ae.mainland <= null.ae ) / ( length(null.ae) )
P
P#> [1] 0.001
Or graphically, it can be depicted as below.
19.1 Mapping Diversity
Estimating diversity is great and being able to compare two or more groups for their levels of diversity is even better. But often we are looking for spatial patterns in our data. Both R and gstudio provide easy interfaces for plotting data and later in the text we will see how to integrate raster and vector data into our workflows for more sublime approaches to characterizing population genetic processes. In the mean time, it is amazingly easy to use basic plotting commands to get pretty informative output. In this example, I extract the mean coordinate of each stratum in the arapat dataset and then estimate diversity at the level of these partitions and merge the diversity estimates for the AML locus into the coordinate data.frame.
library(gstudio)
data(arapat)
<- genetic_diversity(arapat, stratum="Population", mode="Ae")
diversity <- strata_coordinates(arapat)
coords <- merge( coords, diversity[ diversity$Locus == "AML", ] ) coords
Then, I can crate a quick leaflet
map and show the diversity at the site by the size of the location marker. See ?sec-spatial-mapping for more methods in depicting the spatial location of your populations and summaries of genetic parameters.
library( leaflet )
library( tidyverse )
|>
coords mutate( radius = 1.5*Ae ) |>
leaflet() |>
addCircleMarkers(~Longitude, ~Latitude, radius=~radius) |>
addProviderTiles("OpenTopoMap")