4  Allele Frequencies & HWE

This chapter covers estimating allele and genotype frequencies, testing for Hardy-Weinberg equilibrium (HWE), rarefaction analysis, and estimating sampling effort.

library(gstudio)
data(arapat)

4.1 Allele Frequencies

The frequencies() function is an S3 generic that works on both individual locus vectors and entire data.frames.

4.1.1 Single Locus

frequencies(arapat$LTRS)
  Allele Frequency
1     01  0.523416
2     02  0.476584
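
As a rough illustration of the arithmetic behind this table, the base R sketch below tallies allele frequencies from diploid genotypes written as colon-separated strings. The toy genotype vector and the helper allele_freqs() are hypothetical and are not part of gstudio; frequencies() itself works on gstudio locus objects and may handle edge cases differently.

```r
# Hypothetical toy genotypes; each string holds two alleles separated by ":"
genos <- c("01:01", "01:02", "02:02", "01:02", NA, "01:01")

# Illustrative helper (not a gstudio function): split genotypes into
# alleles, drop missing genotypes, and convert counts to proportions
allele_freqs <- function(g) {
  alleles <- unlist(strsplit(g[!is.na(g)], ":", fixed = TRUE))
  counts <- table(alleles)
  data.frame(Allele = names(counts),
             Frequency = as.numeric(counts) / sum(counts))
}

toy_freqs <- allele_freqs(genos)
toy_freqs
```

Each non-missing genotype contributes two allele copies, so the denominator is twice the number of genotyped individuals.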

4.1.2 Entire Data Frame

When applied to a data.frame, frequencies() returns frequencies for all locus columns:

freqs <- frequencies(arapat)
head(freqs, 12)
   Locus Allele   Frequency
1   LTRS     01 0.523415978
2   LTRS     02 0.476584022
3    WNT     01 0.357954545
4    WNT     02 0.181818182
5    WNT     03 0.430397727
6    WNT     04 0.026988636
7    WNT     05 0.002840909
8     EN     01 0.715277778
9     EN     02 0.180555556
10    EN     03 0.080555556
11    EN     04 0.018055556
12    EN     05 0.005555556

4.1.3 By Stratum

Add a stratum argument to partition frequencies by a grouping column (here, Species):

freqs_pop <- frequencies(arapat, stratum = "Species")
head(freqs_pop, 12)
   Stratum Locus Allele  Frequency
1     Cape  LTRS     01 0.08000000
2     Cape  LTRS     02 0.92000000
3     Cape   WNT     01 0.13513514
4     Cape   WNT     02 0.86486486
5     Cape    EN     01 0.30000000
6     Cape    EN     02 0.70000000
7     Cape    EF     01 0.96666667
8     Cape    EF     02 0.03333333
9     Cape   ZMP     02 1.00000000
10    Cape   AML     03 0.09154930
11    Cape   AML     04 0.60563380
12    Cape   AML     05 0.30281690

4.1.4 Selecting Specific Loci

Pass a loci argument to restrict the output to the named loci:

freqs_sub <- frequencies(arapat, loci = c("LTRS", "WNT"))
freqs_sub
  Locus Allele   Frequency
1  LTRS     01 0.523415978
2  LTRS     02 0.476584022
3   WNT     01 0.357954545
4   WNT     02 0.181818182
5   WNT     03 0.430397727
6   WNT     04 0.026988636
7   WNT     05 0.002840909

4.2 Frequency Matrix

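The frequency_matrix() function produces a wide-format matrix of allele frequencies, with one row per stratum and one column per locus-allele combination. Alleles absent from a stratum get a frequency of 0: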

fm <- frequency_matrix(arapat, stratum = "Species")
fm[, 1:8]
    Stratum      AML-01      AML-02    AML-03    AML-04     AML-05     AML-06
1      Cape 0.000000000 0.000000000 0.0915493 0.6056338 0.30281690 0.00000000
2  Mainland 0.000000000 0.000000000 0.0000000 0.0000000 0.00000000 0.00000000
3 Peninsula 0.002016129 0.002016129 0.0000000 0.0000000 0.01008065 0.09879032
     AML-07
1 0.0000000
2 0.0000000
3 0.3225806
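
The long-to-wide reshaping can be sketched in base R with xtabs(). The sketch below is not how frequency_matrix() is implemented; it only mimics the layout, using a handful of AML values copied from the outputs in this chapter. The object names long and wide are placeholders.

```r
# Long-format AML frequencies copied from the outputs shown in this chapter
long <- data.frame(
  Stratum   = c("Cape", "Cape", "Cape", "Peninsula", "Peninsula"),
  Locus     = "AML",
  Allele    = c("03", "04", "05", "05", "06"),
  Frequency = c(0.0915493, 0.6056338, 0.3028169, 0.01008065, 0.09879032)
)

# Build "Locus-Allele" column labels, then cross-tabulate; stratum/allele
# combinations with no row in `long` are filled with 0
long$Key <- paste(long$Locus, long$Allele, sep = "-")
wide <- xtabs(Frequency ~ Stratum + Key, data = long)
wide
```

The zero-filled cells are what make the wide layout convenient for comparing strata: every stratum gets a value for every allele, including alleles it lacks.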

4.3 Genotype Counts and Frequencies

4.3.1 Genotype Counts per Stratum

The genotype_counts() function summarizes non-missing genotype counts per locus, optionally by stratum:

genotype_counts(arapat)
    Stratum   N LTRS WNT  EN  EF ZMP AML ATPS MP20
ALL     ALL 363  363 352 360 361 330 340  363  358
genotype_counts(arapat, stratum = "Species")
            Stratum   N LTRS WNT  EN  EF ZMP AML ATPS MP20
Cape           Cape  75   75  74  75  75  70  71   75   75
Mainland   Mainland  36   36  30  35  34  20  21   36   33
Peninsula Peninsula 252  252 248 250 252 240 248  252  250

4.3.2 Genotype Frequencies

The genotype_frequencies() function takes a locus vector and returns the observed genotype counts alongside the counts expected under Hardy-Weinberg proportions:

genotype_frequencies(arapat$LTRS)
  Genotype Observed  Expected
1    01:01      147  99.44904
2    01:02       86 181.10193
3    02:02      130  82.44904
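
The Expected column can be reproduced by hand from the Observed column. The base R sketch below assumes the expectations are plain Hardy-Weinberg proportions (p^2, 2pq, q^2) scaled by the number of genotyped individuals. The variable names are placeholders.

```r
# Observed genotype counts at LTRS, copied from the table above
obs <- c("01:01" = 147, "01:02" = 86, "02:02" = 130)
n <- sum(obs)  # 363 genotyped individuals

# Allele frequency of 01: homozygotes carry two copies, heterozygotes one
p <- unname((2 * obs["01:01"] + obs["01:02"]) / (2 * n))
q <- 1 - p

# Expected counts under Hardy-Weinberg proportions p^2, 2pq, q^2
expected <- n * c("01:01" = p^2, "01:02" = 2 * p * q, "02:02" = q^2)
round(expected, 5)
```

The heterozygote count (86) is less than half of its expectation (about 181), which is the kind of deficit the test in the next section is designed to detect.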

4.4 Hardy-Weinberg Equilibrium

The hwe() function tests each locus for HWE using a chi-square approximation. The example below restricts the data to the Mainland samples:

mainland <- arapat[arapat$Species == "Mainland", ]
hwe(mainland)
Warning in genotype_frequencies(x[[locus]]): Some genotype expectations are <
5, a continuity correction should be applied.  See ?hwe
Warning in hwe(mainland): Under 50 samples for LTRS this may not be a good
approximation.
Warning in hwe(mainland): Fewer than 5 genotypes expected at LTRS consider
collapsing alleles.
Warning in genotype_frequencies(x[[locus]]): Some genotype expectations are <
5, a continuity correction should be applied.  See ?hwe
Warning in hwe(mainland): Under 50 samples for WNT this may not be a good
approximation.
Warning in hwe(mainland): Fewer than 5 genotypes expected at WNT consider
collapsing alleles.
Warning in genotype_frequencies(x[[locus]]): Some genotype expectations are <
5, a continuity correction should be applied.  See ?hwe
Warning in hwe(mainland): Under 50 samples for EN this may not be a good
approximation.
Warning in hwe(mainland): Fewer than 5 genotypes expected at EN consider
collapsing alleles.
Warning in genotype_frequencies(x[[locus]]): Some genotype expectations are <
5, a continuity correction should be applied.  See ?hwe
Warning in hwe(mainland): Under 50 samples for EF this may not be a good
approximation.
Warning in hwe(mainland): Fewer than 5 genotypes expected at EF consider
collapsing alleles.
Warning in genotype_frequencies(x[[locus]]): Some genotype expectations are <
5, a continuity correction should be applied.  See ?hwe
Warning in hwe(mainland): Under 50 samples for ZMP this may not be a good
approximation.
Warning in hwe(mainland): Fewer than 5 genotypes expected at ZMP consider
collapsing alleles.
Warning in genotype_frequencies(x[[locus]]): Some genotype expectations are <
5, a continuity correction should be applied.  See ?hwe
Warning in hwe(mainland): Under 50 samples for AML this may not be a good
approximation.
Warning in hwe(mainland): Fewer than 5 genotypes expected at AML consider
collapsing alleles.
Warning in genotype_frequencies(x[[locus]]): Some genotype expectations are <
5, a continuity correction should be applied.  See ?hwe
Warning in hwe(mainland): Under 50 samples for ATPS this may not be a good
approximation.
Warning in hwe(mainland): Fewer than 5 genotypes expected at ATPS consider
collapsing alleles.
Warning in genotype_frequencies(x[[locus]]): Some genotype expectations are <
5, a continuity correction should be applied.  See ?hwe
Warning in hwe(mainland): Under 50 samples for MP20 this may not be a good
approximation.
Warning in hwe(mainland): Fewer than 5 genotypes expected at MP20 consider
collapsing alleles.
  Locus         Chi df         Prob
1  LTRS   0.9082806  1 3.405710e-01
2   WNT  29.2542373  1 6.347726e-08
3    EN  19.7817078  3 1.883729e-04
4    EF   0.3545808  1 5.515314e-01
5   ZMP   5.0927978  1 2.402540e-02
6   AML  36.8802692  6 1.858093e-06
7  ATPS  43.9698647  6 7.494505e-08
8  MP20 161.5565210 45 4.773959e-15

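The output includes the chi-square statistic (Chi), degrees of freedom (df), and p-value (Prob) for each locus. Small p-values (conventionally below 0.05) indicate departures from HWE expectations. The warnings matter here: the Mainland subset has fewer than 50 samples and some expected genotype counts fall below 5, so the chi-square approximation may be unreliable and these p-values should be read with caution.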
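
For intuition, the statistic can be sketched by hand in base R. The example uses the whole-sample LTRS counts from Section 4.3.2 rather than the Mainland subset, so its result is not comparable to the LTRS row above. It assumes a plain Pearson chi-square with no continuity correction and df = k(k - 1)/2 for k alleles; the df column above is consistent with that formula, but whether hwe() computes the statistic exactly this way is an assumption.

```r
# Whole-sample LTRS genotype counts (01:01, 01:02, 02:02) from Section 4.3.2
obs <- c(147, 86, 130)
n <- sum(obs)

# Expected counts under Hardy-Weinberg proportions
p <- (2 * obs[1] + obs[2]) / (2 * n)
expected <- n * c(p^2, 2 * p * (1 - p), (1 - p)^2)

# Pearson chi-square statistic, without continuity correction
chi <- sum((obs - expected)^2 / expected)

# Genotype classes minus alleles: k(k + 1)/2 - k = k(k - 1)/2
k <- 2
df <- k * (k - 1) / 2

prob <- pchisq(chi, df = df, lower.tail = FALSE)
c(Chi = chi, df = df, Prob = prob)
```

With 363 samples and all expected counts well above 5, none of the conditions flagged in the warnings above applies to this whole-sample calculation.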

4.4.1 Interpreting Results

Departures from HWE can arise from:

  • Non-random mating (inbreeding, assortative mating)
  • Population substructure (Wahlund effect)
  • Selection at or near the marker locus
  • Small population size (genetic drift)
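
The substructure case can be illustrated with a few lines of arithmetic. The sketch below uses two hypothetical, equally sized subpopulations that are each in HWE but differ in allele frequency; the frequencies 0.1 and 0.9 are invented for illustration and are not taken from arapat.

```r
# Hypothetical frequencies of one allele in two equally sized
# subpopulations, each assumed to be in HWE on its own
p_sub <- c(0.1, 0.9)

# Heterozygote frequency actually present when the two are pooled
het_obs <- mean(2 * p_sub * (1 - p_sub))

# Heterozygote frequency expected if the pooled sample were a single
# randomly mating population
p_pool <- mean(p_sub)
het_exp <- 2 * p_pool * (1 - p_pool)

c(observed = het_obs, expected = het_exp)
```

Pooling produces a heterozygote deficit even though neither subpopulation departs from HWE, which is why a significant test on pooled strata does not by itself imply non-random mating.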

4.5 Rarefaction

The rarefaction() function subsamples the data repeatedly at a smaller sample size to estimate the distribution of a diversity statistic:

rare_vals <- rarefaction(arapat$LTRS, mode = "Ae",
                         size = 20, nperm = 199)
hist(rare_vals, main = "Rarefied Ae at LTRS",
     xlab = "Effective Alleles", col = "steelblue")

This is useful for comparing diversity across populations with unequal sample sizes, since larger samples tend to capture more alleles and would otherwise appear more diverse.
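
The resampling idea can be sketched in base R without gstudio. The sketch assumes that the effective number of alleles is Ae = 1 / sum(p_i^2) and that subsamples are drawn without replacement; both are assumptions about rarefaction(), not a description of its internals. The genotype vector is rebuilt from the LTRS counts in Section 4.3.2, and ae() and rare_sketch are placeholder names.

```r
set.seed(42)  # arbitrary seed so the sketch is reproducible

# Rebuild a genotype vector from the LTRS counts in Section 4.3.2
genos <- rep(c("01:01", "01:02", "02:02"), times = c(147, 86, 130))

# Effective number of alleles, assuming Ae = 1 / sum(p_i^2)
ae <- function(g) {
  p <- prop.table(table(unlist(strsplit(g, ":", fixed = TRUE))))
  1 / sum(p^2)
}

# Draw 199 subsamples of 20 genotypes each, without replacement
rare_sketch <- replicate(199, ae(sample(genos, size = 20)))
summary(rare_sketch)
```

Because every stratum would be subsampled to the same size, differences in the resulting distributions reflect allele frequencies rather than sampling effort.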

4.6 Optimal Sampling

The optimal_sampling() function estimates how many individuals are needed to capture a given proportion of allelic diversity:

optimal_sampling(arapat$LTRS, nrep = 99)