An Introduction to the gstudio Package

Package Overview

The gstudio package is a package created to make the inclusion of marker based population genetic data in the R workflow easy. An underlying motivation for this package is to provide a link between spatial analysis and graphing packages such that the user can be quickly and easily manipulate data in exploratory ways that aid in gaining biological inferences.

Installing the package

This package requires several other packages for installation. By default, the install should be easily accomplished using the build-in functionalities in R.

install.packages("gstudio")

Occasionally, you should look to see if there are updates to the package by doing the following (this will update all packages you have installed)

update.packages(ask = FALSE)

If you want the most recent version of this package, I make the development builds available on Github (http://github.com/dyerlab/). You can install directly from within R as:

install.packages(devtools)
require(devtools)
install_github("gstudio", "dyerlab", ref = "develop")

I recommend using the latest version, it has a lot of the newer features and I do not check it into github until it has been tested. I only post to CRAN when major versions change.

Loading the Package

And any time you need to use the package, you would just pull it into your session.

require(gstudio)

This should get you everything you need. The gstudio package does contain a lot of build-in documentation including a lot of examples. All the functions and examples associated with them can be found in the build-in documents available from any of the following (n.b., the vignette is a simple placeholder with links to the website.)

help(package = "gstudio")
vignette("gstudio")
system("open http://dyerlab.github.io/gstudio/")

At any point if you have any questions about the values or options for a particular function in gstudio or any other package, you can use the help.search(func) or ? functionality. This file is also kept in sync with the development of the gstudio package (it is in the source package that you downloaded from CRAN) and will serve as a tutorial for your use of this package. If there are any questions that you may have regarding this package, feel free to contact Rodney Dyer and I will get back to you as soon as possible.

Genetic Data

The overriding philosophy behind the gstudio package is to make it as easy as possible to create, load, use, and integrate, genetic marker data into your analysis workflow. As such, we typically use data.frame objects to hold our data and the addition of the locus class as a fundamental data type allows us to continue to do so.

x <- locus(c(1, 2))
class(x)

## [1] "locus"

## [1] "1:2"

You can think of a locus object as a vector of alleles. There are several options you can use when constructing a locus object based upon what kind of marker data you are using. These options are passed through the type option to the locus() function. Here are the current options.

type Option | Function ————- | ——— | This is the default value (e.g., nothing passed). It will treat the values passed to locus as alleles to a single locus snp | Alleles are ‘0’, ‘1’, or ‘2’ indicating the number of minor alleles. zyme | Genotypes are encoded as “12” like zymes (e.g., “1” & “2” alleles together). separated | Alleles are already separated by “:” character (for putting in polyploid data) column | Alleles for a single locus are in two separate columns.

Here is some examples.

loc0 <- locus()
loc1 <- locus(1)
loc2 <- locus(c("C", "A"), phased = TRUE)
loc3 <- locus(c("C", "A"))
loc4 <- locus(12, type = "zyme")
loc5 <- locus(1:4)
loc6 <- locus("A:C:C:G", type = "separated")
loci <- c(loc0, loc1, loc2, loc3, loc4, loc5, loc6)
loci

## [1] NA        "1"       "C:A"     "A:C"     "1:2"     "1:2:3:4" "A:C:C:G"

Notice how the printing of each locus object uses the colon character to separate alleles. Also, since the locus object is a basic data type, it can be used in other data structures.

df <- data.frame(ID = 0:6, Loci = loci)
df

##   ID    Loci
## 1  0        
## 2  1       1
## 3  2     C:A
## 4  3     A:C
## 5  4     1:2
## 6  5 1:2:3:4
## 7  6 A:C:C:G

And you can use normal vector processing of the locus vector to do normal R data like operations.

is.na(df$Loci)

## [1]  TRUE FALSE FALSE FALSE FALSE FALSE FALSE

is_heterozygote(df$Loci)

## [1] FALSE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE

Importing Data

Importing data is not that difficult. People tend to keep their data in either spreadsheets or text files, which are easily accessible via R, or some archane program format (wave to Arlequin everyone), which may be less accessable. As of version 1.1, gstudio handles data both in raw format (e.g., as they appear in your spreadsheet) or in GENEPOP format. All of these formats are accessed in gstudio using the function, read_population() that is a mix between the traditional read.table() function and the locus() functions.

Import from GENEPOP files

The GENEPOP file format, as interpreted by gstudio has the following format.

It is a text file and preferrably space delimited.

The first line has some info for you, it is mostly ignored but should include a bit about what the data are.
The next $K$ lines of the file list the names of the loci to be used, one per line.
The rest of the file contains populations. Each population starts with “Pop” alone on its own line. All individuals are assumed to be in the same population until such time the next “Pop” is reached.
For each individual, there is an identification value first, and followed by a comma are the genotypes.
All genotypes are encoded using 3 digits for each allele (e.g., a 3:5 genotype is 003005). Missing data is all zeros ‘000000’ and haploid is just three digits.

The import from read_population() names loci, adds “ID” for the identification column and adds to each population a “Pop-X” designation. Other than that, it is identical to what is below. However, if you have your data in a spreadsheet, there is no need to shove it into a genepop format to import into R.

Text File Import

Options for this function include:

Option	Function
path	The full path to the text file
type	The locus type (see above).
locus.columns	A vector of numbers of columns to be treated as `Locus()` objects
sep	The character used to separate columns (‘,’ is default)
header	Columns have header names (e.g., locus names, etc.)

Here are some examples of data files with different kinds of genetic data, each of which exercises the read_population() function in a different way. Hopefully this covers the main types of data being imported, if not, drop me an email Rodney Dyer. Missing genotypes should be missing data or encoded as NA. If you do not have a genotype then leave it blank. There is no reason to use negative numbers or other conventions.

Columns of genotypes are indicated by the required parameter locus.columns so that read_population() knows which columns to treat as locus objects and which to leave as normal data for R. Without this parameter, the data will be read in as character or numeric.

There are some example data files included in the project for you to look at. Depending upon how your computer is set up, they may be placed in different locations. Here is a quick way to find out where the installed folder is for the gstudio package and the location of the ‘data’ folder within it.

system.file("extdata", package = "gstudio")

## [1] "/Library/Frameworks/R.framework/Versions/3.1/Resources/library/gstudio/extdata"

Two Column Data

Here is an example of data where the genotypes are encoded as two columns of data in a csv file.

file <- system.file("extdata", "data_2_column.csv", package = "gstudio")
data <- read_population(file, type = "column", locus.columns = 4:7)
data

##   Population Latitude Longitude  Loc1  Loc2
## 1    Olympia    47.15   -122.89 12:14 15:15
## 2 Bellingham    48.75   -122.49 12:12 15:17
## 3  St. Louis    38.81    -89.98 10:12      
## 4       Ames    42.26    -93.47 10:14      
## 5   Richmond    37.74    -77.16 10:10 17:17

Phased Data

There are times when the gametic phase of the genotypes is important. By default, gstudio will keep alleles sorted in alpha/numeric order. If you need to keep this from happening, pass the optional phased option to read_population(). Notice the differences between this and the previous genotypes.

file <- system.file("extdata", "data_2_column.csv", package = "gstudio")
data <- read_population(file, type = "column", locus.columns = 4:7, phased = TRUE)
data

##   Population Latitude Longitude  Loc1  Loc2
## 1    Olympia    47.15   -122.89 14:12 15:15
## 2 Bellingham    48.75   -122.49 12:12 17:15
## 3  St. Louis    38.81    -89.98 12:10      
## 4       Ames    42.26    -93.47 10:14      
## 5   Richmond    37.74    -77.16 10:10 17:17

AFLP-like data

Genotypes that are ‘aflp’-like are encoded as binary characters (e.g., 0/1) indicating the presence or absence of a particular band.

file <- system.file("extdata", "data_aflp.csv", package = "gstudio")
data <- read_population(file, type = "aflp", locus.columns = c(4, 5))
data

##   Population Latitude Longitude Loc1 Loc2
## 1    Olympia    47.15   -122.89    1    1
## 2 Bellingham    48.75   -122.49    0    0
## 3  St. Louis    38.81    -89.98    1    0
## 4       Ames    42.26    -93.47    0    1
## 5   Richmond    37.74    -77.16    0    0

SNP Minor Allele Data

At times, SNP data is encoded in relation to the number of minor alleles. You can import these data using the type=“snp” option and it will encode them as ‘AA’, ‘AB’, or ‘BB’ with the ‘B’ allele as the minor one.

file <- system.file("extdata", "data_snp.csv", package = "gstudio")
data <- read_population(file, type = "snp", locus.columns = 4:7)
data

##   Population Latitude Longitude snp1 snp2 snp3 snp4
## 1    Olympia    47.15   -122.89  A:B  B:B  A:B  A:B
## 2 Bellingham    48.75   -122.49  A:B  A:A  A:B  B:B
## 3  St. Louis    38.81    -89.98  A:B  A:A          
## 4       Ames    42.26    -93.47  A:A  A:B          
## 5   Richmond    37.74    -77.16  A:A  B:B  A:B  B:B

Zyme-Like Data

Some data is encoded as allozyme genotypes (e.g., 33, 35, 55 for diploid individuals with alleles ‘3’ and ‘5’).

file <- system.file("extdata", "data_zymelike.csv", package = "gstudio")
data <- read_population(file, type = "zyme", locus.columns = 4:7)
data

##   Population Latitude Longitude Loc1 Loc2 Loc3 Loc4
## 1    Olympia    47.15   -122.89  1:2  1:4  1:5  1:5
## 2 Bellingham    48.75   -122.49  1:2  1:2  1:5  1:7
## 3  St. Louis    38.81    -89.98  1:2  1:3          
## 4       Ames    42.26    -93.47  1:1  1:4          
## 5   Richmond    37.74    -77.16  1:2  1:1  2:2  2:2

Pre-Separated Data For Higher Ploidy

file <- system.file("extdata", "data_separated.csv", package = "gstudio")
data <- read_population(file, type = "separated", locus.columns = c(4, 5))
data

##   Population Latitude Longitude  Loc1  Loc2
## 1    Olympia    47.15   -122.89 12:14 15:15
## 2 Bellingham    48.75   -122.49 12:12 15:17
## 3  St. Louis    38.81    -89.98 12:10      
## 4       Ames    42.26    -93.47 10:14      
## 5   Richmond    37.74    -77.16 10:10 17:17

Saving Data

There are several ways to export your data to file.

Raw R objects

Saving data once it is in R is trivial and you do it as you would for any other R object. The R object system knows how to serialize its own data using the load() and save() functions.

save(df, file = "MyData.rda")

To load the objects back into the work space, you just do:

load("MyData.rda")

And you can verify that you have data in your work space by listing it.

ls()

##  [1] "data"      "df"        "file"      "loc0"      "loc1"     
##  [6] "loc2"      "loc3"      "loc4"      "loc5"      "loc6"     
## [11] "loci"      "metadata"  "tidy.opts" "x"

Saving as Text

As a default, the function write_population() will write your data file as a comma separated text file with the loci encoded as column separated (see type="separated" above).

write_population(df, file = "~/Desktop/MyData.csv")

Saving as GENEPOP

Raw data can be saved in GENEPOP formats by passing an optional argument mode="genepop" to the write_population() function.

write_population(df, file = "~/Desktop/MyData.txt", mode = "genepop")

Saving for STRUCTURE

Raw data can also be exported for analysis using the program STRUCTURE. Here the optional argument is type="structure"

write_population(df, file = "~/Desktop/MyData.str", mode = "structure")

Data sets in the Package

The main genetic data included with the package if from the Sonoran desert bark beetle, Araptus attenuatus from the Dyer laboratory. You can load it into your work space by:

data(arapat)
summary(arapat)

##       Species      Cluster      Population        ID         Latitude   
##  Cape     : 75   CBP-C :150   32     : 19   101_10A:  1   Min.   :23.1  
##  Mainland : 36   NBP-C : 84   75     : 11   101_1A :  1   1st Qu.:24.6  
##  Peninsula:252   SBP-C : 18   Const  : 11   101_2A :  1   Median :26.2  
##                  SCBP-A: 75   12     : 10   101_3A :  1   Mean   :26.3  
##                  SON-B : 36   153    : 10   101_4A :  1   3rd Qu.:27.5  
##                               157    : 10   101_5A :  1   Max.   :29.3  
##                               (Other):292   (Other):357                 
##    Longitude        LTRS          WNT            EN           EF     
##  Min.   :-114   01:01 :147   03:03  :108   01:01  :225   01:01 :219  
##  1st Qu.:-113   01:02 : 86   01:01  : 82   01:02  : 52   01:02 : 52  
##  Median :-112   02:02 :130   01:03  : 77   02:02  : 38   02:02 : 90  
##  Mean   :-112                02:02  : 62   03:03  : 22   NA's  :  2  
##  3rd Qu.:-110                03:04  :  8   01:03  :  7               
##  Max.   :-109                (Other): 15   (Other): 16               
##                              NA's   : 11   NA's   :  3               
##      ZMP           AML           ATPS          MP20    
##  01:01 : 46   08:08  : 51   05:05  :155   05:07  : 64  
##  01:02 : 51   07:07  : 42   03:03  : 69   07:07  : 53  
##  02:02 :233   07:08  : 42   09:09  : 66   18:18  : 52  
##  NA's  : 33   04:04  : 41   02:02  : 30   05:05  : 48  
##               07:09  : 22   07:09  : 14   05:06  : 22  
##               (Other):142   08:08  :  9   (Other):119  
##               NA's   : 23   (Other): 20   NA's   :  5

You can see several things here. First, the locus data are displayed by genotype counts including NA values where there was missing data. A column of type locus is just like any other kind of variable and can be used as such. This opens up a lot of functionality for you to be able to treat marker data just like everything else in R.

Convenience Functions

Dealing easily with parts of your data is a critical skill and a huge benefit in using a grammar like R to do your analyses. In R, data.frame objects are almost like little databases and you can do some really creative manipulations with them. The gstudio package provides a few things that may help you work more efficiently with your data.

Data Classes

Often it is important to know which columns of a data set are actually of a particular data type. Here is a simple function that tells you either the name or index of columns in a data.frame that of a specific data type.

column_class(arapat, class = "locus")

## [1] "LTRS" "WNT"  "EN"   "EF"   "ZMP"  "AML"  "ATPS" "MP20"

column_class(arapat, class = "locus", mode = "index")

## [1]  7  8  9 10 11 12 13 14

–

Partitioning

The partition() function takes a data frame and returns a list of data frames, partitioned by the stratum you pass to it. This is really nice if you are doing a nested analysis of sorts and want to work with subsets of your data that are defined by a categorical factor variable.

names(arapat)

##  [1] "Species"    "Cluster"    "Population" "ID"         "Latitude"  
##  [6] "Longitude"  "LTRS"       "WNT"        "EN"         "EF"        
## [11] "ZMP"        "AML"        "ATPS"       "MP20"

clades <- partition(arapat, stratum = "Species")
names(clades)

## [1] "Cape"      "Mainland"  "Peninsula"

This kind of partitioning is very common in the analysis of spatial genetic structure and as such should be as simple as possible to provide the most flexibility to you, the analyst. One of the common analysis patterns that you will come back to over and over again is to partition the entire data set and perform operations on each of the subgroups. In R this is a pretty easy process if you look into the lapply() function (and its relatives). This is such an important component, that I’m going to spend a little time here to make sure you understand what I am doing. Once you get it, it will make you life tremendously more awesome!

The basic form of the various ‘apply’ functions is that you pass them some data and a function on which it will take each part of that data and apply it. For lists (the l part in lapply()), the function will take each entry in the list and pass it along to the function. The function itself can be one that is already available (like length() or is.na() or something) or it can be something you specify directly, on the fly. Here is an example looking at the number of samples in each ‘Species’ as partitioned above.

lapply(clades, dim)

## $Cape
## [1] 75 14
## 
## $Mainland
## [1] 36 14
## 
## $Peninsula
## [1] 252  14

Here the dim() function returns the dimension of the data.frame in each clade, all have the same number of columns, but differ in the number of individuals (the rows). While this is a stupid example (you could get the same thing from dim(arapat$Species) but it shows the general pattern.

Where these functions come in handy is in creating your own functions. These can be small functions that do something specific and can be shoved into the lapply() function directly. This gives a lot of leeway in what you can do in R and is a very powerful way of understanding data. Here is an example, say we want to know the number of individuals of each “Species” found in each population (some are in sympatry in the Cape Region of Baja California). The general approach will be:

Partition on population

Make a table of Species counts for each population.
Take resulting list and turn it into a matrix (use the unlist() function) with three columns (the byrow=TRUE is key here)
Add row and column labels

pops <- partition(arapat, stratum = "Population")
counts <- lapply(pops, function(x) {
    return(table(x$Species))
})
m <- matrix(unlist(counts), ncol = 3, byrow = TRUE)
rownames(m) <- names(pops)
colnames(m) <- levels(arapat$Species)
m[1:10, ]

##     Cape Mainland Peninsula
## 101    0        9         0
## 102    0        8         0
## 12     0        0        10
## 153    0        0        10
## 156    6        0         0
## 157    8        0         2
## 159    0        0         9
## 160    0        0        10
## 161    0        0        10
## 162    0        0        10

(only the first 10 rows are displayed). You will see this kind of use of the lapply() throughout the rest of this document.

Plotting Populations

One of the key benefits to using an analysis environment such as R is that you can mash together functionality that you just can’t get from a monolithic program. An example of this is plotting populations. If your data has spatial coordinates in them then you can use this to plot the location of your sites on a GoogleMap tile. By default, you must have your coordinates in decimal degree format. This is the default for the GoogleMaps API. Moreover, if you name the columns of data “Longitude” and “Latitude” much of the spatial functionality in gstudio will be more transparent (if not, you have to specify the Longitude and Latitude names each time you use a function that needs them).

require(ggplot2)
plot(arapat)

plot of chunk unnamed-chunk-28

There are other options you can pass to the plot function, see “?plot.data.frame” for more information. If you need to get more fancy than just a plot, you can do the plotting manually. Lets say I want to plot the different Species in the data set as different colors. The following are the steps I would use to create some plots coloring different clusters.

Find coordinates for each population and keep only the unique ones.

Determine the centroid of this area.
Request a map tile from Google (this needs internet connections obviously)
Turn the tile into a data.frame
Create a plot (I favor ggplot2 as a plotting paradigm)

require(ggmap)
data <- unique(arapat[, c(2, 3, 6, 5)])
centroid <- apply(data[, 3:4], 2, mean)
centroid

## Longitude  Latitude 
##   -111.40     25.66

map <- population_map(data)
ggmap(map) + geom_point(aes(x = Longitude, y = Latitude, color = Cluster), data = data, 
    size = 4)

Allele Frequencies

Another object made easily in gstudio are objects related to allele frequencies. Allele frequencies are just like every other kind of data and can be extracted from a data.frame containing locus objects using the function frequencies().

Single Locus

Grabbing allele frequencies is a fundamental task for any population genetic analysis and should be as easy as possible. Here are some examples of various ways to get allele frequency information using the Araptus attenuatus data set.

freq.EF <- frequencies(arapat$EF)
class(freq.EF)

## [1] "data.frame"

freq.EF

##   Allele Frequency
## 1     01    0.6787
## 2     02    0.3213

Multilocus

The conversion of loci to a data.frame expands beyond the single locus. If you do not specify which locus to use, it will use all locus objects and add an additional column to the data frame (n.b., I only print out the first 10 rows to give you the idea).

freqs.loci <- frequencies(arapat)
freqs.loci[1:10, ]

##    Locus Allele Frequency
## 1   LTRS     01  0.523416
## 2   LTRS     02  0.476584
## 3    WNT     01  0.357955
## 4    WNT     02  0.181818
## 5    WNT     03  0.430398
## 6    WNT     04  0.026989
## 7    WNT     05  0.002841
## 8     EN     01  0.715278
## 9     EN     02  0.180556
## 10    EN     03  0.080556

Substrata and Allele Frequencies

To complete the symmetry here, adding stratum to the analysis, provides yet another categorical variable upon which allele frequencies may be estimated. Here is an example looking at the “Cluster” strata in the data set and a partial printout of the results.

freqs.strata <- frequencies(arapat, stratum = "Cluster")
freqs.strata[1:10, ]

##    Stratum Locus Allele Frequency
## 1    CBP-C  LTRS     01  0.630000
## 2    CBP-C  LTRS     02  0.370000
## 3    CBP-C   WNT     01  0.287162
## 4    CBP-C   WNT     03  0.641892
## 5    CBP-C   WNT     04  0.064189
## 6    CBP-C   WNT     05  0.006757
## 7    CBP-C    EN     01  0.983108
## 8    CBP-C    EN     03  0.003378
## 9    CBP-C    EN     04  0.013514
## 10   CBP-C    EF     01  0.290000

Plotting Allele Frequencies

There are several ways you may want to graphically view the locus data and for convenience, the gstudio package provides some interfaces for nice plots using the ggplot2 package.

Plotting a vector of loci will by default estimate the frequencies of each allele for graphical output. There are two different output for this (n.b., a pie chart by its nature can lead to inaccurate interpretations and most statisticians hate them).

plot(arapat$MP20)

plot of chunk unnamed-chunk-29

plot(arapat$MP20, mode = "pie")

plot of chunk unnamed-chunk-29

You can also use the ggplot2 routine geom_locus() to plot the frequencies:

ggplot() + geom_locus(aes(x = MP20, fill = Cluster), data = arapat)

plot of chunk unnamed-chunk-30

The frequencies across a collection of loci can easily be plot just as well (internally, this simple plot is just turns the object into a data frame and then plots it). At times, examination of allele spectra can reveal blatant differences in substratum of your data. For example, consider the following spectrum for the locus MP20.

f <- freqs.strata[freqs.strata$Locus %in% c("MP20", "AML"), ]
summary(f)

##    Stratum             Locus              Allele            Frequency     
##  Length:52          Length:52          Length:52          Min.   :0.0034  
##  Class :character   Class :character   Class :character   1st Qu.:0.0179  
##  Mode  :character   Mode  :character   Mode  :character   Median :0.0909  
##                                                           Mean   :0.1923  
##                                                           3rd Qu.:0.2861  
##                                                           Max.   :0.9167

ggplot(f) + geom_frequencies(f) + facet_grid(Stratum ~ .) + theme(legend.position = "none")

plot of chunk unnamed-chunk-31

Frequency Gradients

When you have many strata or you are conducting landscape-level analyses, it is often helpful to look at how allele frequencies change in relation to some variable other than stratum.

baja <- arapat[arapat$Species != "Mainland", ]

The EN locus has a few different alleles but if we look at the frequencies of each, the first two dominate.

plot(baja$EN)

plot of chunk unnamed-chunk-33

Using just the first allele 01, it is pretty easy to plot the strata frequency as a function of latitude using normal R approaches. To do this, one needs to:

Extract the 01 allele frequencies by population.

freqs <- frequencies(baja, stratum = "Population", loci = "EN")
freq.01 <- freqs[freqs$Allele == "01", ]

Merge this data.frame with one containing the coordinates of the populations.

coords <- strata_coordinates(baja)
df <- merge(freq.01, coords)
df[1:10, ]

##    Stratum Locus Allele Frequency Longitude Latitude
## 1       12    EN     01    1.0000    -112.7    27.18
## 2      153    EN     01    1.0000    -110.5    24.13
## 3      156    EN     01    0.1667    -110.0    24.04
## 4      157    EN     01    0.4500    -110.1    24.02
## 5      159    EN     01    0.7222    -113.3    27.53
## 6      160    EN     01    1.0000    -112.5    27.40
## 7      161    EN     01    1.0000    -113.0    27.04
## 8      162    EN     01    0.9500    -112.4    27.20
## 9      163    EN     01    0.7000    -111.0    24.21
## 10     164    EN     01    0.8500    -111.5    24.75

Now, you can plot the frequencies as either a linear plot (below you will see how to plot these along environmental gradients).

ggplot(df, aes(x = Latitude, y = Frequency)) + geom_line(linetype = 2) + geom_point(size = 4)

plot of chunk unnamed-chunk-36

This is interesting. Now, just to ‘kick it up a notch’ I’m going to look at the Cluster variable. This is from mtDNA and shows punitively cryptic species. I’m going to remake the plot above but color the points to indicate the presence of the ‘SCBP-A’ clade (perhaps another species). Below I grab add a new column of data to df and then make it all ‘Baja’. Then I figure out which populations have ‘SCBP-A’ individuals in it.

df$Species <- "Baja"
pops.with.scbp <- as.character(unique(baja$Population[baja$Cluster == "SCBP-A"]))
df$Species[df$Stratum %in% pops.with.scbp] <- "Cape"

Then plot it.

ggplot(df, aes(x = Latitude, y = Frequency)) + geom_line(linetype = 2) + geom_point(size = 5, 
    aes(color = Species))

plot of chunk unnamed-chunk-38

I think this leads to some interesting questions about the relationship between potential species differences, where species are gauged by mtDNA, in nuclear allelic diversity.

Spatial Frequency Plots

It is also possible to plot the data in a spatial context. Here is an example of how to mix ggplot() and ggmap() data and I’ll plot the locations as proportional in size to the allele frequency.

map <- population_map(baja)
ggmap(map) + geom_point(aes(x = Longitude, y = Latitude, size = Frequency), 
    data = df)

plot of chunk unnamed-chunk-39

There is also the option to make use of some pie charts. I know, pie charts suck and any statistician will tell you that they should probably not be used because they can be misleading, but here they are. For exploratory data analysis, they can be very insightful at times. Here is the frequency of alleles at the enolase locus in Araptus. Any spatial structuring catch your eye?

pies_on_map(arapat, stratum = "Population", locus = "EN")

plot of chunk unnamed-chunk-40

Note the messages about the approximation. This is because the google maps API has an integer for zoom factor and at times it is not able to get all the points into the field of view using an integer zoom. If this happens to you, you can manually specify the zoom as an optional argument to either function pies_on_map() or population_map(). You also need to be careful with the pies_on_map() function because the way it works is that the background tile is plotted and then I plot the pies ontop of it. If you reshape your plot window outside equal x- and y- coordinates (e.g., make it a non-square figure), the spatial location of the pie charts will move! This is a very frustrating thing but it has to do with the way viewports are overlain onto graphical objects in R and I have absolutely no control over it. So, the best option is to make the plot square, export it as PNG or TIFF or whaterver, then crop as necessary.

Multivariate Analogs for Loci

Genotype data is inherently multivariate. In fact, it is multinomial multivariate senus stricto but we generally ignore that. That being said, we can easily translate raw genotypes into raw multivariate encodings for other statistical analyses. Here is a quick example using a few individuals and the WNT locus.

to_mv(arapat$WNT[1:10])

##        01  03
##  [1,] 0.0 0.0
##  [2,] 0.5 0.5
##  [3,] 0.5 0.5
##  [4,] 1.0 0.0
##  [5,] 0.5 0.5
##  [6,] 1.0 0.0
##  [7,] 1.0 0.0
##  [8,] 1.0 0.0
##  [9,] 0.5 0.5
## [10,] 1.0 0.0

to_mv(arapat$WNT[1:10], drop.allele = TRUE)

##  [1] 0.0 0.5 0.5 1.0 0.5 1.0 1.0 1.0 0.5 1.0

For multiple loci, we can use the same approach. Here is an example of a PCA analysis done on raw genotypes.

x <- to_mv(arapat, drop.allele = TRUE)
fit.pca <- princomp(x, cor = TRUE)
summary(fit.pca)

## Importance of components:
##                        Comp.1  Comp.2  Comp.3  Comp.4  Comp.5  Comp.6
## Standard deviation     2.6638 2.17988 1.81948 1.69809 1.30595 1.25288
## Proportion of Variance 0.1419 0.09504 0.06621 0.05767 0.03411 0.03139
## Cumulative Proportion  0.1419 0.23695 0.30316 0.36083 0.39494 0.42634
##                         Comp.7  Comp.8 Comp.9 Comp.10 Comp.11 Comp.12
## Standard deviation     1.21993 1.20493 1.1875 1.14296 1.11209 1.09679
## Proportion of Variance 0.02976 0.02904 0.0282 0.02613 0.02473 0.02406
## Cumulative Proportion  0.45610 0.48514 0.5133 0.53947 0.56420 0.58826
##                        Comp.13 Comp.14 Comp.15 Comp.16 Comp.17 Comp.18
## Standard deviation     1.08379  1.0606 1.03894 1.02449 1.02114  1.0099
## Proportion of Variance 0.02349  0.0225 0.02159 0.02099 0.02085  0.0204
## Cumulative Proportion  0.61175  0.6342 0.65584 0.67683 0.69768  0.7181
##                        Comp.19 Comp.20 Comp.21 Comp.22 Comp.23 Comp.24
## Standard deviation     0.99798 0.99608 0.95870 0.93686  0.9164 0.91262
## Proportion of Variance 0.01992 0.01984 0.01838 0.01755  0.0168 0.01666
## Cumulative Proportion  0.73800 0.75784 0.77623 0.79378  0.8106 0.82723
##                        Comp.25 Comp.26 Comp.27 Comp.28 Comp.29 Comp.30
## Standard deviation     0.89069 0.86952 0.86619 0.81903 0.79160 0.77847
## Proportion of Variance 0.01587 0.01512 0.01501 0.01342 0.01253 0.01212
## Cumulative Proportion  0.84310 0.85822 0.87323 0.88664 0.89917 0.91130
##                        Comp.31 Comp.32  Comp.33  Comp.34  Comp.35  Comp.36
## Standard deviation     0.73738 0.71024 0.690065 0.670553 0.634367 0.622845
## Proportion of Variance 0.01087 0.01009 0.009524 0.008993 0.008048 0.007759
## Cumulative Proportion  0.92217 0.93226 0.941783 0.950775 0.958824 0.966583
##                         Comp.37  Comp.38  Comp.39  Comp.40  Comp.41
## Standard deviation     0.575374 0.536515 0.508130 0.451603 0.431085
## Proportion of Variance 0.006621 0.005757 0.005164 0.004079 0.003717
## Cumulative Proportion  0.973204 0.978961 0.984125 0.988203 0.991920
##                         Comp.42 Comp.43 Comp.44 Comp.45   Comp.46
## Standard deviation     0.355832 0.29327 0.25395 0.23239 0.1725981
## Proportion of Variance 0.002532 0.00172 0.00129 0.00108 0.0005958
## Cumulative Proportion  0.994452 0.99617 0.99746 0.99854 0.9991383
##                          Comp.47   Comp.48   Comp.49   Comp.50
## Standard deviation     0.1401187 0.1114153 0.0914260 5.178e-02
## Proportion of Variance 0.0003927 0.0002483 0.0001672 5.362e-05
## Cumulative Proportion  0.9995309 0.9997792 0.9999464 1.000e+00

Interesting. It takes several eigenvectors to explain these data sufficiently. Here is a simple plot of some of model, given by computing the predicted values for each sample. I then use ggplot() to make a scatter plot with clade dictating the shape of the symbol (each symbol is an individual and clade was determined by mtDNA, not these data), and Clade to provide the color.

pred <- predict(fit.pca)
df <- data.frame(PC1 = pred[, 1], PC2 = pred[, 2], Species = arapat$Species, 
    Clade = arapat$Cluster, Pop = arapat$Population)
ggplot(df) + geom_point(aes(x = PC1, y = PC2, shape = Species, color = Clade), 
    size = 3, alpha = 0.75)

plot of chunk unnamed-chunk-43

Looks like there are three main groups divided by clade and within the more dense clade, there is some sub-structuring. I’ll take the data that is in the main clade and do a quick hierarchical clustering analysis.

baja <- pred[df$Species == "Peninsula", ]
h <- hclust(dist(baja), method = "single")
plot(h, main = "Main Baja California Clade", xlab = "")

plot of chunk unnamed-chunk-44

Measures of Genetic Diversity

Genetic diversity is estimated by several different means. It can be estimated at several different levels; at individuals, at groups, at populations, etc. It can also be estimated by several different parameters. This section covers some of the more common parameters used for quantifying genetic diversity.

Allelic Diversity

At the most basic level, the number of alleles within a group of individuals is a base measure of diversity. However, there are some caveats to be made about the way in which we count alleles. Rare alleles may or may not be as informative. The three ways commonly used to look at allelic diversity are

The total number of alleles ($A$).
The effective number of alleles ($A_e$).
The number of alleles with at least 5% frequency ($A_{95}$).

These parameters are estimated from your data using the genetic_diversity() function. The argument mode takes either “A” (the default), “Ae”, or “A95” to differentiate.

AA <- locus(c("A", "A"))
AB <- locus(c("A", "B"))
loci <- c(AA, AB, AA, AA, AA, AA, AA, AA, AA, AA, AA)
loci

##  [1] "A:A" "A:B" "A:A" "A:A" "A:A" "A:A" "A:A" "A:A" "A:A" "A:A" "A:A"

genetic_diversity(loci)

##      Ae
## 1 1.095

genetic_diversity(loci, mode = "Ae")

##      Ae
## 1 1.095

genetic_diversity(loci, mode = "A95")

## [1] 1

Rarefaction

Rarefaction is a technique used to measure diversity in different populations. It is particularly important for situations where you have different sample sizes. Is there more diversity in the larger population because you sampled more or is it a truly more diverse population? I’ll use the data from the beetle to show how diversity changes with sample sizes and highlight how you can use the rarefaction() function.

In the mainland populations, there are only 36 samples and the allelic diversity is relatively low at the WNT locus.

loci.son <- arapat$WNT[arapat$Species == "Mainland"]
length(loci.son)

## [1] 36

genetic_diversity(loci.son, mode = "Ae")

##      Ae
## 1 1.034

The larger clade on the peninsula has many more individuals and is more diverse.

loci.peninsula <- arapat$WNT[arapat$Species == "Peninsula"]
length(loci.peninsula)

## [1] 252

genetic_diversity(loci.peninsula, mode = "Ae")

##      Ae
## 1 2.025

So is this difference a consequence of the sample sizes or is peninsular Baja California more genetically diverse? To answer this, rarefaction randomly sub-samples the loci.baja data and estimates the value of $\hat{A}_e$ for samples of size 36 (for our case) to see if the observed diversity differences are due to sampling alone. To visualize the distribution, I throw it into a data.frame and use the ggplot2 functions to make the pretty colored histogram.

Ae.peninsula <- rarefaction(loci.peninsula, mode = "Ae", size = 36)
df <- data.frame(Ae.peninsula)
ggplot(df, aes(x = Ae.peninsula)) + geom_histogram(aes(fill = ..count..), binwidth = 0.05) + 
    scale_fill_gradient("count", low = "#cccccc", high = "#a60000") + theme_bw()

plot of chunk hist-Ae

Heterozygosity

At a base level, heterozygosity is a form of diversity (see Nei). Heterozygosity can be measured at many different stratum and in two forms. All of these approaches can be accessed through the functions Ho() (for observed heterozygosity)

Ho(arapat$EF)

##    Ho 
## 0.144

and He() (for expected heterozygosity given Hardy Weinberg Equilibrium).

He(arapat$EF)

## [1] 0.4362

Both $H_E$ and $H_O$ can be used with full data sets as well. When you pass a full data.frame to these functions, it will return a data.frame with loci by row.

He(arapat)

##   Locus     He
## 1  LTRS 0.4989
## 2   WNT 0.6528
## 3    EN 0.4489
## 4    EF 0.4362
## 5   ZMP 0.3394
## 6   AML 0.8294
## 7  ATPS 0.7194
## 8  MP20 0.8186

Given the broadness of these functions, it is easy to integrate them into broader analyses. Here is an example of expected heterozygosity ($H_E$ or genetic diversity in Nei’s terms) as a function of latitude for the peninsular populations. The output is displayed as a plot.

baja <- arapat[arapat$Species != "Mainland", ]
coords <- strata_coordinates(baja)
pops <- partition(baja, stratum = "Population")
he <- lapply(pops, function(x) return(He(x$EN, small.sample.correction = TRUE)))
data <- merge(coords, data.frame(Stratum = names(he), He = unlist(he)))
data <- data[order(data$Latitude), ]
ggplot(data, aes(x = Latitude, y = He)) + geom_line(linetype = 2) + geom_point(color = "red", 
    size = 4)

plot of chunk unnamed-chunk-51

Inbreeding

Inbreeding is a consequence of mating patterns and/or demographic population size. The consequences of inbreeding are related to how alleles are put into genotypes. One approach to looking at inbreeding is to estimate the expected frequency of heterozygotes (e.g., the $2pq$ part of the classic Hardy-Weinberg equation) and compare it to the observed level of inbreeding. This is the classic $F$ statistic and is estimated as:

\[ F_{IS} = 1 - \frac{H_{O}}{H_{E}} \]

Using the beetle data again, from the maps above we see three mainland populations and there is a good reason to believe that these are a separate species (see Garrick et al. 2013). These populations are small and isolated and as such may experience inbreeding.

sonora <- arapat[arapat$Species == "Mainland", ]
fis.sonora <- Fis(sonora)
fis.sonora

##   Locus      Fis
## 1  LTRS -0.21549
## 2   WNT -0.01695
## 3    EN  0.54545
## 4    EF -0.19298
## 5   ZMP  1.00000
## 6   AML  0.73474
## 7  ATPS  0.33436
## 8  MP20  0.17359

There are various ways to get a confidence interval on these kind of analyses. In what follows is an implicit test of $H_O: F_{IS} = 0$ using permutation. If that Null hypothesis is true, then any random permutation of alleles (combined into genotypes) sampled from this population would produce estimates of $F_{IS}$ as large as that observed. This kind of permutation is handled by the function permute_ci() (though it can be applied to more complicated analyses as shown below). Here is an example of how to use it to create the null distribution of $F_{IS}$ values given these data for the loci $EN$, $AML$, and $ATPS$.

locus.en <- sonora$EN[!is.na(sonora$EN)]
locus.aml <- sonora$EN[!is.na(sonora$AML)]
locus.atps <- sonora$EN[!is.na(sonora$ATPS)]
fis.en <- permute_ci(locus.en, FUN = Fis, nperm = 99)
fis.aml <- permute_ci(locus.aml, FUN = Fis, nperm = 99)
fis.atps <- permute_ci(locus.atps, FUN = Fis, nperm = 99)

Now we can plot these as histograms to look at their distributions.

df <- data.frame(Locus = rep(c("EN", "AML", "ATPS"), each = length(fis.en)), 
    Fis = c(fis.en, fis.aml, fis.atps))
fis.sonora

##   Locus      Fis
## 1  LTRS -0.21549
## 2   WNT -0.01695
## 3    EN  0.54545
## 4    EF -0.19298
## 5   ZMP  1.00000
## 6   AML  0.73474
## 7  ATPS  0.33436
## 8  MP20  0.17359

ggplot(df, aes(x = Fis)) + geom_histogram(binwidth = 0.025) + facet_grid(Locus ~ 
    .) + theme_bw()

plot of chunk unnamed-chunk-54

Estimating a confidence interval around a point estimate is a bit different. In the example above, we could ask if $F_{IS,EN}=$ {r}fis.sonora[3] was different than $F_{IS,ATPS}=$ {r}fis.sonora[7], which is an entirely different question than the one addressing $H_O: F_{IS} = 0$. That being said, it is not too difficult to do this given the tools you have in R.

Measures of Genetic Distance

There are several genetic distances available within the gstudio package and the ability to use multivariate analogs of genotypes opens up all distance essentially all distance metrics to the end user. In the stuff that follows are the distances that are internally implemented in the package along with a brief overview.

Since all the genetic distance approaches take the same general data, gstudio provides a general interface for all distance metrics in the genetic_distance() function. This function takes the data as either a data frame with locus objects or a vector of locus objects and the genetic distance metric to be estimated and returns the appropriate response. The general form of this function is as follows and only differs in the sense that individual genetic distances do not require a stratum whereas strata distances do. However, as all functions in gstudio that need a stratum, if your default name for that column is “Population” you do not need to provide it.

amova.dist <- genetic_distance(data, mode = "AMOVA")
nei.dist <- genetic_distance(data, stratum = "Population", mode = "Nei")

What follows is a more in-depth overview of each of the genetic distance metrics.

Individual Distances

At the base level, you can estimate distances among individuals resulting in an $NxN$ matrix of pair-wise distances. This is internally how the AMOVA analysis is conducted and is a nice heuristic for conceptual understanding of variance decomposition.

In the following examples, I will use a made-up data locus consisting of four alleles and an individual with each genotype to show how these distances work. Here is the data:

AA <- locus(c("A", "A"))
AB <- locus(c("A", "B"))
BB <- locus(c("B", "B"))
AC <- locus(c("A", "C"))
AD <- locus(c("A", "D"))
BC <- locus(c("B", "C"))
BD <- locus(c("B", "D"))
CC <- locus(c("C", "C"))
CD <- locus(c("C", "D"))
DD <- locus(c("D", "D"))
loci <- c(AA, AB, AC, AD, BB, BC, BD, CC, CD, DD)

AMOVA Distance

The AMOVA distance metric was first introduced in Excoffier et al. (1992) using restriction fragment encodings. However, a more elegant description of it was described in Smouse & Peakall (1999) using a geometric interpretation. Essentially, the coding of alleles at a locus can be depicted by a vector $\vec{p}$ whose length is equal to the number of alleles at the locus. The presence of an allele increments the appropriate element of that vector. For two individuals, the squared vector distance between them is:

\[ \delta_{ij}^2 = 2(p_i-p_j)^2 \]

Across loci, these are additive (though see Smouse & Peakall for additional weighing schemes by locus or allele and the relative power of adopting such approaches). Across a set of individuals the squared genetic distance between all individuals can be represented by a square symmetric matrix with a zero diagonal, D. The AMOVA analysis itself is conducting by taking the “Sums of Squared Genetic Distances” for all individuals (SSGD(Total)), within group SSW, and among groups SSA and variance is decomposed following the standard approach for a random effects model. There is essentially no mystery in this approach, though it has been shrouded in obscurity. Dyer et al. (2004) showed how this is just a multivariate linear model amenable to a much broader range of experimental designs than just 1-way and nested designs.

D <- dist_amova(loci)
rownames(D) <- colnames(D) <- as.character(loci)
D

##     A:A A:B A:C A:D B:B B:C B:D C:C C:D D:D
## A:A   0   1   1   1   4   3   3   4   3   4
## A:B   1   0   1   1   1   1   1   3   2   3
## A:C   1   1   0   1   3   1   2   1   1   3
## A:D   1   1   1   0   3   2   1   3   1   1
## B:B   4   1   3   3   0   1   1   4   3   4
## B:C   3   1   1   2   1   0   1   1   1   3
## B:D   3   1   2   1   1   1   0   3   1   1
## C:C   4   3   1   3   4   1   3   0   1   4
## C:D   3   2   1   1   3   1   1   1   0   1
## D:D   4   3   3   1   4   3   1   4   1   0

Bray Curtis Individual Distance

Bray-Curtis Distance (Bray & Curtis 1957) has been primarily used to quantify differences in species composition. It is defined as the total number of species that are unique to either of the two sites standardized by the number of species in both sites.

\[ BC_\delta = \frac{S_i + S_j - 2S_{ij}}{S_i+S_j} \]

where $S_x$ is the species count and $S_{ij}$ is the sum of minimum abundances. Lately, this has seen considerable use within individual-based landscape genetic studies. Missing genotypes are set to average allele frequencies, that is to say that every missing genotype is considered to have all the alleles present in the entire population, but with probability equal to their global frequencies. Essentially, this removes the problem like in the situation and does so by taking the non-missing genotype’s genetic distance from the global genetic centroid (it’s cosmic man!). Here is the estimation using two loci.

D <- dist_bray(loci)
rownames(D) <- colnames(D) <- as.character(loci)
D

##       A:A   A:B   A:C   A:D   B:B   B:C   B:D   C:C   C:D   D:D
## A:A 0.000 0.000 0.125 0.125 0.125 0.000 0.000 0.000 0.000 0.000
## A:B 0.000 0.000 0.000 0.000 0.125 0.000 0.000 0.125 0.000 0.125
## A:C 0.125 0.000 0.000 0.125 0.125 0.125 0.125 0.125 0.000 0.000
## A:D 0.125 0.000 0.125 0.000 0.125 0.000 0.125 0.000 0.125 0.125
## B:B 0.125 0.125 0.125 0.125 0.000 0.000 0.000 0.125 0.000 0.125
## B:C 0.000 0.000 0.125 0.000 0.000 0.000 0.125 0.125 0.000 0.000
## B:D 0.000 0.000 0.125 0.125 0.000 0.125 0.000 0.125 0.125 0.125
## C:C 0.000 0.125 0.125 0.000 0.125 0.125 0.125 0.000 0.000 0.125
## C:D 0.000 0.000 0.000 0.125 0.000 0.000 0.125 0.000 0.000 0.125
## D:D 0.000 0.125 0.000 0.125 0.125 0.000 0.125 0.125 0.125 0.000

Strata Distances

Genetic distances can also be estimated among partitioned groups of individuals. Here, I will use the data from the mainland Sonoran beetle data set using locus $ATPS$ and the $Population$ stratum as it is small enough to easily display all the results.

data <- arapat[arapat$Species == "Mainland", c(3, 13)]

It should be noted that for some of these metrics, we need to make assumptions about how to represent them as true ‘multilocus’ estimators. Where assumptions are being made, a warning message will be displayed to remind the user that there is an assumption being made in the estimation.

Euclidean Distance

Euclidean distance is the most straight-forward distance metric available as it is essentially straight-line distance based upon the allele frequencies in each population. It is given by:

\[ d_{eucl} = \sqrt{ \sum_{j=1}^L(p_{ij} - p_{kj})^2 } \]

where $p_{ij}$ and $p_{kj}$ are the frequencies of the $j^{th}$ allele in both the $i^{th}$ and $j^{th}$ population. In this and the following distance examples, I am going to take the resulting distance matrix among all pairs of populations and put them into a Neighbor joining tree (via the nj() function from the ape package) as it may be easier to see differences in topologies rather than matrices.

It is perhaps easiest to think of Euclidean distance in x,y coordinate space. This distance can be estimated by stratum.distance() using the optional parameter method=‘eucl’ and it will return a dist matrix.

dist_euclidean(data)

##        101    102     32
## 101 0.0000 0.3335 0.3438
## 102 0.3335 0.0000 0.0874
## 32  0.3438 0.0874 0.0000

Cavalli-Sforza Distance

Another distance approach that is commonly used for microsatellite loci is Cavalli-Sforza distance, $D_C$ (Cavalli-Sforza and Edwards, 1967). Here population allele frequencies are plot on the surface of a sphere (radius=1) using the square root of the allele frequencies.

\[ D_C = \frac{2}{\pi}\sqrt{(2-2cos\theta)} \]

The genetic distance, $D_C$ is measured as the chord distance as indicated in Figure. The resulting Neighbor joining tree from this distance is shown in Figure

\[ D_{CS,m,n} = \frac{2}{\pi}\sqrt{ 2 * 1-\sum_{i=1}^{\ell} p_{m,i}*p_{n,i}} \]

dist_cavalli(data)

##        101    102     32
## 101 0.0000 0.2725 0.3368
## 102 0.2725 0.0000 0.2395
## 32  0.3368 0.2395 0.0000

The $D_{PS}$ Distance

The Bray Curtis distance can also be estimated on groups of individuals. However, when it is done it is often represented as $D_{PS} = (1 - P_S)$ (where $P_S$ is the proportion of shared alleles), which I will follow here for simplicity such that the individual distances and the strata distances are not confused. The $D_{PS}$ distance metric is directly related to Jaccard’s distance as:

\[ D_{PS} = \frac{-J_{\delta(A,B)}}{J_{\delta(A,B)}-2} \]

dist_bray(data)

##        101    102     32
## 101 0.0000 0.1823 0.1798
## 102 0.1823 0.0000 0.2303
## 32  0.1798 0.2303 0.0000

It should be noted here that the function dist_bray() is used for both individual distance AND strata distance depending upon what you pass to it. An individual distance is found by passing it a vector of loci whereas a stratum distance is returned by passing it a data.frame object (that has a $stratum$ column). In the genetic_distance() function these are also differentiated by mode=“Bray” for the individual one and mode=“Dps” for the stratum.

Jaccard Distance

Jaccard distance is a set-theoretic distance quantifying dissimilarity. Assuming that loci are sets of alleles, the Jaccard dissimilarity between genotypes $A$ and $B$ is given by:

\[ J_{\delta(A,B)} = \frac{|A \bigcup B| - |A \bigcap B|}{|A \bigcup B|} \]

dist_jaccard(data)

##        101    102     32
## 101 0.0000 0.3084 0.3048
## 102 0.3084 0.0000 0.3743
## 32  0.3048 0.3743 0.0000

The reverse relationship between $D_{PS}$ and $J_{\delta(A,B)}$ is given as:

\[ J_{\delta(A,B)} = \frac{2+D_{PS}}{1 + D_{PS}} \]

Conditional Genetic Distance (cGD; Graph Distance)

Conditional genetic distance, cGD, is a measure based upon conditional genetic covariance and is distinctly different than these other measures as it is not a pair-wise distance metric. Rather it is the distance through a Population Graph topology whose construction is determined by the totality of the data. To estimate cGD from your data, you can use the genetic.distance() function as before and it will do right thing and return you a matrix. However, you should probably look into what a Population Graph really is prior to using it. You will find more information below as well as in the documents for the popgraph package itself. For consistency, the function is shown below BUT this data set with 3 populations is too small to be of interest for a network analysis (how many ways can you connect 3 things…)

dist_cgd(data)

##       101    102     32
## 101 0.000 1.0737 1.2657
## 102 1.074 0.0000 0.4985
## 32  1.266 0.4985 0.0000

Nei’s Genetic Distance

A very common metric of genetic distance, if you think the data you have are due to drift/mutation balance, is that of Nei. The implementation in gstudio of Nei’s distance is based upon the sample size correction from 1978, and is calculated as:

\[ I = \frac{(2N-1)\sum_{l=1}^L\sum_{m=1}^M p_{Alm}p_{Blm}}{\sqrt{ \sum_{l=1}^L(2N\sum_{m=1}^{M_A}p_{Alm}^2-1)(2N\sum_{m=1}^{M_B}p_{Blm}^2 -1)}} \]

for the “Genetic Identity” (where $p_A$ is the allele frequencies at one population and $p_B$ are the corresponding frequencies for the other, $L$ is across loci and $M$ is across alleles at the $l^{th}$ locus).

Nei’s (1978) genetic distance, $D_N$, is:

\[ D_N = -ln(I) \]

dist_nei(data)

##         101      102       32
## 101 0.00000 0.037411 0.049440
## 102 0.03741 0.000000 0.002092
## 32  0.04944 0.002092 0.000000

Comparing strata distance metrics

There are several more genetic distance metrics available and each may have its own set of assumptions. However, it is also true that across these metrics, there is some similarity between them. Just for illustrative purposes, we’ll look at the various strata distances and plot them against each other to look at the correlative structure between alternative measures using the $ATPS$ locus and the entire beetle data set.

x <- arapat[, c(3, 13, 14)]
summary(x)

##    Population       ATPS          MP20    
##  32     : 19   05:05  :155   05:07  : 64  
##  75     : 11   03:03  : 69   07:07  : 53  
##  Const  : 11   09:09  : 66   18:18  : 52  
##  12     : 10   02:02  : 30   05:05  : 48  
##  153    : 10   07:09  : 14   05:06  : 22  
##  157    : 10   08:08  :  9   (Other):119  
##  (Other):292   (Other): 20   NA's   :  5

Now, we’ll grab the distance metrics

dist.euc <- genetic_distance(x, mode = "Euclidean")
dist.cgd <- genetic_distance(x, mode = "cGD")
dist.nei <- genetic_distance(x, mode = "Nei")
dist.dps <- genetic_distance(x, mode = "Dps")
dist.jac <- genetic_distance(x, mode = "Jaccard")

and then take the upper triangle of each and put them into a $data.frame$.

df <- data.frame(Euclidean = dist.euc[upper.tri(dist.euc)])
df$cGD <- dist.cgd[upper.tri(dist.cgd)]
df$Nei <- dist.nei[upper.tri(dist.nei)]
df$Dps <- dist.dps[upper.tri(dist.dps)]
df$Jaccard <- dist.jac[upper.tri(dist.jac)]

Before we plot them, there is a bit of cleaning up to do in these data. For populations that have no alleles in common, Nei’s genetic distance will be $Inf$ (e.g., $-log(0)$). Also, with cGD, sets of populations that are independent will also have a infinite distance (e.g., they are not connected so it is impossible to go through the graph from one population to the other). So with these data, we should first remove them.

and then we can plot them against each other and look at their correlations.

df <- df[is.finite(df$Nei), ]
df <- df[is.finite(df$cGD), ]
require(GGally)
ggpairs(df)

plot of chunk pairs

As you can see, there is a great deal of correlation between these parameters.

Genetic Structure

The estimation of structure from genetic data is a common (and commonly misused) endeavor. The gstudio packages makes a distinction between structural parameters and statistical differences. Structural parameters are the various $X_{ST}$ statistics that crop all too quickly. These are simply parameters of the data and are not sensu stricto measures of differentiation. To quote Sewell Wright (1978), “…” All of these parameters are based upon assumptions related to population genetic processes. Statistical differences, are those analyses we can do on genetic data that test a specific hypothesis that is not based upon population genetic understanding, rather the simple properties of multinomial multivariate data.

Structural Parameters

Since first introduced by Sewell Wright as F-statistics, there has been a continued development of related parameters that are used to characterize ‘population structure’ in one way or another. The parameters that gstudio provides are sufficient for most needs. These include:

$G_{ST}$: This is Nei’s parameter.
$G_{ST}^\prime$: This is the modification of Nei’s $G_{ST}$ as proposed by Hedrick.
$D_{est}$: This is the parameter of Joost.

Both $G_{ST}^\prime$ and $D_{est}$ are parameters derived for loci with lots of alleles. There was a heated debate in the literature between Hedrick & Joost about issues related to Nei’s $G_{ST}$ for data with many loci (say >6 to be conservative) and these are two options available for your use. I would recommend looking over the debate to decide which may be more appropriate for you (using them both is a lame option and you will be mocked by your reviewers if you take that approach).

Here are some examples of how to estimate these parameters using the beetle data. In all of these approaches, you pass both the stratum and the locus and they return a data frame.

As in the case for genetic distance measures, structure parameters can also be estimated using either the individual structure functions OR the generalized function genetic_structure() with the appropriate options. The benefit of using genetic_structure() is that it allows you to do pairwise analyses (it returns a matrix of pairwise structure or a list of pairwise matrices, one for each locus).

Nei’s $G_{ST}$ Parameter

Gst(arapat$LTRS, arapat$Population, nperm = 99)

##      Gst     Hs     Ht P
## 1 0.5356 0.2319 0.4992 0

If the loci that you pass is a bunch of loci in a data.frame, it will return the single locus estimate as well as the multilocus estimate (based upon summing the heterosexuality and then estimating it as in Nei, not in just averaging the $G_{ST}$ values as in Berg & Hamrick (XXXX), see XXXX for more on the differences).

Gst(arapat[, c(3, 7:14)], nperm = 99)

##        Locus    Gst     Hs     Ht  P
## 1       LTRS 0.5356 0.2319 0.4992  0
## 2        WNT 0.4364 0.3682 0.6534  0
## 3         EN 0.4283 0.2569 0.4493  0
## 4         EF 0.5550 0.1942 0.4364  0
## 5        ZMP 0.4724 0.1792 0.3397  0
## 6        AML 0.2954 0.5850 0.8302  0
## 7       ATPS 0.7103 0.2085 0.7197  0
## 8       MP20 0.3088 0.5663 0.8194  0
## 9 Multilocus 0.4544 2.5903 4.7474 NA

Hendrick’s $G_{ST}^\prime$ Parameter

A correction to Nei’s $G_{ST}$ was suggested for loci with a lot of alleles. This is because the maximum value for expected heterozygosity is determined by the number of alleles and as such $G_{ST}$ for high allelic loci is not bound on the interval $[0,1]$ but is maxed out below 1.0. As a consequence, Hedrick

sort(unique(matrix(alleles(arapat$MP20), ncol = 1)))

##  [1] "01" "02" "03" "04" "05" "06" "07" "08" "09" "10" "11" "12" "13" "14"
## [15] "15" "16" "17" "18" "19"

Gst_prime(arapat$MP20, arapat$Population, nperm = 99)

##      Gst     Hs     Ht P
## 1 0.7228 0.5663 0.8194 0

In a similar fashion, the multilocus analog can be found by passing a data.frame to the function (again I am skipping the stratum variable to this function as the data has the strata in a column named ‘Population’).

Gst_prime(arapat[, c(3, 7:14)], nperm = 99)

##        Locus    Gst     Hs     Ht  P
## 1       LTRS 0.7015 0.2319 0.4992  0
## 2        WNT 0.6975 0.3682 0.6534  0
## 3         EN 0.5803 0.2569 0.4493  0
## 4         EF 0.6923 0.1942 0.4364  0
## 5        ZMP 0.5783 0.1792 0.3397  0
## 6        AML 0.7227 0.5850 0.8302  0
## 7       ATPS 0.9023 0.2085 0.7197  0
## 8       MP20 0.7228 0.5663 0.8194  0
## 9 Multilocus 0.6878 0.2674 0.5402 NA

Joost’s $D_{est}$ Parameter

Following a discussion back-and-forth between Hedrick & Joost, Joost proposed an alternative measure $D_{EST}$. The estimation of this parameter is found in a similar way as the other structure parameters.

Dest(arapat$MP20, arapat$Population, nperm = 99)

##     Dest     Hs     Ht P
## 1 0.5952 0.5663 0.8194 0

Dest(arapat[, c(3, 7:14)], nperm = 99)

##        Locus   Dest     Hs     Ht  P
## 1       LTRS 0.3496 0.2319 0.4992  0
## 2        WNT 0.4563 0.3682 0.6534  0
## 3         EN 0.2659 0.2569 0.4493  0
## 4         EF 0.3019 0.1942 0.4364  0
## 5        ZMP 0.1999 0.1792 0.3397  0
## 6        AML 0.6037 0.5850 0.8302  0
## 7       ATPS 0.6340 0.2085 0.7197  0
## 8       MP20 0.5952 0.5663 0.8194  0
## 9 Multilocus 0.3629 0.3238 0.5934 NA

Similarities in Parameters

As in genetic distance metrics, there is some similarity in output from these structure parameters. Here is a paired plot of the three parameters as above.

gst <- Gst(arapat)$Gst
gstp <- Gst_prime(arapat)$Gst
dest <- Dest(arapat)$Dest
df <- data.frame(Gst = gst, Gst_prime = gstp, Dest = dest)
require(GGally)
ggpairs(df)

plot of chunk structurepairs

You can tell from these plots which locus has a lot of alleles and which ones do not…

Statistical Differences

There are two statistical tests of differentiation you can conduct in R using gstudio and both of them are based upon the same approach. Originally shown by Weir & Cockerham (1984), the parameter $\theta$ can be though of as the proportion of among-strata variance in a random effects Analysis of Variance based upon single loci. A decade later, Excoffier et al. (1992) showed that the same kind of parameter can be estimated using multilocus data. Unfortunately, they calling the parameter $\Phi_{ST}$ and this has caused some confusion among analysts who confuse this parameter with the structural parameters above. I should not throw stones, I also followed in the obfuscation of these parameters in Smouse et al. (2001) when we showed how to use this approach to analyze pollen pools and called it $\Phi_{FT}$ (the ‘F’ for father as we are analyzing paternal pollen pools). We should have also called it PAMOVA for “Pollen AMOVA” but you cannot change history). Later, we (Dyer et al. 2004) showed that both $\theta$ and $\Phi$ are simply multivariate general linear models based upon $NxN$ covariance matrices rather than $pxp$ ones (though the broader use of more powerful statistical models has yet to be fully embraced). There is a direct link between $\theta$ and $\Phi$ in that they are identical, both statistically and numerically, wen you only use a single locus. What $\Phi$ does it to extend the analysis to a multilocus analysis.

Here is an example with the larger Clade using the function adonis() from the vegan package.

data <- arapat[arapat$Species == "Cape", ]
D <- dist(dist_amova(data))
Pop <- factor(as.character(data$Population))
require(vegan)
adonis(D ~ Pop)

## 
## Call:
## adonis(formula = D ~ Pop) 
## 
## Terms added sequentially (first to last)
## 
##           Df SumsOfSqs MeanSqs F.Model    R2 Pr(>F)  
## Pop       13      1029    79.2    1.75 0.272  0.033 *
## Residuals 61      2759    45.2         0.728         
## Total     74      3788                 1.000         
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Parent-Offspring Data

There are several functions in the gstudio package that relate to working with parent/offspring data, due in large part to my own research program. Here is an example of how one can work with these kinds of data.

Data Formatting

So when we have both adult and offspring data in the same data file, we need to come up with a way to make sure we can differentiate between adults and offspring. In gstudio I do this using two identification columns. 1. All adults should have a unique identification column. The default for this is “ID” in the functions and it is pretty easy to remember that one.
2. Offspring have an ID column as well and it must be the same as the maternal individual. Again, I’m coming from a plant perspective and think of this as seeds on a tree. 3. A second identification column (default=“OffID”) allows the differentiation between the maternal individual and their offspring. By convention, maternal individuals have OffID=“0” and all the offspring from that mother have a value different than “0”. It should be noted that identification columns are just like stratum columns and are commonly encoded as factor objects.

With this out of the way, working with parent/offspring data can be rather easy. Here are a few things that can be done using the following small data set. I’m going to make this randomly using the mate() function (see below in the section on Simulations for more on this) and then order the data set by mother and offspring.

pgm <- c(locus(1:2), locus(c(1, 1)), locus(c(2, 2)), locus(1:2), locus(c(1, 
    1)))
tpi <- c(locus(3:4), locus(c(2:3)), locus(c(4, 4)), locus(c(3, 5)), locus(c(3, 
    4)))
fe <- c(locus(c(5, 7)), locus(c(5, 7)), locus(c(5, 5)), locus(c(5, 5)), locus(c(5, 
    7)))
ID <- factor(paste("Ind", 1:5, sep = "-"))
data <- data.frame(ID = ID, OffID = 0, PGM = pgm, TPI = tpi, FE = fe)
data

##      ID OffID PGM TPI  FE
## 1 Ind-1     0 1:2 3:4 5:7
## 2 Ind-2     0 1:1 2:3 5:7
## 3 Ind-3     0 2:2 4:4 5:5
## 4 Ind-4     0 1:2 3:5 5:5
## 5 Ind-5     0 1:1 3:4 5:7

offs <- mate(data[1, ], data[2, ], N = 10)
data <- rbind(data, offs)
data <- data[order(data$ID, data$OffID), ]
data

##       ID OffID PGM TPI  FE
## 1  Ind-1     0 1:2 3:4 5:7
## 6  Ind-1     1 1:2 3:4 5:7
## 7  Ind-1     2 1:1 3:4 5:5
## 8  Ind-1     3 1:2 3:4 5:7
## 9  Ind-1     4 1:1 2:3 7:7
## 10 Ind-1     5 1:1 3:3 5:7
## 11 Ind-1     6 1:1 3:4 7:7
## 12 Ind-1     7 1:1 2:4 7:7
## 13 Ind-1     8 1:2 2:4 5:7
## 14 Ind-1     9 1:2 3:4 5:5
## 15 Ind-1    10 1:1 3:4 5:5
## 2  Ind-2     0 1:1 2:3 5:7
## 3  Ind-3     0 2:2 4:4 5:5
## 4  Ind-4     0 1:2 3:5 5:5
## 5  Ind-5     0 1:1 3:4 5:7

Paternal Contributions

From each offspring it is possible to remove the contribution of the maternal individual, leaving the male haplotype (e.g, a pollen genotype in plant context).

minus_mom(data)

##       ID OffID PGM TPI  FE
## 6  Ind-1     1 1:2 3:4 5:7
## 7  Ind-1     2   1 3:4   5
## 8  Ind-1     3 1:2 3:4 5:7
## 9  Ind-1     4   1   2   7
## 10 Ind-1     5   1   3 5:7
## 11 Ind-1     6   1 3:4   7
## 12 Ind-1     7   1   2   7
## 13 Ind-1     8 1:2   2 5:7
## 14 Ind-1     9 1:2 3:4   5
## 15 Ind-1    10   1 3:4   5

Here you can see that some of the offspring can have their genotypes reduced whereas others cannot. If mother and offspring are both identical heterozygotes, then it makes it difficult to know if the dad gave the first allele and the mother gave the second or vice-versa. You can work with these reduced genotypes in the same way as you would regular populations (this is essentially what we did in the 2Gener analysis).

Paternity Exclusion

Another common parent/offspring analysis consists of trying to identify paternity of individuals. If you know the mother and have genotyped mother, offspring, and potential fathers, you can estimate paternity. With many loci, you can get very specific. However, that is not always the case in plants for two reasons: 1. We often have MANY potential fathers, not just the ‘accused’ one. +. We may not have the level of exclusion required to have only 1 potential father.

So here are some convenience functions. First we will look at the exclusion probability. This is the probability that you can exclude a randomly selected individual from paternity based upon the allele frequencies in the population.

freqs <- frequencies(data[data$OffID == 0, ])
freqs

##   Locus Allele Frequency
## 1   PGM      1       0.6
## 2   PGM      2       0.4
## 3   TPI      2       0.1
## 4   TPI      3       0.4
## 5   TPI      4       0.4
## 6   TPI      5       0.1
## 7    FE      5       0.7
## 8    FE      7       0.3

pexcl <- exclusion_probability(freqs)
pexcl

##   Locus  Pexcl PexclMax Fraction
## 1    FE 0.1659   0.1875   0.8848
## 2   PGM 0.1824   0.1875   0.9728
## 3   TPI 0.3927   0.5039   0.7793

So for one locus, with adult allele frequencies equal to $p_A = 0.6;\;p_B=0.4$, we have an exclusion probabilities range from 16-39%. The exclusion_probability() function also provides the theoretical maximum exclusion one could get with a locus where all alleles are at equal frequency, as well as how close your observed loci come to that theoretical maximum.

The multilocus exclusion probability can be found by compounding the individual locus ones as:

\[ P_{excl,multilocus} = 1 - \prod_{i=1}^\ell(1 - P_{excl,i}) \]

p.tot <- 1 - prod(1 - pexcl$Pexcl)
p.tot

## [1] 0.5858

p.max <- 1 - prod(1 - pexcl$PexclMax)
p.max

## [1] 0.6725

p.tot/p.max

## [1] 0.8711

Showing that we capture a reasonable amount of the theoretical maximum exclusion we could expect. However, it is still not enough to have a great power of exclusion (e.g., we should at least expect $1/p.tot=${r}1/p.tot) dads per offspring.

Fractional Paternity

One approach found in plant gene flow studies is that of fractional paternity analysis. Here, even if individuals cannot be assigned unambiguously, their relative likelihood of paternity can be estimated. If all dads are known with certainty, then their posterior likelihood will be $F_{IJ}=1.0$ but if more than one dad my be implicated as a potential father, then their relative likelihood is provided.

In the example below I separate out the offspring, mother and potential dads and then run the first three through the paternity() function.

offs <- data[data$OffID != "0", ]
dads <- data[data$OffID == "0", ]
mom <- data[1, ]
p <- paternity(offs, mom, dads)
p

##    MomID OffID DadID     Fij
## 1  Ind-1     1 Ind-1 0.25000
## 2  Ind-1     1 Ind-3 0.25000
## 3  Ind-1     1 Ind-5 0.25000
## 4  Ind-1     1 Ind-2 0.12500
## 5  Ind-1     1 Ind-4 0.12500
## 6  Ind-1     2 Ind-5 0.40000
## 7  Ind-1     2 Ind-1 0.20000
## 8  Ind-1     2 Ind-2 0.20000
## 9  Ind-1     2 Ind-4 0.20000
## 10 Ind-1     3 Ind-1 0.25000
## 11 Ind-1     3 Ind-3 0.25000
## 12 Ind-1     3 Ind-5 0.25000
## 13 Ind-1     3 Ind-2 0.12500
## 14 Ind-1     3 Ind-4 0.12500
## 15 Ind-1     4 Ind-2 1.00000
## 16 Ind-1     5 Ind-2 0.33333
## 17 Ind-1     5 Ind-5 0.33333
## 18 Ind-1     5 Ind-1 0.16667
## 19 Ind-1     5 Ind-4 0.16667
## 20 Ind-1     6 Ind-5 0.50000
## 21 Ind-1     6 Ind-1 0.25000
## 22 Ind-1     6 Ind-2 0.25000
## 23 Ind-1     7 Ind-2 1.00000
## 24 Ind-1     8 Ind-2 1.00000
## 25 Ind-1     9 Ind-3 0.36364
## 26 Ind-1     9 Ind-1 0.18182
## 27 Ind-1     9 Ind-4 0.18182
## 28 Ind-1     9 Ind-5 0.18182
## 29 Ind-1     9 Ind-2 0.09091
## 30 Ind-1    10 Ind-5 0.40000
## 31 Ind-1    10 Ind-1 0.20000
## 32 Ind-1    10 Ind-2 0.20000
## 33 Ind-1    10 Ind-4 0.20000

While these offspring are randomly created by the mate() function (and that means every time I make this document they are different), you can see some variability in how many individuals can actually be assigned paternity. It should be noted though that Ind-2 better be indicated as a potential father in all those offspring (’cuz he was the real father).

This is only a starting point and it is intended to provide some quick familiarity to the use of paternity analyses. As shown below in the section on simulations, it would be easy to simulate such things as cryptic gene flow, etc.

Population Graphs

A Population Graph is a model-free network created using conditional genetic covariance. Connectivity within a population graph is based solely upon the multilocus genetic covariance, which is dominated by neutral processes such as gene flow and historical demography. The routines to create population graphs were previously nestled within this package but due to the expansion of the approach outside genetic analyses, I opted to pull those routines out and put them in their own packages. I will show briefly the use of population graphs here, just enough to show how it is done. There is a complete write-up on it (like this one) on my website at !(http://dyerlab.bio.vcu.edu/docs/popgraph.html that goes into it in much greater detail and shows you how to do various analyses on the data.

require(popgraph)

By default, the input to popgraph() is a multivariate data set and a factor to be used to identify the nodes. For this example, we’ll use the larger clade in the beetle, Peninsula, as an example.

data <- to_mv(arapat)
pops <- arapat$Population
graph <- popgraph(x = data, groups = pops)

Now you have the graph and can plot and work with it as you find useful. Here I’ll make a color vector to indicate the species collected in each population. You can get an idea of this by using the table() function as follows:

t <- table(arapat$Species, arapat$Population)
t

##            
##             101 102 12 153 156 157 159 160 161 162 163 164 165 166 168 169
##   Cape        0   0  0   0   6   8   0   0   0   0   3   2   0   2   0   0
##   Mainland    9   8  0   0   0   0   0   0   0   0   0   0   0   0   0   0
##   Peninsula   0   0 10  10   0   2   9  10  10  10   7   8  10   8  10  10
##            
##             171 173 175 177 32 48 51 58 64 73 75 77 84 88 89  9 93 98 Aqu
##   Cape        0   0   0   0  0 10  0  0  0  8 10  1  0  0  0  0  0  3   4
##   Mainland    0   0   0   0 19  0  0  0  0  0  0  0  0  0  0  0  0  0   0
##   Peninsula  10  10   7  10  0  0  7  9  5  2  1  9  9 10 10  9 10  1   4
##            
##             Const ESan Mat SFr
##   Cape          8    6   4   0
##   Mainland      0    0   0   0
##   Peninsula     3    2   1   9

And from there create the node colors from there. I’ll color the mainland populations in dark grey, the clade only populations in light gray, the large clade in red and the populations mixed with both peninsular species in salmon colors.

nodes <- V(graph)$name
color <- rep("#f4a582", length(nodes))
# the A clade
color[t[1, ] != 0] <- "#bababa"
# the B clade
color[t[2, ] != 0] <- "#404040"
# the C clade
color[t[3, ] != 0] <- "#ca0020"
# the mixed populations
color[t[1, ] != 0 & t[3, ] != 0] <- "#f4a582"
color

##  [1] "#404040" "#404040" "#ca0020" "#ca0020" "#bababa" "#f4a582" "#ca0020"
##  [8] "#ca0020" "#ca0020" "#ca0020" "#f4a582" "#f4a582" "#ca0020" "#f4a582"
## [15] "#ca0020" "#ca0020" "#ca0020" "#ca0020" "#ca0020" "#ca0020" "#404040"
## [22] "#bababa" "#ca0020" "#ca0020" "#ca0020" "#f4a582" "#f4a582" "#f4a582"
## [29] "#ca0020" "#ca0020" "#ca0020" "#ca0020" "#ca0020" "#f4a582" "#f4a582"
## [36] "#f4a582" "#f4a582" "#f4a582" "#ca0020"

And now we can plot it. If the graph has a color attributed for the vertices (the nodes), the it will automagically use it. Here I omit the population name for brevity.

V(graph)$color <- color
plot(graph, vertex.label = "")

plot of chunk unnamed-chunk-83

For a lot more on how to use and analyze population graph topologies see the documentation associated with it on my website or from here.

Simulations

The gstudio packages is starting to include some functions that aid in basic simulations of genetic processes. These functions are generally concerned with combining data through a mating process and/or making random populations that follow some particular format.

'+' operator for locus objects. You can take two locus objects and add them together (genetically) and you will produce an offspring genotype.
mate() This function takes two entire individuals accumulating genetic data across loci. By convention the first is the maternal individual and as such the meta data associated with her will be transferred to the offspring (e.g., I am assuming a basic plant model where the offspring is spatially and otherwise associated with the maternal individual).
make_population() This function takes a data.frame containing allele frequencies. This data.frame should mimic what is returned by the frequencies() function. This means that you can make populations that have a single locus, many loci in a population, or several stratum with several loci. However, it does check to see that the format of the data.frame conforms to the format as returned by frequencies().

A Migration/Drift Example

To illustrate a basic example of how these functions work, we’ll do a simple migration model as follows: 1. Create two populations, with different allele frequencies. +. Estimate Gst (simple low diversity locus estimate of structure) +. Migrate some of the individuals (say 10% of the adults) between the populations. I’m going to just switch them directly such that migration is symmetric. +. Mate the individuals to create a new set of populations. +. Go to #2 above and repeat 20 generation recording the estimate of Gst each time. +. Plot the result.

To do this I’ll create a few functions for the migration and mating parts so that you can get an idea of how you can use this yourself (and perhaps enjoy a little cut-n-paste action if you like).

Making Populations

But first, lets create the populations.

freqs <- data.frame(Stratum = rep(c("Pop-1", "Pop-2"), each = 2), Locus = rep("TPI", 
    4), Allele = rep(c("A", "B"), time = 2), Frequency = c(0.75, 0.25, 0.25, 
    0.75))
freqs

##   Stratum Locus Allele Frequency
## 1   Pop-1   TPI      A      0.75
## 2   Pop-1   TPI      B      0.25
## 3   Pop-2   TPI      A      0.25
## 4   Pop-2   TPI      B      0.75

pops <- make_population(freqs, N = 100)
summary(pops)

##  Population        ID          TPI    
##  Pop-1:100   Min.   :  1.0   A:A :64  
##  Pop-2:100   1st Qu.: 25.8   A:B :69  
##              Median : 50.5   B:B :67  
##              Mean   : 50.5            
##              3rd Qu.: 75.2            
##              Max.   :100.0

You can check the creation of the individuals by looking at the resulting allele frequencies. However, it should be noted that you cannot often exactly mimic the allele frequencies due to randomness in selecting alleles (e.g., a fair coin does not give 50/50 with only a few tosses). The following shows they are pretty close.

frequencies(pops, stratum = "Population")

##   Stratum Locus Allele Frequency
## 1   Pop-1   TPI      A     0.770
## 2   Pop-1   TPI      B     0.230
## 3   Pop-2   TPI      A     0.215
## 4   Pop-2   TPI      B     0.785

Isolate Breaking: A Cautionary Tale

As a side note, this is a great place to point out the effects of isolate breaking (e.g., the issue of combining populations when we should not). Here each population is pretty close to no inbreeding (e.g., close to HWE).

Fis(pops[pops$Population == "Pop-1", ])

##   Locus      Fis
## 1   TPI -0.01637

Fis(pops[pops$Population == "Pop-2", ])

##   Locus     Fis
## 1   TPI 0.02237

But if we combine them we see a totally different thing.

Fis(pops)

##   Locus    Fis
## 1   TPI 0.3098

This is a rather poorly appreciated aspect highlighting the case where we do not quite understand the underlying biology yet still forge ahead blindly throwing the data at an analysis.

Mating Entire Populations

The mate() function takes two individuals and produces N offspring. However, we want to take random sets of individuals and make a new population that is the same size. This will take a few lines of code but can easily be accomplished. I’ll wrap it into a function with some comments to illustrate.

mate_population <- function(pop) {
    N <- dim(pop)[1]  # how many to make
    ret <- pop  # Replace loci in place
    for (i in 1:N) {
        # grab 2 random parent indices (no selfing)
        parents <- sample(1:N, size = 2, replace = FALSE)
        # mate them to make 1 offspring
        off <- mate(pop[parents[1], ], pop[parents[2], ], N = 1)
        # add the genotype back to the population.
        ret$TPI[i] <- off$TPI[1]
    }
    return(ret)
}

We can check to see if this works by iterating across generations and looking at allele frequencies. These should bounce around due to drift but should not be all over the place if N is reasonable. Here is a check on that.

T <- 50
df <- data.frame(Time = 1:T, Freq_A = rep(NA, T))
test_freq <- data.frame(Locus = "TPI", Allele = c("A", "B"), Frequency = c(0.5, 
    0.5))
pop <- make_population(test_freq, N = 100)
for (i in 1:T) {
    f <- frequencies(pop)
    df$Freq_A[i] <- f$Frequency[1]
    pop <- mate_population(pop)
}
ggplot(df, aes(x = Time, y = Freq_A)) + geom_line(size = 1.5, color = "darkred") + 
    theme_bw() + ylim(c(0, 1))

plot of chunk unnamed-chunk-89

Migrating Individuals

This migration model will be rather simple. We will simply replace the genotypes of the first 10 percent of the adults ($m=0.10$ which is rather large). Since each generation is made randomly, the position in the data.frame is somewhat arbitrary. Moreover, this function is useful only for the 1-locus data set we have created but simple examples can be simplistic and still provide some information.

migrate <- function(pop) {
    t <- pop$TPI[1:10]
    pop$TPI[1:10] <- pop$TPI[101:110]
    pop$TPI[101:110] <- t
    return(pop)
}

Plotting Structure in Drift/Migration Simulations

OK, lets put it together and plot the results.

df <- data.frame(Time = 1:50, Gst = rep(0, 50))
for (t in 1:50) {
    
    # find structure
    x <- Gst(pops, stratum = "Population")
    df$Gst[t] <- x$Gst[1]
    
    # migrate
    pops <- migrate(pops)
    
    # mate
    pops[1:100, ] <- mate_population(pops[1:100, ])
    pops[101:200, ] <- mate_population(pops[101:200, ])
    
}
ggplot(df, aes(x = Time, y = Gst)) + geom_line(size = 1.5, color = "darkred") + 
    theme_bw() + ylim(c(-0.1, 0.4))

plot of chunk unnamed-chunk-91

So with just a few lines of code, we can create some simulations that show the change in structure as a function of both drift (small $N$ in these populations) and migration ($m=0.1$). The utility of R is that it opens up a wide range of potential simulation work for you and with gstudio you can work with genetic data just like any other kind of data.

References

Bray, J. R. and J. T. Curtis. 1957. An ordination of upland forest communities of southern Wisconsin. Ecological Monographs, 27, 325-349

Cavalli-Sforza LL and Edwards AWF. 1967. Phylogenetic analysis: models and estimation procedures. American Journal of Human Genetics, 19, 233-257.

Dyer RJ, Westfall RD, Sork VL, Smouse PE. 2004. Two-generation analysis of pollen flow across a landscape V: a stepwise approach for extracting factors contributing to pollen structure. Heredity, 92, 204-211.

Excoffier, L., Smouse, P.E., and Quattro, J.M. 1992. Analysis of molecular variance inferred from metric distances among DNA haplotypes: application to human mitochondrial DNA restriction data. Genetics, 131, 479-491.

Garrick RC, Nason JD, Fernandez-Manjaress JFF, Dyer RJ. In Press. Ecological co-associations influence species’ responses to past climatic change: an example from a Sonoran Desert bark beetle. Molecular Ecology

Nei, M. 1978. Estimation of average heterozygosity and genetic distance from a small number of individuals. Genetics, 76, 379-390.

Smouse PE and Peakall R. 1999. Spatial autocorrelation analysis of multi-allele and multi-locus genetic microstructure. Heredity, 82, 561-573.

Smouse PE, Dyer RJ, Westfall RD, Sork VL. 2001. Two-generation analysis of pollen flow across a landscape I. Male gamete heterogeneity among females. Evolution, 55, 260-271.

Weir BS, Cockerham CC. 1984. Estimating F-Statistics for the Analysis of Population Structure. Evolution, 38, 1358-1370.

Wright, S. 1978. Evolution and the Genetics of Populations, Vol. 4: Variability Within and Among Natural Populations. University of Chicago Press, Chicago.