Package Overview

The gstudio package is a package created to make the inclusion of marker based population genetic data in the R workflow easy. An underlying motivation for this package is to provide a link between spatial analysis and graphing packages such that the user can be quickly and easily manipulate data in exploratory ways that aid in gaining biological inferences.


Installing the package

This package requires several other packages for installation. By default, the install should be easily accomplished using the build-in functionalities in R.

install.packages("gstudio")

Occasionally, you should look to see if there are updates to the package by doing the following (this will update all packages you have installed)

update.packages(ask = FALSE)

If you want the most recent version of this package, I make the development builds available on Github (http://github.com/dyerlab/). You can install directly from within R as:

install.packages(devtools)
require(devtools)
install_github("gstudio", "dyerlab", ref = "develop")

I recommend using the latest version, it has a lot of the newer features and I do not check it into github until it has been tested. I only post to CRAN when major versions change.

Loading the Package

And any time you need to use the package, you would just pull it into your session.

require(gstudio)

This should get you everything you need. The gstudio package does contain a lot of build-in documentation including a lot of examples. All the functions and examples associated with them can be found in the build-in documents available from any of the following (n.b., the vignette is a simple placeholder with links to the website.)

help(package = "gstudio")
vignette("gstudio")
system("open http://dyerlab.github.io/gstudio/")

At any point if you have any questions about the values or options for a particular function in gstudio or any other package, you can use the help.search(func) or ? functionality. This file is also kept in sync with the development of the gstudio package (it is in the source package that you downloaded from CRAN) and will serve as a tutorial for your use of this package. If there are any questions that you may have regarding this package, feel free to contact Rodney Dyer and I will get back to you as soon as possible.


Genetic Data

The overriding philosophy behind the gstudio package is to make it as easy as possible to create, load, use, and integrate, genetic marker data into your analysis workflow. As such, we typically use data.frame objects to hold our data and the addition of the locus class as a fundamental data type allows us to continue to do so.

x <- locus(c(1, 2))
class(x)
## [1] "locus"
x
## [1] "1:2"

You can think of a locus object as a vector of alleles. There are several options you can use when constructing a locus object based upon what kind of marker data you are using. These options are passed through the type option to the locus() function. Here are the current options.

type Option | Function ————- | ——— | This is the default value (e.g., nothing passed). It will treat the values passed to locus as alleles to a single locus snp | Alleles are ‘0’, ‘1’, or ‘2’ indicating the number of minor alleles. zyme | Genotypes are encoded as “12” like zymes (e.g., “1” & “2” alleles together). separated | Alleles are already separated by “:” character (for putting in polyploid data) column | Alleles for a single locus are in two separate columns.

Here is some examples.

loc0 <- locus()
loc1 <- locus(1)
loc2 <- locus(c("C", "A"), phased = TRUE)
loc3 <- locus(c("C", "A"))
loc4 <- locus(12, type = "zyme")
loc5 <- locus(1:4)
loc6 <- locus("A:C:C:G", type = "separated")
loci <- c(loc0, loc1, loc2, loc3, loc4, loc5, loc6)
loci
## [1] NA        "1"       "C:A"     "A:C"     "1:2"     "1:2:3:4" "A:C:C:G"

Notice how the printing of each locus object uses the colon character to separate alleles. Also, since the locus object is a basic data type, it can be used in other data structures.

df <- data.frame(ID = 0:6, Loci = loci)
df
##   ID    Loci
## 1  0        
## 2  1       1
## 3  2     C:A
## 4  3     A:C
## 5  4     1:2
## 6  5 1:2:3:4
## 7  6 A:C:C:G

And you can use normal vector processing of the locus vector to do normal R data like operations.

is.na(df$Loci)
## [1]  TRUE FALSE FALSE FALSE FALSE FALSE FALSE
is_heterozygote(df$Loci)
## [1] FALSE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE

Importing Data

Importing data is not that difficult. People tend to keep their data in either spreadsheets or text files, which are easily accessible via R, or some archane program format (wave to Arlequin everyone), which may be less accessable. As of version 1.1, gstudio handles data both in raw format (e.g., as they appear in your spreadsheet) or in GENEPOP format. All of these formats are accessed in gstudio using the function, read_population() that is a mix between the traditional read.table() function and the locus() functions.

Import from GENEPOP files

The GENEPOP file format, as interpreted by gstudio has the following format.

  1. It is a text file and preferrably space delimited.
  • The first line has some info for you, it is mostly ignored but should include a bit about what the data are.
  • The next \(K\) lines of the file list the names of the loci to be used, one per line.
  • The rest of the file contains populations. Each population starts with “Pop” alone on its own line. All individuals are assumed to be in the same population until such time the next “Pop” is reached.
  • For each individual, there is an identification value first, and followed by a comma are the genotypes.
  • All genotypes are encoded using 3 digits for each allele (e.g., a 3:5 genotype is 003005). Missing data is all zeros ‘000000’ and haploid is just three digits.

The import from read_population() names loci, adds “ID” for the identification column and adds to each population a “Pop-X” designation. Other than that, it is identical to what is below. However, if you have your data in a spreadsheet, there is no need to shove it into a genepop format to import into R.

Text File Import

Options for this function include:

Option Function
path The full path to the text file
type The locus type (see above).
locus.columns A vector of numbers of columns to be treated as Locus() objects
sep The character used to separate columns (‘,’ is default)
header Columns have header names (e.g., locus names, etc.)

Here are some examples of data files with different kinds of genetic data, each of which exercises the read_population() function in a different way. Hopefully this covers the main types of data being imported, if not, drop me an email Rodney Dyer. Missing genotypes should be missing data or encoded as NA. If you do not have a genotype then leave it blank. There is no reason to use negative numbers or other conventions.

Columns of genotypes are indicated by the required parameter locus.columns so that read_population() knows which columns to treat as locus objects and which to leave as normal data for R. Without this parameter, the data will be read in as character or numeric.

There are some example data files included in the project for you to look at. Depending upon how your computer is set up, they may be placed in different locations. Here is a quick way to find out where the installed folder is for the gstudio package and the location of the ‘data’ folder within it.

system.file("extdata", package = "gstudio")
## [1] "/Library/Frameworks/R.framework/Versions/3.1/Resources/library/gstudio/extdata"

Two Column Data

Here is an example of data where the genotypes are encoded as two columns of data in a csv file.

file <- system.file("extdata", "data_2_column.csv", package = "gstudio")
data <- read_population(file, type = "column", locus.columns = 4:7)
data
##   Population Latitude Longitude  Loc1  Loc2
## 1    Olympia    47.15   -122.89 12:14 15:15
## 2 Bellingham    48.75   -122.49 12:12 15:17
## 3  St. Louis    38.81    -89.98 10:12      
## 4       Ames    42.26    -93.47 10:14      
## 5   Richmond    37.74    -77.16 10:10 17:17

Phased Data

There are times when the gametic phase of the genotypes is important. By default, gstudio will keep alleles sorted in alpha/numeric order. If you need to keep this from happening, pass the optional phased option to read_population(). Notice the differences between this and the previous genotypes.

file <- system.file("extdata", "data_2_column.csv", package = "gstudio")
data <- read_population(file, type = "column", locus.columns = 4:7, phased = TRUE)
data
##   Population Latitude Longitude  Loc1  Loc2
## 1    Olympia    47.15   -122.89 14:12 15:15
## 2 Bellingham    48.75   -122.49 12:12 17:15
## 3  St. Louis    38.81    -89.98 12:10      
## 4       Ames    42.26    -93.47 10:14      
## 5   Richmond    37.74    -77.16 10:10 17:17

AFLP-like data

Genotypes that are ‘aflp’-like are encoded as binary characters (e.g., 0/1) indicating the presence or absence of a particular band.

file <- system.file("extdata", "data_aflp.csv", package = "gstudio")
data <- read_population(file, type = "aflp", locus.columns = c(4, 5))
data
##   Population Latitude Longitude Loc1 Loc2
## 1    Olympia    47.15   -122.89    1    1
## 2 Bellingham    48.75   -122.49    0    0
## 3  St. Louis    38.81    -89.98    1    0
## 4       Ames    42.26    -93.47    0    1
## 5   Richmond    37.74    -77.16    0    0

SNP Minor Allele Data

At times, SNP data is encoded in relation to the number of minor alleles. You can import these data using the type=“snp” option and it will encode them as ‘AA’, ‘AB’, or ‘BB’ with the ‘B’ allele as the minor one.

file <- system.file("extdata", "data_snp.csv", package = "gstudio")
data <- read_population(file, type = "snp", locus.columns = 4:7)
data
##   Population Latitude Longitude snp1 snp2 snp3 snp4
## 1    Olympia    47.15   -122.89  A:B  B:B  A:B  A:B
## 2 Bellingham    48.75   -122.49  A:B  A:A  A:B  B:B
## 3  St. Louis    38.81    -89.98  A:B  A:A          
## 4       Ames    42.26    -93.47  A:A  A:B          
## 5   Richmond    37.74    -77.16  A:A  B:B  A:B  B:B

Zyme-Like Data

Some data is encoded as allozyme genotypes (e.g., 33, 35, 55 for diploid individuals with alleles ‘3’ and ‘5’).

file <- system.file("extdata", "data_zymelike.csv", package = "gstudio")
data <- read_population(file, type = "zyme", locus.columns = 4:7)
data
##   Population Latitude Longitude Loc1 Loc2 Loc3 Loc4
## 1    Olympia    47.15   -122.89  1:2  1:4  1:5  1:5
## 2 Bellingham    48.75   -122.49  1:2  1:2  1:5  1:7
## 3  St. Louis    38.81    -89.98  1:2  1:3          
## 4       Ames    42.26    -93.47  1:1  1:4          
## 5   Richmond    37.74    -77.16  1:2  1:1  2:2  2:2

Pre-Separated Data For Higher Ploidy

file <- system.file("extdata", "data_separated.csv", package = "gstudio")
data <- read_population(file, type = "separated", locus.columns = c(4, 5))
data
##   Population Latitude Longitude  Loc1  Loc2
## 1    Olympia    47.15   -122.89 12:14 15:15
## 2 Bellingham    48.75   -122.49 12:12 15:17
## 3  St. Louis    38.81    -89.98 12:10      
## 4       Ames    42.26    -93.47 10:14      
## 5   Richmond    37.74    -77.16 10:10 17:17

Saving Data

There are several ways to export your data to file.

Raw R objects

Saving data once it is in R is trivial and you do it as you would for any other R object. The R object system knows how to serialize its own data using the load() and save() functions.

save(df, file = "MyData.rda")

To load the objects back into the work space, you just do:

load("MyData.rda")

And you can verify that you have data in your work space by listing it.

ls()
##  [1] "data"      "df"        "file"      "loc0"      "loc1"     
##  [6] "loc2"      "loc3"      "loc4"      "loc5"      "loc6"     
## [11] "loci"      "metadata"  "tidy.opts" "x"

Saving as Text

As a default, the function write_population() will write your data file as a comma separated text file with the loci encoded as column separated (see type="separated" above).

write_population(df, file = "~/Desktop/MyData.csv")

Saving as GENEPOP

Raw data can be saved in GENEPOP formats by passing an optional argument mode="genepop" to the write_population() function.

write_population(df, file = "~/Desktop/MyData.txt", mode = "genepop")

Saving for STRUCTURE

Raw data can also be exported for analysis using the program STRUCTURE. Here the optional argument is type="structure"

write_population(df, file = "~/Desktop/MyData.str", mode = "structure")

Data sets in the Package

The main genetic data included with the package if from the Sonoran desert bark beetle, Araptus attenuatus from the Dyer laboratory. You can load it into your work space by:

data(arapat)
summary(arapat)
##       Species      Cluster      Population        ID         Latitude   
##  Cape     : 75   CBP-C :150   32     : 19   101_10A:  1   Min.   :23.1  
##  Mainland : 36   NBP-C : 84   75     : 11   101_1A :  1   1st Qu.:24.6  
##  Peninsula:252   SBP-C : 18   Const  : 11   101_2A :  1   Median :26.2  
##                  SCBP-A: 75   12     : 10   101_3A :  1   Mean   :26.3  
##                  SON-B : 36   153    : 10   101_4A :  1   3rd Qu.:27.5  
##                               157    : 10   101_5A :  1   Max.   :29.3  
##                               (Other):292   (Other):357                 
##    Longitude        LTRS          WNT            EN           EF     
##  Min.   :-114   01:01 :147   03:03  :108   01:01  :225   01:01 :219  
##  1st Qu.:-113   01:02 : 86   01:01  : 82   01:02  : 52   01:02 : 52  
##  Median :-112   02:02 :130   01:03  : 77   02:02  : 38   02:02 : 90  
##  Mean   :-112                02:02  : 62   03:03  : 22   NA's  :  2  
##  3rd Qu.:-110                03:04  :  8   01:03  :  7               
##  Max.   :-109                (Other): 15   (Other): 16               
##                              NA's   : 11   NA's   :  3               
##      ZMP           AML           ATPS          MP20    
##  01:01 : 46   08:08  : 51   05:05  :155   05:07  : 64  
##  01:02 : 51   07:07  : 42   03:03  : 69   07:07  : 53  
##  02:02 :233   07:08  : 42   09:09  : 66   18:18  : 52  
##  NA's  : 33   04:04  : 41   02:02  : 30   05:05  : 48  
##               07:09  : 22   07:09  : 14   05:06  : 22  
##               (Other):142   08:08  :  9   (Other):119  
##               NA's   : 23   (Other): 20   NA's   :  5

You can see several things here. First, the locus data are displayed by genotype counts including NA values where there was missing data. A column of type locus is just like any other kind of variable and can be used as such. This opens up a lot of functionality for you to be able to treat marker data just like everything else in R.

Convenience Functions

Dealing easily with parts of your data is a critical skill and a huge benefit in using a grammar like R to do your analyses. In R, data.frame objects are almost like little databases and you can do some really creative manipulations with them. The gstudio package provides a few things that may help you work more efficiently with your data.

Data Classes

Often it is important to know which columns of a data set are actually of a particular data type. Here is a simple function that tells you either the name or index of columns in a data.frame that of a specific data type.

column_class(arapat, class = "locus")
## [1] "LTRS" "WNT"  "EN"   "EF"   "ZMP"  "AML"  "ATPS" "MP20"
column_class(arapat, class = "locus", mode = "index")
## [1]  7  8  9 10 11 12 13 14

Partitioning

The partition() function takes a data frame and returns a list of data frames, partitioned by the stratum you pass to it. This is really nice if you are doing a nested analysis of sorts and want to work with subsets of your data that are defined by a categorical factor variable.

names(arapat)
##  [1] "Species"    "Cluster"    "Population" "ID"         "Latitude"  
##  [6] "Longitude"  "LTRS"       "WNT"        "EN"         "EF"        
## [11] "ZMP"        "AML"        "ATPS"       "MP20"
clades <- partition(arapat, stratum = "Species")
names(clades)
## [1] "Cape"      "Mainland"  "Peninsula"

This kind of partitioning is very common in the analysis of spatial genetic structure and as such should be as simple as possible to provide the most flexibility to you, the analyst. One of the common analysis patterns that you will come back to over and over again is to partition the entire data set and perform operations on each of the subgroups. In R this is a pretty easy process if you look into the lapply() function (and its relatives). This is such an important component, that I’m going to spend a little time here to make sure you understand what I am doing. Once you get it, it will make you life tremendously more awesome!

The basic form of the various ‘apply’ functions is that you pass them some data and a function on which it will take each part of that data and apply it. For lists (the l part in lapply()), the function will take each entry in the list and pass it along to the function. The function itself can be one that is already available (like length() or is.na() or something) or it can be something you specify directly, on the fly. Here is an example looking at the number of samples in each ‘Species’ as partitioned above.

lapply(clades, dim)
## $Cape
## [1] 75 14
## 
## $Mainland
## [1] 36 14
## 
## $Peninsula
## [1] 252  14

Here the dim() function returns the dimension of the data.frame in each clade, all have the same number of columns, but differ in the number of individuals (the rows). While this is a stupid example (you could get the same thing from dim(arapat$Species) but it shows the general pattern.

Where these functions come in handy is in creating your own functions. These can be small functions that do something specific and can be shoved into the lapply() function directly. This gives a lot of leeway in what you can do in R and is a very powerful way of understanding data. Here is an example, say we want to know the number of individuals of each “Species” found in each population (some are in sympatry in the Cape Region of Baja California). The general approach will be:

  1. Partition on population
pops <- partition(arapat, stratum = "Population")
counts <- lapply(pops, function(x) {
    return(table(x$Species))
})
m <- matrix(unlist(counts), ncol = 3, byrow = TRUE)
rownames(m) <- names(pops)
colnames(m) <- levels(arapat$Species)
m[1:10, ]
##     Cape Mainland Peninsula
## 101    0        9         0
## 102    0        8         0
## 12     0        0        10
## 153    0        0        10
## 156    6        0         0
## 157    8        0         2
## 159    0        0         9
## 160    0        0        10
## 161    0        0        10
## 162    0        0        10

(only the first 10 rows are displayed). You will see this kind of use of the lapply() throughout the rest of this document.


Plotting Populations

One of the key benefits to using an analysis environment such as R is that you can mash together functionality that you just can’t get from a monolithic program. An example of this is plotting populations. If your data has spatial coordinates in them then you can use this to plot the location of your sites on a GoogleMap tile. By default, you must have your coordinates in decimal degree format. This is the default for the GoogleMaps API. Moreover, if you name the columns of data “Longitude” and “Latitude” much of the spatial functionality in gstudio will be more transparent (if not, you have to specify the Longitude and Latitude names each time you use a function that needs them).

require(ggplot2)
plot(arapat)