2 Data Types
The data we work with comes in many forms: integers, strata, categories, genotypes, and more, all of which we need to be able to handle in our analyses. In this chapter, we introduce the basic data types we will commonly use in population genetic analyses: numeric, character, factor, and logical types. We will also introduce the locus object from the gstudio library and see how it is just another data type that we can manipulate in R.
The very first hurdle you need to get over is the oddness in the way in which R assigns values to variables.
variable <- value
Yes, that is a less-than character followed by a dash. This is the assignment operator that has historically been used in R, and it is the one I will stick with. In some cases you can use '=' to assign variables instead, but that takes away some of the R-ness of R itself. For decision making, the equality operator (e.g., is this equal to that?) is the double equals sign '=='. We will get into that below when we talk about logical types, and later in decision making.
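To make the distinction concrete, here is a minimal example of the two operators side by side:
x <- 3    # assignment: x now holds the value 3
x == 3    # comparison: is the value in x equal to 3?
## [1] TRUE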
If you are unaware of what type a particular variable may be, you can always use the class() function and R will tell you.
class( variable )
R also has a pretty good help system built into itself. You can get help for any function by typing a question mark in front of the function name. This is a particularly awesome feature because at the end of the help file there are often examples of its usage, which are priceless. Here is the documentation for the 'help' function as given by:
?help
There are also package vignettes available (for most packages you download) that provide additional information on the routines, data sets, and other items included in these packages. You can get a list of vignettes currently installed on your machine by:
vignette()
and vignettes for a particular package by passing the package name as an argument to the function itself.
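For example, to list the vignettes bundled with a particular package (gstudio is used here for illustration, assuming you have it installed):
vignette(package = "gstudio")
A specific vignette can then be opened by passing its name as the first argument to vignette().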
2.1 Numeric Data Types
The quantitative measurements we make are often numeric, in that they can be represented as a number with a decimal component (think weight, height, latitude, soil moisture, ear wax viscosity, etc.). The most basic type of data in R is the numeric type, which represents both integers and floating point numbers (n.b., there is a strict integer data type, but it is typically only needed when interfacing with C libraries and can, for our purposes, be disregarded).
Assigning a value to a variable is easy
x <- 3
x
## [1] 3
By default, R prints whole numbers without a decimal component and displays values with fractional parts appropriately.
y <- 22/7
y
## [1] 3.142857
If there is a mix of whole numbers and numbers with decimals together in a container such as
c(x,y)
## [1] 3.000000 3.142857
then both are shown with decimals. The c() part here is a function that combines several data objects together into a vector and is very useful. In fact, the use of vectors is central to working in R, and almost all the functions we use on individual variables can also be applied to vectors.
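As a quick demonstration of that vectorization, a function applied to a vector operates on every element at once:
v <- c(2, 4, 8, 16)
sqrt(v)    # the square root of each element in one call
## [1] 1.414214 2.000000 2.828427 4.000000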
A word of caution is in order regarding numeric data types on any computer. Consider the following example.
x <- .3 / 3
x
## [1] 0.1
which is exactly what we'd expect. However, the way in which computers store decimal numbers plays tricks on our notion of significant digits. Look what happens when I print out x but expand the number of decimal places shown.
print(x, digits=20)
## [1] 0.099999999999999991673
Not quite 0.1, is it? Not that far away from it, but not exact. That is a general problem, not one that R has any more claim to than any other language and/or implementation. Does this matter much? Perhaps not in the realm of the kinds of things we do in ecology and evolutionary studies, but it is important that you understand some of the limitations of the systems we are working in.
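A practical consequence is that you should avoid testing floating point values for exact equality with ==; base R's all.equal() function compares values within a small tolerance instead. A minimal sketch:
x <- 0.3 / 3
x == 0.1                      # exact comparison fails due to floating point storage
## [1] FALSE
isTRUE( all.equal(x, 0.1) )   # comparison within a tolerance succeeds
## [1] TRUE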
2.2 Distributions
A lot of the inferences we can glean from our data are based upon distributional assumptions, though we will examine some instances where they are not. R has a wealth of built-in functions specifying different distributions. In general, for each distribution, we have a function that allows us to generate random draws from that distribution as well as to inquire as to the probability (either the density at a value or the cumulative probability) of a particular value.
You can make random sets of numeric data by using functions describing various distributions. For example, some random numbers from the normal distribution are:
rnorm(10)
## [1] -1.7896164 1.4006612 -0.9579514 -0.7842230 0.9705389 -0.1523524
## [7] -2.5456662 -0.8703008 -1.1223878 -1.1709065
The normal distribution has a probability density function of:
\[ \begin{equation} P(x) = \frac{1}{\sigma\sqrt{2\pi}}e^{\frac{-(x-\mu)^2}{2\sigma^2}} \end{equation} \] where the parameters \(\mu\) and \(\sigma\) are the mean and standard deviation (or the variance if you consider \(\sigma^2\)). The normal distribution is foundational to a wide range of statistical approaches. In many cases, a collection of data, once we subtract the mean and divide by the standard deviation, follows (at least to a close approximation) the Normal distribution, which is why it is so useful. We know quite a bit about this distribution, such as:
- Within \(\mu \pm \sigma\) lie ~68% of the observations, whereas
- within \(\mu \pm 2\sigma\) lie ~95%, and
- within \(\mu \pm 3\sigma\) lie ~99.7% of all the data.
This is some pretty exciting and useful information, amiright?
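In fact, you can check these percentages yourself using pnorm(), the cumulative distribution function for the normal:
pnorm(1) - pnorm(-1)    # fraction within one standard deviation
## [1] 0.6826895
pnorm(2) - pnorm(-2)    # within two
## [1] 0.9544997
pnorm(3) - pnorm(-3)    # within three
## [1] 0.9973002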
By default, the rnorm() function uses a mean of zero and a standard deviation of one. If you wish, you can specify alternatives for this as:
rnorm(10,mean=42,sd=12)
## [1] 67.70969 41.32254 38.22560 36.04365 45.90079 55.46638 34.60167
## [8] 30.40747 46.60406 46.55458
Here is an interactive widget to play around with the normal distribution to get a feel for the shape of data as defined by measures of the center (mean) and distribution (the standard deviation). Also, note the extent to which the number of samples drawn from this distribution determine our ability to sucessfully describe it. This is the key point to statistical inferences. If we had an infinite number of data points, we would not need statistics. It is only because we are taking a small set of observations and trying to derive biologically relevant inferences from them that we need to use statistical approaches.
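If you are reading a static copy and the widget is unavailable, the following sketch gives the same feel: draw progressively larger samples from a normal distribution with mean 42 and standard deviation 12 and watch the estimates converge on the true values (your numbers will differ since the draws are random).
for( n in c(10, 100, 1000, 10000) ) {
  x <- rnorm( n, mean=42, sd=12 )
  cat( n, "samples: mean =", mean(x), ", sd =", sd(x), "\n" )
}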
Here is another distribution, the Poisson distribution¹. This distribution is defined as:
\[ \begin{equation} P_\lambda(k) = \frac{\lambda^k e^{-\lambda}}{k!} \end{equation} \]
and is parameterized by \(\lambda\), the mean of the distribution, where \(k\) is the number of events.
A Poisson distribution with mean 2:
rpois(10,lambda = 2)
## [1] 0 5 1 1 1 2 1 3 2 2
and the \(\chi^2\) distribution with 1 degree of freedom:
rchisq(10, df=1)
## [1] 5.06441955 0.27117703 2.13216525 3.89092790 0.06256183 0.58786188
## [7] 1.30217454 0.88949954 3.40893097 1.53351970
Several more distributions are available should you need random numbers, quantiles, probability densities, or cumulative density values from them.
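These functions all follow a common naming scheme: the prefix r gives random draws, d the probability density, p the cumulative probability, and q the quantile, each attached to an abbreviated distribution name (norm, pois, chisq, etc.). For the standard normal distribution:
dnorm(0)       # density at x = 0
## [1] 0.3989423
pnorm(1.96)    # cumulative probability below 1.96
## [1] 0.9750021
qnorm(0.975)   # quantile below which 97.5% of the distribution lies
## [1] 1.959964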
2.2.1 Coercion to Numeric
All data types have the potential to take another variable and coerce it into their type. Some combinations make sense, and some do not. For example, if you load in a CSV data file using read_csv() and at some point a stray non-numeric character was inserted into one of the cells on your spreadsheet, R will interpret the entire column as a character type rather than as a numeric type. This can be a very frustrating thing; spreadsheets should generally be considered evil as they do all kinds of stuff behind the scenes and make your life less awesome.
Here is an example of coercion of some data that is initially defined as a set of characters
x <- c("42","99")
x
## [1] "42" "99"
and is coerced into a numeric type using the as.numeric() function.
y <- as.numeric( x )
y
## [1] 42 99
It is a built-in feature of the data types in R that they all have (or should have, if someone producing a new data type is being courteous to their users) an as.X() function. This is where the data type decides if the values it is being asked to coerce are reasonable or if you need to be reminded that what you are asking is not possible. Here is an example where I try to coerce a non-numeric variable into a number.
x <- "The night is dark and full of terrors..."
as.numeric( x )
## [1] NA
By default, the result should be NA (missing data / not available) if you ask for things that are not possible.
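Because failed coercions come back as NA, you can use the is.na() function to locate the elements that did not convert; here is a small sketch:
x <- c("42", "99", "not a number")
y <- as.numeric( x )   # R warns that NAs were introduced by coercion
y
## [1] 42 99 NA
is.na( y )             # flags the element that failed to convert
## [1] FALSE FALSE  TRUE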
2.3 Characters
A collection of letters, numbers, and/or punctuation is represented as a character data type. These are enclosed in either single or double quotes and are considered a single entity. For example, my name can be represented as:
prof <- "Rodney J. Dyer"
prof
## [1] "Rodney J. Dyer"
In R, character variables are considered to be a single entity; that is, the entire prof variable is a single unit, not a collection of characters. This is in part due to the way in which vectors of variables are constructed in the language. For example, if you look at the length of the variable I assigned my name to, you see
length(prof)
## [1] 1
which shows that there is only one 'character' variable. If, as is often the case, you are interested in knowing how many characters are in the variable prof, then you use the nchar() function instead.
nchar(prof)
## [1] 14
This returns the number of characters (even the non-printing ones like tabs and spaces).
nchar(" \t ")
## [1] 3
As with all other data types, you can define a vector of character values using the c() function.
x <- "I am"
y <- "not"
z <- 'a loser'
terms <- c(x,y,z)
terms
## [1] "I am" "not" "a looser"
And looking at the length() and nchar() of this, you can see how these operations differ.
length(terms)
## [1] 3
nchar(terms)
## [1] 4 3 7
2.3.1 Concatenation of Characters
Another common use of characters is concatenating them into single sequences. Here we use the function paste() and can set the separators (the characters that are inserted between entities when we collapse vectors). Here is an example, entirely fictional and provided for instructional purposes only.
paste(terms, collapse=" ")
## [1] "I am not a looser"
paste(x,z)
## [1] "I am a looser"
paste(x,z,sep=" not ")
## [1] "I am not a looser"
2.3.2 Coercion to Characters
A character data type is often the most basal type of data you can work with. For example, consider the case where you have named sample locations. These can be kept as a character data type or as a factor (see below). There are benefits and drawbacks to each representation of the same data. By default (as of the version of R I am using while writing this book), if you use a function like read.table() to load in an external file, columns of character data will be treated as factors. This can be good behavior if all you are doing is loading in data and running an analysis, or it can be a total pain in the backside if you are doing more manipulative analyses.
Here is an example of coercing a numeric type into a character type using the as.character() function.
x <- 42
x
## [1] 42
y <- as.character(x)
y
## [1] "42"
2.4 Factors
A factor is a categorical data type. If you are coming from SAS, these are class variables. If you are not, then perhaps you can think of them as mutually exclusive classifications. For example, a sample may be assigned to one particular locale, one particular region, and one particular species. Across all the data you may have several species, regions, and locales. These are finite, defined sets of categories. One of the more common headaches encountered by people new to R is working with factor types and trying to add categories that are not already defined.
Since factors are categorical, it is in your best interest to label them in as descriptive a fashion as possible. You are not saving space or cutting down on computational time by taking shortcuts and labeling the locale for Rancho Santa Maria as RSN or pop3d or 5. Our computers are fast and large enough, and our programmers are clever enough, that we should not have to rename our populations in numeric format to make them work (hello STRUCTURE, I'm calling you out here). The only thing you have to lose by adopting a reasonable naming scheme is confusion in your output.
To define a factor type, you use the factor() function and pass it a vector of values.
region <- c("North","North","South","East","East","South","West","West","West")
region <- factor( region )
region
## [1] North North South East East South West West West
## Levels: East North South West
When you print out the values, it shows you all the levels present for the factor. If you have levels that are not present in your data set when you define it, you can tell R to consider additional levels of this factor by passing the optional levels= argument as:
region <- factor( region, levels=c("North","South","East","West","Central"))
region
## [1] North North South East East South West West West
## Levels: North South East West Central
If you try to assign a value to a factor that is not among its defined levels, R will complain (or 'barf' as I like to say).
region[1] <- "Bob"
Now, I have to admit that the warning message in its entirety, with its [<-.factor(*tmp*, 1, value = "Bob") part, is perhaps not the most informative. Agreed. However, the "invalid factor level" does tell you something useful. Unfortunately, the programmers who put the error handling system into R did not quite adhere to the spirit of the "fail loudly" mantra. It is something you will have to get good at. Google is your friend, and if you post a question to (http://stackoverflow.org) or the R user list without doing serious homework, put on your asbestos shorts!
Unfortunately, the assignment above changed the first element of the region vector to NA (missing data). I'll turn it back before we move on.
region[1] <- "North"
Factors in R can be either unordered (as locale may be, since locale A is not >, =, or < locale B) or ordered categories, as in Small < Medium < Large < X-Large. When you create the factor, you need to indicate if it is an ordered type (by default it is not). If the factors are ordered in some way, you can also create an ordination on the data. If you do not pass a levels= option to the factor() function, it will take the order in which the values occur in the data you pass to it. If you want to specify an order for the factors explicitly, pass the optional levels= argument and they will be ordinated in the order given there.
region <- factor( region, ordered=TRUE, levels = c("West", "North", "South", "East") )
region
## [1] North North South East East South West West West
## Levels: West < North < South < East
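Once the factor is ordered, the comparison operators actually work on it. For example, we can ask which observations fall above North in this ordination:
region > "North"
## [1] FALSE FALSE  TRUE  TRUE  TRUE  TRUE FALSE FALSE FALSE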
2.4.1 Missing Levels in Factors
There are times when you have a subset of data that does not contain all the potential categories.
subregion <- region[ 3:9 ]
subregion
## [1] South East East South West West West
## Levels: West < North < South < East
table( subregion )
## subregion
##  West North South  East 
##     3     0     2     2
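If those empty categories are not wanted, base R's droplevels() function removes the levels that are no longer represented in the data:
droplevels( subregion )
## [1] South East East South West West West
## Levels: West < South < East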
2.5 Logical Types
A logical type is either TRUE or FALSE; there is no in-between. It is common to use these types in making decisions (see if-else decisions) to check that a specific condition is satisfied. To define logical variables, you can either use the TRUE or FALSE values directly
canThrow <- c(FALSE, TRUE, FALSE, FALSE, FALSE)
canThrow
## [1] FALSE TRUE FALSE FALSE FALSE
or can implement some logical condition
stable <- c( "RGIII" == 0, nchar("Marshawn") == 8)
stable
## [1] FALSE TRUE
on the variables. Notice here how each of the items is actually evaluated to determine the truth of each expression. In the first case, the character is not equal to zero, and in the second, the number of characters (what nchar() does) is indeed equal to 8 for the character string "Marshawn".
It is common to use logical types to serve as indices for vectors. Say, for example, you have a vector of data from which you want to select some subset.
data <- rnorm(20)
data
## [1] -1.47984467 0.38045622 0.90262058 0.32499509 0.99251601
## [6] 0.83013551 -1.08718569 0.16885998 -0.33847444 -1.07070526
## [11] 0.96843020 -0.28781946 0.54478562 1.17073131 -0.98978650
## [16] 0.07850148 -2.04889771 -1.85031073 -0.18834480 1.11492255
Perhaps you are only interested in the positive values
data[ data > 0 ]
## [1] 0.38045622 0.90262058 0.32499509 0.99251601 0.83013551 0.16885998
## [7] 0.96843020 0.54478562 1.17073131 0.07850148 1.11492255
If you look at the condition being passed as the index
data > 0
## [1] FALSE TRUE TRUE TRUE TRUE TRUE FALSE TRUE FALSE FALSE TRUE
## [12] FALSE TRUE TRUE FALSE TRUE FALSE FALSE FALSE TRUE
you see that each value in the data vector is individually evaluated as a logical value, satisfying the condition that it is strictly greater than zero. When you pass that as indices to a vector, it only returns the entries where the index is TRUE.
You can coerce a value into a logical if you understand the rules. Numeric types that equal 0 (zero) are always FALSE. Any non-zero value is considered TRUE. Here I use the modulus operator, %%, which provides the remainder of a division.
1:20 %% 2
## [1] 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0
which, used as indices, gives us
data[ (1:20 %% 2) > 0 ]
## [1] -1.4798447 0.9026206 0.9925160 -1.0871857 -0.3384744 0.9684302
## [7] 0.5447856 -0.9897865 -2.0488977 -0.1883448
You can get as complicated in the creation of indices as you like, even combining conditions with the logical AND (&) and OR (|) operators. Here is a small example to get you started playing with them.
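A minimal sketch using the data vector from above (& requires both conditions to hold; | requires either):
data[ data > -1 & data < 1 ]    # values lying strictly between -1 and 1
data[ data < -1 | data > 1 ]    # values falling in either tail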
1. Fishy, non?