4 Data Types

The data we work with comes in many forms (integers, strata, categories, genotypes, etc.), all of which we need to be able to work with in our analyses. This chapter covers the basic data types we will commonly use in population genetic analyses in R. These include the numeric, character, factor, and logical data types. We will also introduce the locus object from the gstudio library and see how it is just another data type that we can manipulate in R.

The very first hurdle you need to get over is the oddness in the way in which R assigns values to variables.

variable <- value

Yes, that is a less-than character followed by a dash. This is the assignment operator that has historically been used, and it is the one that I will stick with. In some cases you can use ‘=’ to assign variables instead, but then it takes away the R-ness of R itself. For decision making, the equality operator (e.g., is this equal to that) is the double equals sign ‘==’. We will get into that below where we talk about logical types and later in decision making.
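To make the difference between assignment and the equality test concrete, here is a minimal sketch. The variable names n_alleles, is_four, and n_loci are made up for illustration.

```r
# Assignment: store the value 4 in a (hypothetical) variable
n_alleles <- 4

# Equality test: asks a question and returns a logical value
is_four <- n_alleles == 4
is_four        # TRUE

# Assignment with '=' also works here, but '<-' is the convention in this text
n_loci = 10
n_loci         # 10
```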

If you are unsure what type a particular variable is, you can always use the class() function and R will tell you.

class( variable )
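As a quick sketch of what class() reports for the types discussed in this chapter, the following collects the answers for a few literal values (the values themselves are arbitrary):

```r
# Ask R for the class of one value of each basic type
types <- c(
  class(3),            # a number
  class("three"),      # a character string
  class(TRUE),         # a logical value
  class(factor("a"))   # a factor
)
types   # "numeric" "character" "logical" "factor"
```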

R also has a pretty good built-in help system. You can get help for any function by typing a question mark in front of the function name. This is a particularly awesome feature because at the end of the help file there are often examples of its usage, which are priceless. Here is how you get the documentation for the ‘help’ function itself:

?help

There are also package vignettes available (for most packages you download) that provide additional information on the routines, data sets, and other items included in these packages. You can get a list of vignettes currently installed on your machine by:

vignette()

and the vignettes for a particular package by passing the package name to the function through its package argument, for example vignette(package = "gstudio").

Numeric Data Types

The quantitative measurements we make are often numeric, in that they can be represented as a number with a decimal component (think weight, height, latitude, soil moisture, ear wax viscosity, etc.). The most basic type of data in R is the numeric type, which represents both integers and floating point numbers (n.b., there is a strict integer data type, but it is mostly needed when interfacing with C libraries and can be disregarded for what we are doing).

Assigning a value to a variable is easy

x <- 3
x

[1] 3

By default, R prints whole numbers without a decimal component and shows decimal places only when they are needed.

y <- 22/7
y

[1] 3.142857

If there is a mix of whole numbers and numbers with decimals together in a container such as

c(x,y)

[1] 3.000000 3.142857

then both are shown with decimals. The c() part here is a function that combines several data objects together into a vector and is very useful. In fact, the use of vectors is central to working in R, and almost all the functions we use on individual variables can also be applied to vectors.
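A short sketch of what "applied to vectors" means in practice, reusing the x and y values from above. The name v is just a label chosen for this example.

```r
x <- 3
y <- 22/7
v <- c(x, y)

# Arithmetic is applied to every element of the vector
doubled <- v * 2
doubled

# So are most functions
roots <- sqrt(v)
rounded <- round(v, digits = 2)
rounded

# Summary functions collapse the vector to a single value
mean(v)
length(v)   # 2
```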

A word of caution should be made about numeric data types on any computer. Consider the following example.

x <- .3 / 3
x

[1] 0.1

which is exactly what we’d expect. However, the way in which computers store decimal numbers is only approximate, and the default printing hides this by showing a limited number of significant digits. Look what happens when I print out x but ask for more decimal places.

print(x, digits=20)

[1] 0.099999999999999991673

Not quite 0.1, is it? Not that far away from it, but not exact. That is a general problem, not one that R has any more claim to than any other language and/or implementation. Does this matter much? Probably not in the realm of the kinds of things we do in population genetics; it is just something that you should be aware of.

You can make random sets of numeric data by using functions describing various distributions. For example, some random numbers from the normal distribution are:

rnorm(10)

 [1]  1.1986033  0.9622868  0.4817572  1.1840362  0.3075965 -0.6129430
 [7] -0.8376870 -0.3147793  1.3616952  0.7582906

or from a normal distribution with a designated mean and standard deviation:

rnorm(10,mean=42,sd=12)

 [1] 36.42172 33.83305 46.55612 63.00495 61.23703 35.63111 36.85864 29.60351
 [9] 34.82064 28.65490

A Poisson distribution with mean 2:

rpois(10,lambda = 2)

 [1] 0 3 1 2 2 2 2 5 2 6

and the \(\chi^2\) distribution with 1 degree of freedom:

rchisq(10, df=1)

 [1] 0.47745273 0.10883143 0.21057793 0.04400604 1.98712937 0.20192578
 [7] 0.42243174 6.00655602 3.01265467 0.52868058

There are several more distributions available, and for each of them you can access random numbers, quantiles, probability densities, and cumulative probabilities.
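As a sketch of how these pieces fit together for one distribution, base R uses a common prefix convention: r for random draws, d for densities, p for cumulative probabilities, and q for quantiles. The example below uses the standard normal; the seed value 42 is an arbitrary choice made only so the draws can be repeated.

```r
# Density of the standard normal at its mean
d0 <- dnorm(0)
d0                   # about 0.399

# Cumulative probability P(X <= 1.96)
p <- pnorm(1.96)
p                    # about 0.975

# Quantile: the value below which 97.5% of the distribution lies
q <- qnorm(0.975)
q                    # about 1.96

# Random draws; set.seed() makes the "random" values repeatable
set.seed(42)         # arbitrary seed, assumed for this example only
a <- rnorm(5)
set.seed(42)
b <- rnorm(5)
identical(a, b)      # TRUE
```

The same prefixes apply to the other distributions shown above, e.g., dpois(), ppois(), and qpois() alongside rpois().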

Coercion to Numeric

All data types have the potential ability to take another variable and coerce it into their type. Some combinations make sense, and some do not. For example, if you load in a CSV data file using read_csv(), and at some point a stray non-numeric character was inserted into one of the cells on your spreadsheet, R will interpret the entire column as a character type rather than as a numeric type. This can be a very frustrating thing. Spreadsheets should generally be considered evil, as they do all kinds of stuff behind the scenes and make your life less awesome.

Here is an example of coercion of some data that is initially defined as a set of characters

x <- c("42","99")
x

[1] "42" "99"

and is coerced into a numeric type using the as.numeric() function.

y <- as.numeric( x )
y

[1] 42 99

It is a built-in feature of the data types in R that they all have (or should have if someone is producing a new data type and is being courteous to their users) an as.X() function. This is where the data type decides if the values asked to be coerced are reasonable or if you need to be reminded that what you are asking is not possible. Here is an example where I try to coerce a non-numeric variable into a number.

x <- "The night is dark and full of terrors..."
as.numeric( x )

Warning: NAs introduced by coercion

[1] NA

By default, the result is NA (R's marker for a missing or "not available" value) if you ask for a coercion that is not possible.
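This behavior is handy for tracking down the stray value in a column that refuses to be numeric. The following is a minimal sketch on a made-up vector (raw, vals, and bad are hypothetical names), where one entry contains a letter o instead of a zero.

```r
# Hypothetical column with one typo ("4o2" has a letter o, not a zero)
raw <- c("42", "99", "4o2", "17")

# suppressWarnings() silences the coercion warning we already expect
vals <- suppressWarnings(as.numeric(raw))
vals

# is.na() flags the entries that could not be converted
bad <- is.na(vals)
raw[bad]      # "4o2"
which(bad)    # 3
```

Note that an entry that was already missing in the raw data would be flagged in the same way.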

Character Data

A collection of letters, numbers, and/or punctuation is represented as a character data type. These are enclosed in either single or double quotes and are considered a single entity. For example, my name can be represented as:

prof <- "Rodney J. Dyer"
prof

[1] "Rodney J. Dyer"

In R, character variables are considered to be a single entity; that is, the entire prof variable is a single unit, not a collection of characters. This is in part due to the way in which vectors of variables are constructed in the language. For example, if you look at the length of the variable I assigned my name to, you see

length(prof)

[1] 1

which shows that there is only one ‘character’ variable. If, as is often the case, you are interested in knowing how many characters are in the variable prof, then you use the

nchar(prof)

[1] 14

function instead. This returns the number of characters (even the non-printing ones like tabs and spaces).

nchar(" \t ")

[1] 3
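If you do need to get at the individual characters inside a string, base R provides functions for that. Here is a brief sketch using substr() and strsplit(); the names first and letters_in_name are chosen only for this example.

```r
prof <- "Rodney J. Dyer"

# Extract characters 1 through 6
first <- substr(prof, 1, 6)
first                       # "Rodney"

# Split into a vector of single characters; strsplit() returns a list,
# so [[1]] pulls out the vector for our one string
letters_in_name <- strsplit(prof, "")[[1]]
length(letters_in_name)     # 14, the same as nchar(prof)
```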

As with all other data types, you can define a vector of character values using the c() function.

x <- "I am"
y <- "not"
z <- 'a looser'
terms <- c(x,y,z)
terms

[1] "I am"     "not"      "a looser"

And looking at the length() and nchar() of this you can see how these operations differ.

length(terms)

[1] 3

nchar(terms)

[1] 4 3 8

Concatenation of Characters

Another common use of characters is concatenating them into single sequences. Here we use the function paste() and can set the separators (the characters that are inserted between entities when we collapse vectors). Here is an example, entirely fictional and provided for instructional purposes only.

paste(terms, collapse=" ")

[1] "I am not a looser"

paste(x,z)

[1] "I am a looser"

paste(x,z,sep=" not ")

[1] "I am not a looser"
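The sep and collapse arguments do different jobs, which is easiest to see when one of the inputs is a vector. The sketch below builds some made-up labels (pops, loci, and one_string are hypothetical names) and also shows paste0(), which behaves like paste() with an empty separator.

```r
# paste() is vectorized: the single string "Pop" is recycled for each number
pops <- paste("Pop", 1:3, sep = "-")
pops          # "Pop-1" "Pop-2" "Pop-3"

# paste0() is shorthand for sep = ""
loci <- paste0("Locus", 1:3)
loci          # "Locus1" "Locus2" "Locus3"

# sep joins the arguments element by element; collapse then joins the results
one_string <- paste("Pop", 1:3, sep = "-", collapse = ", ")
one_string    # "Pop-1, Pop-2, Pop-3"
```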

Coercion to Characters

A character data type is often the most basic type of data you can work with. For example, consider the case where you have named sample locations. These can be kept as a character data type or as a factor, and there are benefits and drawbacks to each representation of the same data (see below). Be aware that what you get by default depends on the function and the version of R you are using: in versions of R before 4.0.0, base functions like read.table() treated columns of character data as factors, whereas R 4.0.0 and later (and readr functions like read_table()) keep them as characters. Automatic conversion to factors can be good behavior if all you are doing is loading in data and running an analysis, or it can be a total pain in the backside if you are doing more manipulative analyses.

Here is an example of coercing a numeric type into a character type using the as.character() function.

x <- 42
x

[1] 42

y <- as.character(x)
y

[1] "42"
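The same function works on whole vectors, and the coercion can be reversed. This is a small sketch with arbitrary values (nums, chars, and back are names made up for the example).

```r
nums <- c(1.5, 2, 42)

# Coerce every element to character
chars <- as.character(nums)
chars                # "1.5" "2" "42"

# Arithmetic on character data is an error, so coerce back first
back <- as.numeric(chars)
back + 1             # 2.5 3.0 43.0

# Logical values can be coerced to character as well
as.character(TRUE)   # "TRUE"
```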

Logical Types

A logical type is either TRUE or FALSE; there is no in-between. It is common to use these types in making decisions (see if-else decisions) to check whether a specific condition is satisfied. To define logical variables you can either use TRUE or FALSE directly

canThrow <- c(FALSE, TRUE, FALSE, FALSE, FALSE)
canThrow

[1] FALSE  TRUE FALSE FALSE FALSE

or can implement some logical condition

stable <- c( "RGIII" == 0, nchar("Marshawn") == 8)
stable

[1] FALSE  TRUE

on the variables. Notice here how each of the items is actually evaluated to determine the truth of each expression. In the first case, the character is not equal to zero, and in the second, the number of characters (what nchar() does) is indeed equal to 8 for the character string “Marshawn”.
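Logical conditions on numeric values inherit the floating point caveat from the numeric section. The sketch below revisits the .3 / 3 example and shows one common way around it, the base R function all.equal(), which compares values within a small tolerance (exact and close are names made up for this example).

```r
x <- .3 / 3

# Exact comparison fails because x is stored as slightly less than 0.1
exact <- x == 0.1
exact     # FALSE

# all.equal() compares within a small tolerance; wrap it in isTRUE()
# because it returns a descriptive string, not FALSE, when values differ
close <- isTRUE(all.equal(x, 0.1))
close     # TRUE
```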

It is common to use logical types to serve as indices for vectors. Say for example, you have a vector of data that you want to select some subset from.

data <- rnorm(20)
data

 [1] -0.1938137 -0.8822694  0.3264208  0.9503469  0.1067894 -1.3603158
 [7]  0.3456918  1.1389503  0.4323083 -0.8093313 -0.2968339  0.4930676
[13]  0.4695398 -0.4246004  0.8634220 -0.4695553 -0.8200834  0.9042916
[19]  0.0129408  0.5601540

Perhaps you are only interested in the positive values

data[ data > 0 ]

 [1] 0.3264208 0.9503469 0.1067894 0.3456918 1.1389503 0.4323083 0.4930676
 [8] 0.4695398 0.8634220 0.9042916 0.0129408 0.5601540

If you look at the condition being passed as the index

data > 0

 [1] FALSE FALSE  TRUE  TRUE  TRUE FALSE  TRUE  TRUE  TRUE FALSE FALSE  TRUE
[13]  TRUE FALSE  TRUE FALSE FALSE  TRUE  TRUE  TRUE

you see that each value in the data vector is individually evaluated against the condition that it is strictly greater than zero, giving one logical value per element. When you pass that logical vector as the index, R returns only the elements at the positions that are TRUE.
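A logical vector is useful for more than indexing. As a sketch on a small fixed vector (vals and keep are hypothetical names, and the numbers are made up so the results are repeatable), you can count and locate the matches as well.

```r
vals <- c(-0.19, 0.33, 0.95, -1.36, 0.35)
keep <- vals > 0

# TRUE counts as 1 and FALSE as 0, so sum() counts the matches
sum(keep)      # 3

# mean() gives the proportion of matches
mean(keep)     # 0.6

# which() converts the logical vector to integer positions
which(keep)    # 2 3 5

# Both forms select the same elements
identical(vals[keep], vals[which(keep)])   # TRUE
```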

You can coerce a value into a logical if you understand the rules. Numeric types that equal 0 (zero) are FALSE, always. Any non-zero value is considered TRUE. Here I use the modulus operator, %%, which provides the remainder of a division.

1:20 %% 2

 [1] 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0

which used as indices give us

data[ (1:20 %% 2) > 0 ]

 [1] -0.1938137  0.3264208  0.1067894  0.3456918  0.4323083 -0.2968339
 [7]  0.4695398  0.8634220 -0.8200834  0.0129408
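The coercion rule itself can be checked with as.logical(). The sketch below also illustrates why the explicit comparison (or an explicit coercion) matters when indexing: a bare numeric vector inside the brackets is treated as a set of positions, not as logical values. The names flags, odd, and by_position are made up for this example.

```r
# Zero is FALSE, everything else is TRUE
flags <- as.logical(c(0, 1, -2.5, 42))
flags                  # FALSE TRUE TRUE TRUE

# Explicit coercion of the remainders selects the odd positions
odd <- as.logical(1:10 %% 2)
(1:10)[odd]            # 1 3 5 7 9

# Without coercion the 1s and 0s are read as positions:
# position 1 is returned five times and the zeros are dropped
by_position <- (1:10)[1:10 %% 2]
by_position            # 1 1 1 1 1
```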

You can get as complicated in the creation of indices as you like, even using logical operators such as OR and AND. I leave that as an exercise for you to play with.
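As one possible starting point, here is a small sketch using the element-wise operators & (AND), | (OR), and ! (NOT) on a made-up vector. The name vals and its values are assumptions for illustration only.

```r
vals <- c(-1.2, -0.3, 0.1, 0.8, 1.5)

# AND: both conditions must hold (values strictly between -0.5 and 1)
middle <- vals[vals > -0.5 & vals < 1]
middle      # -0.3 0.1 0.8

# OR: either condition may hold (values in the tails)
tails <- vals[vals < -1 | vals > 1]
tails       # -1.2 1.5

# NOT: invert a condition
not_pos <- vals[!(vals > 0)]
not_pos     # -1.2 -0.3
```

The single-character forms & and | work element by element, which is what indexing needs; the doubled forms && and || are meant for single values, as in if statements.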