3 Data Containers
We almost never work with a single datum¹; rather, we keep lots of data. Moreover, the kinds of data are often heterogeneous, including categorical (populations, regions), continuous (coordinates, rainfall, elevation), imagery (hyperspectral, LiDAR), and perhaps even genetic. R has a very rich set of containers into which we can stuff our data as we work with it. Here these container types are examined, and the restrictions and benefits associated with each type are explained.
3.1 Vectors
We have already seen several examples of vectors in action (see the introduction to Numeric data types, for example). A vector of objects is simply a collection of them, often created using the c() function (c for combine). Vectorized data is restricted to having homogeneous data types: you cannot mix character and numeric types in the same vector. When you combine values, R will either keep them in the single type they already share
x <- c(1,2,3)
x
## [1] 1 2 3
y <- c(TRUE,TRUE,FALSE)
y
## [1] TRUE TRUE FALSE
z <- c("I","am","not","a","loser")
z
## [1] "I"     "am"    "not"   "a"     "loser"
or, if you mix types, coerce them all into one type that can represent every value you have given it. In this example, a logical, a character, a numeric constant, and the output of a function are combined, resulting in a vector of type character.
w <- c(TRUE, "1", pi, ls())
w
## [1] "TRUE" "1" "3.14159265358979"
## [4] "x" "y" "z"
class(w)
## [1] "character"
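The coercion follows a predictable ordering, which a minimal sketch can make concrete. The variable names below are illustrative only and are not part of the original example; the point is that logical values promote to numeric, and anything combined with a character string becomes character.

```r
# Logical mixed with numeric: TRUE/FALSE become 1/0
a <- c(TRUE, FALSE, 3.5)
class(a)

# Anything mixed with a character string becomes character
b <- c(1, "two", TRUE)
class(b)

# Coercion can also be requested explicitly; values that cannot be
# converted become NA (R warns about this, silenced here)
n <- suppressWarnings(as.numeric(c("1", "2", "three")))
n
```

If a vector unexpectedly reports class "character", a stray quoted value somewhere in the input is a likely cause.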
Accessing elements within a vector is done using the square bracket [] notation. All indices (for vectors and matrices) start at 1 (not zero, as is the case for some languages). Getting and setting the components within a vector is accomplished using numeric indices with the assignment operator, just as we do for variables containing a single value.
x
## [1] 1 2 3
x[1] <- 2
x[3] <- 1
x
## [1] 2 2 1
x[2]
## [1] 2
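Square brackets also accept more than one index at a time. The following is a small sketch using a made-up vector v (not one of the variables above) to show selecting several positions, dropping a position with a negative index, and assigning to several positions at once.

```r
v <- c(10, 20, 30, 40, 50)      # illustrative vector

first_and_third <- v[c(1, 3)]   # select positions 1 and 3
without_first   <- v[-1]        # a negative index drops that position
v[c(2, 4)] <- 0                 # assign to several positions at once
n <- length(v)                  # number of elements; valid indices run 1..n
```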
A common type of vector is a sequence. We use sequences all the time: to iterate through a list, to count generations, etc. There are a few ways to generate sequences, depending upon the step size. For a sequence of whole numbers, the easiest is through the use of the colon operator.
x <- 1:6
x
## [1] 1 2 3 4 5 6
This provides a nice shorthand, X:Y, for getting the values from X to Y, inclusive. It is also possible to go backwards using this operator, counting down from X to Y as in:
x <- 5:2
x
## [1] 5 4 3 2
The only constraint here is that we are limited to a step size of 1.0. It is possible to use non-integers as the bounds; R will just count up by 1.0 from the first value and stop at the last value that does not exceed the second.
x <- 3.2:8.4
x
## [1] 3.2 4.2 5.2 6.2 7.2 8.2
If you are interested in making a sequence with a step other than 1.0, you can use the seq() function. If you do not provide a step value, it defaults to 1.0.
y <- seq(1,6)
y
## [1] 1 2 3 4 5 6
But if you do, it will use that instead.
z <- seq(1,20,by=2)
z
## [1] 1 3 5 7 9 11 13 15 17 19
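Beyond a fixed step, seq() can also be asked for a fixed number of points through its length.out argument, or given a negative step to count down. The values below are chosen only for illustration.

```r
by_half   <- seq(0, 2, by = 0.5)         # fixed step size: 0, 0.5, ..., 2
five_pts  <- seq(0, 1, length.out = 5)   # fixed number of evenly spaced points
countdown <- seq(10, 1, by = -3)         # a negative step counts down
```

Using length.out is handy when the number of points matters more than the spacing, for example when laying out evenly spaced values along an axis.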
It is also possible to create a vector of objects as repetitions using the rep() (for repeat) function.
rep("Beetlejuice",3)
## [1] "Beetlejuice" "Beetlejuice" "Beetlejuice"
If you pass a vector of items to rep(), it can repeat them either as the whole vector repeated in order (the default behavior)
x <- c("No","Free","Lunch")
rep(x,times=3)
## [1] "No" "Free" "Lunch" "No" "Free" "Lunch" "No" "Free" "Lunch"
or as each item in the vector repeated.
rep(x,each=3)
## [1] "No" "No" "No" "Free" "Free" "Free" "Lunch" "Lunch" "Lunch"
3.2 Matrices
A matrix is a 2-dimensional container, most commonly used to store numeric data types. There are some libraries that use higher-dimensional versions (rows and columns and sheets; R calls these arrays), though you will not run across them too often. Here I restrict myself to only 2-dimensional matrices.
You can define a matrix by giving it a set of values and an indication of the number of rows and columns you want. The easiest matrix to try is one with empty values:
matrix(nrow=2, ncol=2)
##      [,1] [,2]
## [1,]   NA   NA
## [2,]   NA   NA
Perhaps more useful is one that is pre-populated with values.
matrix(1:4, nrow=2 )
##      [,1] [,2]
## [1,]    1    3
## [2,]    2    4
Notice that here there were four entries and I specified only the number of rows required. By default, the ‘filling-in’ of the matrix proceeds down each column (by column). In this example, the first two entries go down the first column and the last two entries go down the second column. If you want it to fill by row, you can pass the optional argument byrow=TRUE
matrix(1:4, nrow=2, byrow=TRUE)
##      [,1] [,2]
## [1,]    1    2
## [2,]    3    4
and it will fill by-row.
When filling matrices, the dimensions you request and the size of the data being added to the matrix are critical. For example, I can create a matrix as:
Y <- matrix(c(1,2,3,4,5,6),ncol=2,byrow=TRUE)
Y
##      [,1] [,2]
## [1,]    1    2
## [2,]    3    4
## [3,]    5    6
or
X <- matrix(c(1,2,3,4,5,6),nrow=2)
X
##      [,1] [,2] [,3]
## [1,]    1    3    5
## [2,]    2    4    6
and the two results hold the same values, only transposed.
X == t(Y)
##      [,1] [,2] [,3]
## [1,] TRUE TRUE TRUE
## [2,] TRUE TRUE TRUE
In the example above, the number of entries was a clean multiple of the number of rows (or columns). However, if it is not, R will recycle the values you passed to fill in the remaining slots.
X <- matrix(c(1,2,3,4,5,6),ncol=4, byrow=TRUE)
Notice how you get a warning from the interpreter. But that does not stop it from filling in the remaining slots by starting over in the sequence of numbers you passed to it.
X
##      [,1] [,2] [,3] [,4]
## [1,]    1    2    3    4
## [2,]    5    6    1    2
The dimensionality of a matrix (and data.frame, as we will see shortly) is returned by the dim() function. This will provide the number of rows and columns as a vector.
dim(X)
## [1] 2 4
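When only one of the two dimensions is needed, nrow() and ncol() return them individually, and length() gives the total number of entries. A brief sketch on a made-up matrix M (not the X defined above):

```r
M <- matrix(1:6, nrow = 2)   # illustrative matrix with 2 rows and 3 columns

d       <- dim(M)      # rows, then columns
n_rows  <- nrow(M)     # same as dim(M)[1]
n_cols  <- ncol(M)     # same as dim(M)[2]
n_total <- length(M)   # total number of entries, rows times columns
```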
Accessing elements to retrieve or set their values within a matrix is done using square brackets, just as for a vector, but you need to give [row,col] indices. Again, these are 1-based so that
X[1,3]
## [1] 3
is the entry in the 1st row and 3rd column.
You can also use ‘slices’ through a matrix to get the rows
X[1,]
## [1] 1 2 3 4
or columns
X[,3]
## [1] 3 1
of data. Here you just omit the index for the entity you want to span. Notice that when you grab a slice, even if it is a column, it is returned as a vector.
length(X[,3])
## [1] 2
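If the matrix shape needs to survive the slice, the bracket operator accepts a drop=FALSE argument. The sketch below uses an illustrative matrix M rather than the X above.

```r
M <- matrix(1:8, nrow = 2)        # illustrative matrix with 2 rows and 4 columns

col_vec <- M[, 3]                 # plain vector; the dimensions are dropped
col_mat <- M[, 3, drop = FALSE]   # still a matrix, with 2 rows and 1 column

is.null(dim(col_vec))             # a plain vector has no dim attribute
dim(col_mat)
```

This matters mostly when later code expects a matrix, for example when the result is passed on to matrix multiplication.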
You can grab a sub-matrix using slices if you give a range (or sequence) of indices.
X[,2:3]
##      [,1] [,2]
## [1,]    2    3
## [2,]    6    1
If you ask for values from a matrix that exceed its dimensions, R will give you an error.
X[1,8]
## Error in X[1, 8] : subscript out of bounds
There are a few cool extensions of the rep() function that can be used to create matrices as well. They are optional values that can be passed to the function.
times=x: This is the default option, occupied by the ‘3’ in the earlier rep("Beetlejuice",3) example, and represents the number of times the first argument will be repeated.
each=x: This will take each element in the first argument and repeat it x times.
length.out=x: This makes the result equal in length to x.
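Each option on its own, as a minimal sketch with an illustrative two-element vector s. The last line also shows that times may be given as a vector, one count per element.

```r
s <- c("a", "b")                      # illustrative input

r_times  <- rep(s, times = 2)         # whole vector repeated:  a b a b
r_each   <- rep(s, each = 2)          # each element repeated:  a a b b
r_length <- rep(s, length.out = 5)    # recycled to length 5:   a b a b a
r_counts <- rep(s, times = c(3, 1))   # per-element counts:     a a a b
```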
In combination, these can be quite helpful. Here is an example using numeric sequences in which it is necessary to find the [row,col] index of every entry in a matrix with 2 rows and 3 columns. To make the indices, I bind the columns together using cbind(). There is a matching row binding function, denoted as rbind() (perhaps not so surprisingly). What is returned is a matrix:
indices <- cbind( rep(1:2, each=3), rep(1:3,times=2), rep(5,length.out=6) )
indices
##      [,1] [,2] [,3]
## [1,]    1    1    5
## [2,]    1    2    5
## [3,]    1    3    5
## [4,]    2    1    5
## [5,]    2    2    5
## [6,]    2    3    5
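For comparison, here is a small sketch of rbind() next to cbind() on the same two vectors. The names top and bottom are made up for illustration.

```r
top    <- c(1, 2, 3)
bottom <- c(4, 5, 6)

stacked <- rbind(top, bottom)   # 2 rows, 3 columns; rows are named after the vectors
side    <- cbind(top, bottom)   # 3 rows, 2 columns; columns are named after the vectors

dim(stacked)
dim(side)
```

The two results are transposes of one another, so the choice between them comes down to whether each vector should become a row or a column.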
3.3 Lists
A list is a type of vector whose entries are typically accessed by ‘keys’ (names) rather than by numeric indices. Moreover, lists can contain heterogeneous types of data (e.g., values of different class), which is not possible in a vector type. For example, consider the list
theList <- list( x=seq(2,40, by=2), dog=LETTERS[1:5], hasStyle=logical(5) )
summary(theList)
##          Length Class  Mode
## x        20     -none- numeric
## dog       5     -none- character
## hasStyle  5     -none- logical
which is defined with a numeric, a character, and a logical component. Each of these entries can be different in length as well as type. Once defined, the entries may be observed as:
theList
## $x
## [1] 2 4 6 8 10 12 14 16 18 20 22 24 26 28 30 32 34 36 38 40
##
## $dog
## [1] "A" "B" "C" "D" "E"
##
## $hasStyle
## [1] FALSE FALSE FALSE FALSE FALSE
Once created, you can add variables to the list using the $-operator followed by the name of the key for the new entry.
theList$my_favorite_number <- 2.9 + 3i
or use double brackets and the name of the variable as a character string.
theList[["lotto numbers"]] <- rpois(7,lambda=42)
The keys currently in the list are given by the names() function
names(theList)
## [1] "x"                  "dog"                "hasStyle"
## [4] "my_favorite_number" "lotto numbers"
Getting and setting values within a list is done the same way, using either the $-operator
theList$x
## [1] 2 4 6 8 10 12 14 16 18 20 22 24 26 28 30 32 34 36 38 40
theList$x[2] <- 42
theList$x
## [1] 2 42 6 8 10 12 14 16 18 20 22 24 26 28 30 32 34 36 38 40
or the double brackets
theList[["x"]]
## [1] 2 42 6 8 10 12 14 16 18 20 22 24 26 28 30 32 34 36 38 40
or using a numeric index, in which case the index refers to the position of the key in the results of names().
theList[[2]]
## [1] "A" "B" "C" "D" "E"
The use of the double brackets in essence provides a direct link to the variable in the list whose name is second in the results of names() (dog in this case). If you want to access elements within that variable, then you add a second set of brackets after the double ones.
theList[[1]][3]
## [1] 6
This deviates from the matrix approach as well as from how we access entries in a data.frame (described next). It is not a single square bracket with two indices; that gives you an error:
theList[1,3]
## Error in theList[1, 3] : incorrect number of dimensions
Lists are rather robust objects that allow you to store a wide variety of data types (including nested lists). Once you get the indexing scheme down, they will provide nice solutions for many of your computational needs.
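As a minimal sketch of that indexing scheme, the example below contrasts single and double brackets and then reaches into a nested list. The list aList and its keys are invented for illustration and are separate from theList above.

```r
aList <- list(x = 1:3, dog = c("A", "B"))   # illustrative list

sub  <- aList["x"]     # single bracket: a list of length 1 that holds x
item <- aList[["x"]]   # double bracket: the stored vector itself

aList$inner <- list(depth = 2, tags = c("p", "q"))   # a list stored inside a list
deep <- aList$inner$tags[2]                          # chain accessors to reach "q"
same <- aList[["inner"]][["tags"]][2]                # equivalent, using double brackets
```

A common stumbling block is using single brackets and getting back a list when the contents were wanted; class() on the result is a quick way to check.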
3.4 Data Frames
The data.frame is the default data container in R. It is analogous to both a spreadsheet, at least in the way that I have used spreadsheets in the past, and a database. If you consider a single spreadsheet containing measurements and observations from your research, you may have many columns of data, each of which may be a different kind of data. There may be factors representing designations such as species, regions, populations, sex, flower color, etc. Other columns may contain numeric data types for items such as latitude, longitude, dbh, and nectar sugar content. You may also have specialized columns such as dates collected, genetic loci, and any other information you may be collecting.
On a spreadsheet, each column has a unified data type, either quantified with a value or as a missing value, NA, in each row. Rows typically represent the sampling unit, perhaps individual or site, along which all of these various items have been measured or determined. A data.frame is similar to this, at least conceptually. You define a data.frame by designating the columns of data to be used. You do not need to define all of them; more can be added later. The values passed can be sequences, collections of values, or computed parameters. For example:
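# stringsAsFactors=TRUE keeps Names a factor, as in the output shown here (the default became FALSE in R 4.0.0)
df <- data.frame( ID=1:5, Names=c("Bob","Alice","Vicki","John","Sarah"), Score=100 - rpois(5,lambda=10), stringsAsFactors=TRUE)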
df
##   ID Names Score
## 1  1   Bob    91
## 2  2 Alice    94
## 3  3 Vicki    90
## 4  4  John    87
## 5  5 Sarah    93
You can see that each column is a unified type of data and each row is equivalent to a record. Additional data columns may be added to an existing data.frame as:
df$Passed_Class <- c(TRUE,TRUE,TRUE,FALSE,TRUE)
Since we may have many (thousands?) rows of observations, a summary() of the data.frame can provide a more compact description.
summary(df)
##        ID       Names       Score    Passed_Class
##  Min.   :1   Alice:1   Min.   :87   Mode :logical
##  1st Qu.:2   Bob  :1   1st Qu.:90   FALSE:1
##  Median :3   John :1   Median :91   TRUE :4
##  Mean   :3   Sarah:1   Mean   :91   NA's :0
##  3rd Qu.:4   Vicki:1   3rd Qu.:93
##  Max.   :5             Max.   :94
As above, columns of data are added to the data.frame after the fact using the $-operator followed by the new column name. Depending upon the data type of each column, the summary will provide an appropriate overview of what is there.
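A new column can also be computed from an existing one, and a few helper functions report the shape and column types. The data.frame scores below is a made-up stand-in, kept separate from df so the random scores above do not matter.

```r
scores <- data.frame(ID = 1:3, Score = c(91, 94, 90))   # illustrative data

scores$Passed <- scores$Score > 90   # new logical column derived from Score

dim(scores)      # number of rows, then number of columns
names(scores)    # column names, in order
str(scores)      # compact listing of each column's type and first values
```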
3.4.1 Indexing Data Frames
You can access individual items within a data.frame by numeric index such as:
df[1,3]
## [1] 91
You can take slices along rows (which returns a new data.frame for you)
df[1,]
##   ID Names Score Passed_Class
## 1  1   Bob    91         TRUE
or along columns (which give you a vector of data)
df[,3]
## [1] 91 94 90 87 93
or use the $-operator, as you did for the list data type, to get direct access to either all the data in a column or a specific subset therein.
df$Names[3]
## [1] Vicki
## Levels: Alice Bob John Sarah Vicki
Indices are ordered just like for matrices, rows first then columns. You can also pass a set of indices such as:
df[1:3,]
##   ID Names Score Passed_Class
## 1  1   Bob    91         TRUE
## 2  2 Alice    94         TRUE
## 3  3 Vicki    90         TRUE
It is also possible to use logical operators as indices. Here I select only those names in the data.frame whose score was greater than 90 and who passed popgen.
df$Names[df$Score > 90 & df$Passed_Class==TRUE]
## [1] Bob Alice Sarah
## Levels: Alice Bob John Sarah Vicki
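The logical expression inside the brackets is itself a vector with one TRUE or FALSE per row, and it can select whole records as well as a single column. The sketch below rebuilds the same values in a separate data.frame called grades (a hypothetical name) so it does not depend on the random scores above; subset() is a base R alternative for the same selection.

```r
grades <- data.frame(
  Names = c("Bob", "Alice", "Vicki", "John", "Sarah"),
  Score = c(91, 94, 90, 87, 93),
  Passed_Class = c(TRUE, TRUE, TRUE, FALSE, TRUE)
)

keep      <- grades$Score > 90 & grades$Passed_Class   # one TRUE/FALSE per row
top_rows  <- grades[keep, ]                            # whole records, not just names
same_rows <- subset(grades, Score > 90 & Passed_Class) # same selection via subset()
```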
This is why data.frame objects are very database-like. They can contain lots of data, and you can extract from them the subsets that you need to work on. This is a VERY important feature, one that is vital for reproducible research. Keep your data in one and only one place.
The word data is plural; datum is singular.