4 Manipulating Data

The sine qua non of of this course is instruction in how how to manipulate data. In this class activity, we will focus on some built-in data sets to allow you to work in R and get more keyboard time in RStudio. There will be a series of questions at the end of the lecture that will ask you to retrieve certain information from various data sets.

4.1 Data Import & Export

Raw data is imported into R using read.* functions. There are a wide variety of file formats available, most of which have a corresponding function (e.g., read.csv(), read.delim(), read.dcf(), etc.). For our purposes, we will focus on using comma separated files (*.CSV) as they are the most readily available and can be read by almost all editors and spreadsheet functions.

On the lecture webpage, there is a file named iris.csv. Download this file and put it in the same directory as your RStudio session. If you do not know where this is, you can find it by asking R to get its current working directory as:

getwd()

## [1] "/Users/rodney/Documents/Teaching/EnvironmentalDataLiteracy"

The same information is also printed across the top of the “Files” pane in the RStudio interface (though it starts from your ‘home’ directory instead of the top of the file path).

One way to easily open this location is to select the “Show Folder in New Window” menu item in the “More” menu on that same pane. It will open the folder you are looking at in the file system as a new window for you, then you can drag and drop things into it.

Keep in mind that R is running in a specific location on your computer. This working directory is where it looks for stuff if you do not give a complete file path (e.g., ‘C:\Users...’ or ‘/Users/…’ on winblows and mac, respectively). To load in a CSV file, we can use the function

data <- read.csv("file.csv")

where at a bare minimum, we need to have the name of the file (in the example above it was ‘file.csv’). There are a lot of additional arguments you can pass to read.csv() including:

header = TRUE: Does the file have a header row that gives the variable names?
sep = ",": What is the column separator. By default for CSV, it is a comma.
quote = "\"": Is there text that is quoted within the body of the document?
dec = ".": What is the decimal character?
fill = TRUE: Do you want to fill in the empty data cells or do all rows of data have the same amount of data.
comment.char = "": Are there comments in the text?

Additional options are available if you look at the help file as:

?read.csv

Once you have read the file in (it will complain with an error message if it does not work), then you will have a data.frame object named data (from the example above, you should of course name it something more descriptive).

$\;$

Saving materials in R is a bit easier. If you are needing to export the file back into a CSV format then you can use write.csv() (see ?write.csv for specifics and examples) and it will write the file as a text file. However, if you are only working in R with that file, you can save it as an R object without translating it back and forth through a CSV file. Using the example data from above, you could save the data.frame as an R data object using:

save( data, file="mydata.rda")

and it will save the object. Next time you need it, you can load it in using:

load("mydata.rda")

4.2 Diving Into Data

In this next section, we will walk through the example iris data set and highlight some of the ways that you can manipulate data.frame objects. If you have loaded in the data set as outlined above, we should be able to get a first look at it using the summary() function.

summary(flowers)

##   Sepal.Length    Sepal.Width     Petal.Length    Petal.Width   
##  Min.   :4.300   Min.   :2.000   Min.   :1.000   Min.   :0.100  
##  1st Qu.:5.100   1st Qu.:2.800   1st Qu.:1.600   1st Qu.:0.300  
##  Median :5.800   Median :3.000   Median :4.350   Median :1.300  
##  Mean   :5.843   Mean   :3.057   Mean   :3.758   Mean   :1.199  
##  3rd Qu.:6.400   3rd Qu.:3.300   3rd Qu.:5.100   3rd Qu.:1.800  
##  Max.   :7.900   Max.   :4.400   Max.   :6.900   Max.   :2.500  
##        Species  
##  setosa    :50  
##  versicolor:50  
##  virginica :50  
##                 
##                 
##

Each of the columns in the data.frame is named, the first four of which are numerical values and the last one is a species designation.

4.3 Factor & Character Data

Depending upon how you have set up your R session, the last column may be either a character type or a factor. By default, some builds of R turn all string characters into factors (a behavior I do not like because there are a lot of data types that I load in that are best defined as non-factors).

In this data set though, it is probably best if the Species column were really a factor and not just a character data type. We will play with this data set a bit and we will use the Species as a defining category (i.e., a factor).

flowers$Species <- factor( flowers$Species )
summary(flowers)

##   Sepal.Length    Sepal.Width     Petal.Length    Petal.Width   
##  Min.   :4.300   Min.   :2.000   Min.   :1.000   Min.   :0.100  
##  1st Qu.:5.100   1st Qu.:2.800   1st Qu.:1.600   1st Qu.:0.300  
##  Median :5.800   Median :3.000   Median :4.350   Median :1.300  
##  Mean   :5.843   Mean   :3.057   Mean   :3.758   Mean   :1.199  
##  3rd Qu.:6.400   3rd Qu.:3.300   3rd Qu.:5.100   3rd Qu.:1.800  
##  Max.   :7.900   Max.   :4.400   Max.   :6.900   Max.   :2.500  
##        Species  
##  setosa    :50  
##  versicolor:50  
##  virginica :50  
##                 
##                 
##

Notice in the output above how the factor column is tabulated by a count of the number of observations whereas the summary of the character data is just lumped all together. Also remember, if you do not specify the factors as ordered=TRUE then it will assume that they are categorical data without ordinal information. In this case, there is no reason to assume that one of these species types is greater than or less than another one so these are unordered factors.

Conversely, if you have imported some data and it was automatically interpreted as factors, you can change a factor back to a character type by using the as.character() function. I find in my interactions with data, I use a lot more textual data than factors so I set the default up to not automatically translate it into factors, only doing so when I need to.

In the example above, I replaced the original column of data with the factor version, though you may not need to do that. You can easily add another column of data to the data.frame giving it a new name. In the example below, I take the flowers$Species column and make a new one named Taxonomy by pasting the genus of the flowers onto it.

flowers$Taxonomy <- paste( "Iris", flowers$Species )
summary(flowers)

##   Sepal.Length    Sepal.Width     Petal.Length    Petal.Width   
##  Min.   :4.300   Min.   :2.000   Min.   :1.000   Min.   :0.100  
##  1st Qu.:5.100   1st Qu.:2.800   1st Qu.:1.600   1st Qu.:0.300  
##  Median :5.800   Median :3.000   Median :4.350   Median :1.300  
##  Mean   :5.843   Mean   :3.057   Mean   :3.758   Mean   :1.199  
##  3rd Qu.:6.400   3rd Qu.:3.300   3rd Qu.:5.100   3rd Qu.:1.800  
##  Max.   :7.900   Max.   :4.400   Max.   :6.900   Max.   :2.500  
##        Species     Taxonomy        
##  setosa    :50   Length:150        
##  versicolor:50   Class :character  
##  virginica :50   Mode  :character  
##                                    
##                                    
##

$\;$

4.4 Indexing

The normal workflow in R consists of loading data into your session, manipulating the data, and performing operations—statistical, summary, or graphical–on it (or some subset of it). Each element in the data.frame is indexed by the row and column number. The order of the columns is as shown or can be viewed using

names(flowers)

## [1] "Sepal.Length" "Sepal.Width"  "Petal.Length" "Petal.Width" 
## [5] "Species"      "Taxonomy"

So if we wanted to access the Sepal.Width of 3$^{rd}$ observation, we could use the numerical indices (where the “Sepal.Width” is the second name in this list):

flowers[3,2]

## [1] 3.2

or by accessing the 3$^{rd}$ element of the “Sepal.Width” vector.

flowers$Sepal.Width[3]

## [1] 3.2

If you pull observations from a data.frame, you either get a new data.frame (if you include more than one column)

df <- flowers[,c(2,3,5)]
class(df)

## [1] "data.frame"

summary(df)

##   Sepal.Width     Petal.Length         Species  
##  Min.   :2.000   Min.   :1.000   setosa    :50  
##  1st Qu.:2.800   1st Qu.:1.600   versicolor:50  
##  Median :3.000   Median :4.350   virginica :50  
##  Mean   :3.057   Mean   :3.758                  
##  3rd Qu.:3.300   3rd Qu.:5.100                  
##  Max.   :4.400   Max.   :6.900

or a vector of data (if you only have one column or any subset of one column).

sepal_width <- flowers$Sepal.Width
length(sepal_width)

## [1] 150

class(sepal_width)

## [1] "numeric"

sepal_width

##   [1] 3.5 3.0 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 3.7 3.4 3.0 3.0 4.0 4.4 3.9
##  [18] 3.5 3.8 3.8 3.4 3.7 3.6 3.3 3.4 3.0 3.4 3.5 3.4 3.2 3.1 3.4 4.1 4.2
##  [35] 3.1 3.2 3.5 3.6 3.0 3.4 3.5 2.3 3.2 3.5 3.8 3.0 3.8 3.2 3.7 3.3 3.2
##  [52] 3.2 3.1 2.3 2.8 2.8 3.3 2.4 2.9 2.7 2.0 3.0 2.2 2.9 2.9 3.1 3.0 2.7
##  [69] 2.2 2.5 3.2 2.8 2.5 2.8 2.9 3.0 2.8 3.0 2.9 2.6 2.4 2.4 2.7 2.7 3.0
##  [86] 3.4 3.1 2.3 3.0 2.5 2.6 3.0 2.6 2.3 2.7 3.0 2.9 2.9 2.5 2.8 3.3 2.7
## [103] 3.0 2.9 3.0 3.0 2.5 2.9 2.5 3.6 3.2 2.7 3.0 2.5 2.8 3.2 3.0 3.8 2.6
## [120] 2.2 3.2 2.8 2.8 2.7 3.3 3.2 2.8 3.0 2.8 3.0 2.8 3.8 2.8 2.8 2.6 3.0
## [137] 3.4 3.1 3.0 3.1 3.1 3.1 2.7 3.2 3.3 3.0 2.5 3.0 3.4 3.0

In addition to numerical values, you can also use logical statements to select subsets of your data. Here is an example of all the data whose flowers$Sepal.Width is less than 3.0cm and those that have sepals as large or bigger than 3.0cm.

small_sepals <- flowers[ flowers$Sepal.Width < 3.0, ]
big_sepals <- flowers[ flowers$Sepal.Width >= 3.0, ]
summary( small_sepals )

##   Sepal.Length    Sepal.Width    Petal.Length    Petal.Width   
##  Min.   :4.400   Min.   :2.00   Min.   :1.300   Min.   :0.200  
##  1st Qu.:5.600   1st Qu.:2.50   1st Qu.:4.000   1st Qu.:1.200  
##  Median :6.000   Median :2.70   Median :4.500   Median :1.400  
##  Mean   :5.953   Mean   :2.64   Mean   :4.509   Mean   :1.449  
##  3rd Qu.:6.300   3rd Qu.:2.80   3rd Qu.:5.100   3rd Qu.:1.800  
##  Max.   :7.700   Max.   :2.90   Max.   :6.900   Max.   :2.400  
##        Species     Taxonomy        
##  setosa    : 2   Length:57         
##  versicolor:34   Class :character  
##  virginica :21   Mode  :character  
##                                    
##                                    
##

summary( big_sepals )

##   Sepal.Length    Sepal.Width     Petal.Length    Petal.Width   
##  Min.   :4.300   Min.   :3.000   Min.   :1.000   Min.   :0.100  
##  1st Qu.:5.000   1st Qu.:3.000   1st Qu.:1.500   1st Qu.:0.200  
##  Median :5.600   Median :3.200   Median :1.900   Median :0.500  
##  Mean   :5.776   Mean   :3.313   Mean   :3.298   Mean   :1.046  
##  3rd Qu.:6.500   3rd Qu.:3.500   3rd Qu.:5.200   3rd Qu.:1.800  
##  Max.   :7.900   Max.   :4.400   Max.   :6.700   Max.   :2.500  
##        Species     Taxonomy        
##  setosa    :48   Length:93         
##  versicolor:16   Class :character  
##  virginica :29   Mode  :character  
##                                    
##                                    
##

$\;$

There are two ways you can also merge data.frame objects. You can add data onto the bottom of the data.frame by using the rbind() function (row-binding). This requires that both data.frame objects have the same column names (and in the same order).

all_sepals <- rbind( small_sepals, big_sepals )
summary( all_sepals )

##   Sepal.Length    Sepal.Width     Petal.Length    Petal.Width   
##  Min.   :4.300   Min.   :2.000   Min.   :1.000   Min.   :0.100  
##  1st Qu.:5.100   1st Qu.:2.800   1st Qu.:1.600   1st Qu.:0.300  
##  Median :5.800   Median :3.000   Median :4.350   Median :1.300  
##  Mean   :5.843   Mean   :3.057   Mean   :3.758   Mean   :1.199  
##  3rd Qu.:6.400   3rd Qu.:3.300   3rd Qu.:5.100   3rd Qu.:1.800  
##  Max.   :7.900   Max.   :4.400   Max.   :6.900   Max.   :2.500  
##        Species     Taxonomy        
##  setosa    :50   Length:150        
##  versicolor:50   Class :character  
##  virginica :50   Mode  :character  
##                                    
##                                    
##

You can also merge together two data.frames that have a common column index. This may be necessary for cases where you have different kinds of observations that need to be merged into a single data.frame.

Here is an example where I have some data on my sampling sites

sites <- data.frame( Site=c("Olympia","Richmond"), 
                     Latitude=c(47.0379,37.5407), 
                     Longitude=c(122.9007,77.4360) )
sites

##       Site Latitude Longitude
## 1  Olympia  47.0379  122.9007
## 2 Richmond  37.5407   77.4360

and another set of data that has some observations on samples taken from each site.

data <- data.frame( Site=c("Olympia","Olympia","Olympia","Richmond","Richmond"))
data$Measurement <- c(12,22,35,56,46)
data

##       Site Measurement
## 1  Olympia          12
## 2  Olympia          22
## 3  Olympia          35
## 4 Richmond          56
## 5 Richmond          46

If I wanted to merge these two data.frame objects into a single one, incorporating each of the columns of data in both, I would

df <- merge( sites, data )
df

##       Site Latitude Longitude Measurement
## 1  Olympia  47.0379  122.9007          12
## 2  Olympia  47.0379  122.9007          22
## 3  Olympia  47.0379  122.9007          35
## 4 Richmond  37.5407   77.4360          56
## 5 Richmond  37.5407   77.4360          46

Here the merge() function looks for a column that has the same name in both data.frame objects. In this case it was “Site”. It then uses that as an index to merge both together into a new object.

4.5 Sorting & Ordering

The order in which rows of observations are in the data.frame is determined by their placement in the original file. If you look at the data, it seems to be sorted by flowers$Species but nothing after that.

head(flowers)

##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species    Taxonomy
## 1          5.1         3.5          1.4         0.2  setosa Iris setosa
## 2          4.9         3.0          1.4         0.2  setosa Iris setosa
## 3          4.7         3.2          1.3         0.2  setosa Iris setosa
## 4          4.6         3.1          1.5         0.2  setosa Iris setosa
## 5          5.0         3.6          1.4         0.2  setosa Iris setosa
## 6          5.4         3.9          1.7         0.4  setosa Iris setosa

You can sort the whole data.frame by asking for a copy of it with a specific order based upon the columns. Here I will re-assign the data.frame but this time ordered by flowers$Sepal.Length.

flowers <- flowers[ order(flowers$Sepal.Length), ]
head(flowers)

##    Sepal.Length Sepal.Width Petal.Length Petal.Width Species    Taxonomy
## 14          4.3         3.0          1.1         0.1  setosa Iris setosa
## 9           4.4         2.9          1.4         0.2  setosa Iris setosa
## 39          4.4         3.0          1.3         0.2  setosa Iris setosa
## 43          4.4         3.2          1.3         0.2  setosa Iris setosa
## 42          4.5         2.3          1.3         0.3  setosa Iris setosa
## 4           4.6         3.1          1.5         0.2  setosa Iris setosa

Note the row indices (the column on the far left), they indicate the original order in which the observations were put into the data.frame. The smallest sepal length in the data set was originally the 14$^{th}$ observation.

You see from the example above that the addition of new columns to a data.frame result in them being put on the right-hand side data.frame (e.g., flowers$Taxonomy is the last column of the output). If you want to rearrange the columns of the data.frame, you do so by manually re-arranging the indices on those columns. For example, if I wanted to make the numerical data the last four columns instead of the first two, I would specify it as:

flowers <- flowers[ , c(6,5,1,2,3,4) ]
summary(flowers)

##    Taxonomy               Species    Sepal.Length    Sepal.Width   
##  Length:150         setosa    :50   Min.   :4.300   Min.   :2.000  
##  Class :character   versicolor:50   1st Qu.:5.100   1st Qu.:2.800  
##  Mode  :character   virginica :50   Median :5.800   Median :3.000  
##                                     Mean   :5.843   Mean   :3.057  
##                                     3rd Qu.:6.400   3rd Qu.:3.300  
##                                     Max.   :7.900   Max.   :4.400  
##   Petal.Length    Petal.Width   
##  Min.   :1.000   Min.   :0.100  
##  1st Qu.:1.600   1st Qu.:0.300  
##  Median :4.350   Median :1.300  
##  Mean   :3.758   Mean   :1.199  
##  3rd Qu.:5.100   3rd Qu.:1.800  
##  Max.   :6.900   Max.   :2.500

4.6 Manipulating Data

As we saw above, you can change data either in-place or by adding new columns (and you could also drop columns by reordering them and just not include all the columns in the column indices). You can also perform operations on columns of data, again either in-place, as an additional column, or not connected at all with the data.frame. To do it in-place, you simply re-assign the values after the calculation. For example, here is how I would subtract the average “Sepal.Length” from all the observations.

ave_sepal_length <- mean( flowers$Sepal.Length )
standardized_sepal_lengths <- flowers$Sepal.Length - ave_sepal_length
summary( standardized_sepal_lengths )

##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
## -1.54300 -0.74330 -0.04333  0.00000  0.55670  2.05700

Which should have a mean of zero, right?

$\;$

4.7 Tabulating

We can finish up this exploration of the iris data set by doing some summaries. One of the main strengths of R is that it is a functional language. This may not mean much to you now, but in the long run you will find that you can perform opertions on large data sets by merging together cleaver use of indices and the application of one or more functions.

Consider the case where we are trying to find the mean for each of the three species. Proceedurally, we must do the following general steps:

Go through each row of the data and figure out which species we are looking at.
Add the length of this observation to a tally variable, one for each species.
Count how many observations of each species we have in the data.frame.
Divide the approporiate tally variable by the number of observations.
Return or print out the results.

Not a trivial thing to do. If you were going to to this literally, you would have to set up three tally variables, three observation count variables, and then do a loop and run through all the observations making decisions on which observation is which and where to add the tally and count.

Fortunately for us, this is not that difficult of an issue in R for the following two reasons:
1. We have a function mean() that estimates the mean of a set of variables. 2. We have a factor varaible in flowers that differentiates between species.

As such we can then ask R to estimate the mean in “Sepal.Length” by “Species” using

mean_sepal_length <- by( flowers$Sepal.Length, flowers$Species, mean)
mean_sepal_length

## flowers$Species: setosa
## [1] 5.006
## -------------------------------------------------------- 
## flowers$Species: versicolor
## [1] 5.936
## -------------------------------------------------------- 
## flowers$Species: virginica
## [1] 6.588

If you read this literally, it says, “Wit Sepal Length, partition the data by Species and pass it to the function mean”. This shorthand is important in that it highlights the flexibility and utility of applying functions to our data. With a few keystrokes, we can accomplish a lot of computational progress!

You could find these values by slicing as:

# literally grab the mean of all sepal lengths WHERE species is identical to X
mean_setosa <- mean( flowers$Sepal.Length[ flowers$Species == "setosa" ] )
mean_versicolor <- mean( flowers$Sepal.Length[ flowers$Species == "versicolor" ] )
mean_virginica <- mean( flowers$Sepal.Length[ flowers$Species == "virginica" ] )
mean_sepal_lengths_long <- c(setosa=mean_setosa, versicolor=mean_versicolor, virginica=mean_virginica )
mean_sepal_lengths_long

##     setosa versicolor  virginica 
##      5.006      5.936      6.588

But it sure does look like a lot more code to write! In general, we should strive to keep our code as easy to understand as possible. In the end, you will be reading your own code and you must be able to easily understand it at some random time in the future. For this to happen, you really need to clear on what you are trying to do.

$\;$