The actual process of analysis can be thought of as a set of verb-like actions that we apply to raw data in order to extract meaningful insights.
For this topic, we will use a moderate-sized data set from the Rice Rivers Center, which contains water and atmospheric data from a stream of sensors both in the James River and on the bluff overlooking the river.
library( readr )
url <- "https://docs.google.com/spreadsheets/d/1Mk1YGH9LqjF7drJE-td1G_JkdADOU0eMlrP01WFBT8s/pub?gid=0&single=true&output=csv"
rice <- read_csv( url )
These data have 8199 records measured on 23 columns and represent samples collected every 15 minutes, starting on 1/1/2014 12:00:00 AM and ending on 3/27/2014 9:30:00 AM.
names( rice )
[1] "DateTime" "RecordID"
[3] "PAR" "WindSpeed_mph"
[5] "WindDir" "AirTempF"
[7] "RelHumidity" "BP_HG"
[9] "Rain_in" "H2O_TempC"
[11] "SpCond_mScm" "Salinity_ppt"
[13] "PH" "PH_mv"
[15] "Turbidity_ntu" "Chla_ugl"
[17] "BGAPC_CML" "BGAPC_rfu"
[19] "ODO_sat" "ODO_mgl"
[21] "Depth_ft" "Depth_m"
[23] "SurfaceWaterElev_m_levelNad83m"
If we think about it, there are some fundamental processes that we use to work with and manipulate data. These include:

- **Select** columns of data (e.g., grab just the columns that `ggplot` will use to plot).
- **Filter** rows of data (e.g., keep only rows where `Species == setosa` in the `iris` data set).
- **Mutate** the values in a column (e.g., convert `rice$AirTempF` to Celsius).
- **Arrange** the rows of a `data.frame` (e.g., sort by `rice$Depth_m` to find the lowest or highest tides in the data frame).
- **Group** rows by a `factor` or other column of data (e.g., separating the values of `Sepal.Length` for each of the `Species` groups in the `iris` data set).
- **Summarize** grouped data (e.g., the `mean` value of `Sepal.Length` for each `Species` in the `iris` data set).

These verbs and/or actions can be combined in many ways to allow us to extract information from raw data.
## R Approaches

To begin, we will walk through the ways this can be done using normal `R` syntax: be a bit clever in how we use indices, mix in a bunch of logical operators and a smattering of `$` operators, and we are good to go!
To select columns, we use either the column names as character objects in the column index position (e.g., after the comma in the square brackets1)
df <- rice[, c("DateTime","PAR","WindDir","PH")]
summary( df )
DateTime PAR WindDir PH
Length:8199 Min. : 0.000 Min. : 0.00 Min. :6.43
Class :character 1st Qu.: 0.000 1st Qu.: 37.31 1st Qu.:7.50
Mode :character Median : 0.046 Median :137.30 Median :7.58
Mean : 241.984 Mean :146.20 Mean :7.60
3rd Qu.: 337.900 3rd Qu.:249.95 3rd Qu.:7.69
Max. :1957.000 Max. :360.00 Max. :9.00
NA's :1
or the column indices in the square-brackets.
df <- rice[ , c(1,3,5,13) ]
summary( df )
DateTime PAR WindDir PH
Length:8199 Min. : 0.000 Min. : 0.00 Min. :6.43
Class :character 1st Qu.: 0.000 1st Qu.: 37.31 1st Qu.:7.50
Mode :character Median : 0.046 Median :137.30 Median :7.58
Mean : 241.984 Mean :146.20 Mean :7.60
3rd Qu.: 337.900 3rd Qu.:249.95 3rd Qu.:7.69
Max. :1957.000 Max. :360.00 Max. :9.00
NA's :1
As you can see, the first approach is probably more effective than the second because, simply by looking at the code, we can see which columns we are going to grab. Moreover, if you are using RStudio, the very helpful autocomplete should pop up and offer you the names of the columns as you type.
Using numeric indices gives you no such help.
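To see why name-based selection is also more robust, here is a small self-contained sketch (the tiny `toy` data.frame is made up for illustration; it is not the rice data):

```r
# A made-up data.frame, just to illustrate the point.
toy <- data.frame( a = 1:3, b = 4:6, c = 7:9 )

# Selecting by name states exactly which columns we want and keeps
# working even if the column order of the data ever changes.
toy[ , c("a", "c") ]

# match() translates names into their current numeric positions,
# which is handy if you ever do need the indices.
match( c("a", "c"), names(toy) )
```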
To filter rows, we can do the same thing as for selecting columns. Here we can either use numerical values to take slices (e.g., I’m grabbing the first 96 entries, which correspond to all the entries taken on January 1, 2014):
df1 <- df[ 1:96, ]
head( df1 )
Or we can use logical operators that test some condition based upon the values in the `data.frame`.
For logical operators (returning `TRUE`, `FALSE`, or `NA`), the following are available:
Operator | Definition |
---|---|
`!=` | Not equal to |
`==` | Equal to |
`>` | Strictly greater than |
`>=` | Greater than or equal to |
`<` | Strictly less than |
`<=` | Less than or equal to |
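One wrinkle worth flagging before we start filtering: comparisons against missing data return `NA`, not `TRUE` or `FALSE` (which is why the summaries above report an `NA` count). A minimal sketch:

```r
x <- c( 1, 2, NA, 4 )

# Comparisons are vectorized, and NA propagates through them.
x > 2                  # FALSE FALSE    NA  TRUE

# When indexing rows, an NA test produces an all-NA row, so it is
# often safer to require non-missing values explicitly.
x > 2 & !is.na( x )    # FALSE FALSE FALSE  TRUE
```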
For example, maybe I only want estimates where PAR (Photosynthetically Active Radiation, the spectral range of solar radiation from 400 to 700 nanometers that photosynthetic organisms are able to use in the process of photosynthesis) is greater than zero (e.g., daytime to plants).
df1 <- df[ df$PAR > 0, ]
summary( df1 )
DateTime PAR WindDir PH
Length:4967 Min. : 0.003 Min. : 0.00 Min. :6.48
Class :character 1st Qu.: 3.828 1st Qu.: 48.94 1st Qu.:7.51
Mode :character Median : 213.300 Median :182.80 Median :7.59
Mean : 399.442 Mean :164.10 Mean :7.61
3rd Qu.: 679.250 3rd Qu.:258.90 3rd Qu.:7.71
Max. :1957.000 Max. :359.90 Max. :8.64
NA's :1
We can also use functions that return `TRUE` or `FALSE`, such as `is.na()`:
df[ is.na(df$PH), ]
or mathematical expressions (here I use the modulo operator, which gives the remainder left over after integer division) to grab every other row. If I apply it to a sequence of values, I can get every even index
(1:10 %% 2)==0
[1] FALSE TRUE FALSE TRUE FALSE TRUE FALSE TRUE FALSE TRUE
or every odd index
(1:10 %% 2) == 1
[1] TRUE FALSE TRUE FALSE TRUE FALSE TRUE FALSE TRUE FALSE
or every third (or whatever)
(1:20 %% 3) == 0
[1] FALSE FALSE TRUE FALSE FALSE TRUE FALSE FALSE TRUE FALSE FALSE TRUE
[13] FALSE FALSE TRUE FALSE FALSE TRUE FALSE FALSE
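Applied to the row indices, the same modulo trick slices a data.frame. A self-contained sketch with a made-up data.frame (not `df`):

```r
# Ten made-up rows, one per 15-minute interval.
toy <- data.frame( minute = seq( 0, 135, by = 15 ) )

# Keep every third row by testing the row indices.
every_third <- toy[ (1:nrow(toy) %% 3) == 0, , drop = FALSE ]
every_third$minute     # 30 75 120
```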
We can even select a random subset of rows using the `sample()` function. Here I’ll select 5 rows randomly (see `?sample` for more specifics on using this):
idx <- sample( 1:nrow(df), size=5, replace=FALSE)
idx
[1] 5168 1248 5547 505 1755
df[ idx, ]
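Note that `sample()` draws at random, so the rows you get will differ from the ones shown here on every run. If you need the same random subset each time (say, for a reproducible report), set the seed first:

```r
set.seed( 42 )   # any fixed integer makes the draw repeatable
idx <- sample( 1:8199, size = 5, replace = FALSE )

# Re-setting the same seed reproduces exactly the same indices.
set.seed( 42 )
identical( idx, sample( 1:8199, size = 5, replace = FALSE ) )   # TRUE
```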
We can combine individual statements using AND as well as OR operators to make more complicated selections. In R, we combine logical statements using `&` for AND and `|` for OR.
So if I’m looking for samples where there is light and/or the wind is coming from a certain direction, I could combine these operators as follows (n.b., I am only showing rows 30–35, as there are some interesting combinations here):
par <- df$PAR > 0
wind <- df$WindDir < 30
cbind( par, wind )[30:35,]
par wind
[1,] FALSE TRUE
[2,] FALSE TRUE
[3,] FALSE TRUE
[4,] TRUE TRUE
[5,] TRUE FALSE
[6,] TRUE FALSE
To look at what happens when we combine them using `&`
cbind( par, wind, par & wind)[30:35,]
par wind
[1,] FALSE TRUE FALSE
[2,] FALSE TRUE FALSE
[3,] FALSE TRUE FALSE
[4,] TRUE TRUE TRUE
[5,] TRUE FALSE FALSE
[6,] TRUE FALSE FALSE
as well as |
cbind( par, wind, par | wind)[30:35,]
par wind
[1,] FALSE TRUE TRUE
[2,] FALSE TRUE TRUE
[3,] FALSE TRUE TRUE
[4,] TRUE TRUE TRUE
[5,] TRUE FALSE TRUE
[6,] TRUE FALSE TRUE
So, to put it all together as indices for `df`, we can just shove these into the square brackets:
df1 <- df[ df$PAR > 0 & df$WindDir < 30, ]
head( df1 )
To mutate columns of data in a `data.frame`, we use re-assignment. As an example, let’s jump into the first column of the `rice` data, the character column with date & time values. As it stands now, these data are of class:
class( rice$DateTime )
[1] "character"
So, the value in
rice$DateTime[234]
[1] "1/3/2014 10:15:00 AM"
is simply a string of characters, as far as `R` is concerned. Dates and times are sticky things in data analysis because they do not work the way we think they should: think leap years, time zones, daylight saving, and a world of different calendars.
Fortunately, some smart programmers have figured this out for us already. What they did was make the second the base unit of time and designate 00:00:00 on 1 January 1970 as the Unix epoch. Time on most modern computers is measured from that starting point. It is much easier to measure the difference between two points in time using the seconds since the Unix epoch, and then translate the result into one or more of these calendars, than to deal with all the different calendars each time. So, under the hood, much of the date and time machinery is kept in terms of epoch seconds.
unclass( Sys.time() )
[1] 1605102028
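Going the other direction, a count of epoch seconds can be turned back into a calendar date and time with `as.POSIXct()`. For example, the billionth second of the Unix epoch:

```r
# Convert epoch seconds back into a date & time (reported in UTC).
as.POSIXct( 1e9, origin = "1970-01-01", tz = "UTC" )
# "2001-09-09 01:46:40 UTC"
```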
To make date data types, you need to have a raw `character` string such as
rice$DateTime[1]
[1] "1/1/2014 12:00:00 AM"
and then we need to tell the conversion function what the various elements within that string represent. There are many ways to write dates and times (10/14, 14 Oct, October 14, Julian day 287, etc.). These are designated by a format string in which we indicate which element represents the day, month, or year. The available codes are listed in the documentation for `?strptime`.
In our case, we have:
- Month as 1 or 2 digits
- Day as 1 or 2 digits
- Year as 4 digits
- a space to separate date from time
- hour (on a 12-hour clock, not 24-hour)
- minutes in 2 digits
- seconds in 2 digits
- a space to separate time from timezone
- timezone
- `/` separating date elements
- `:` separating time elements
To make the format string, we need to look up how to encode each of these items. Essentially, the format consists of percent signs, each followed by an upper- or lower-case letter, representing these objects (the whitespace and punctuation parts are matched literally).
The items in `rice` for a date & time object such as 1/1/2014 11:00:00 AM are encoded as:
format <- "%m/%d/%Y %I:%M:%S %p"
Now, we can convert the character string in the data frame
record <- rice$DateTime[1]
record
[1] "1/1/2014 12:00:00 AM"
class( record )
[1] "character"
to a Date object. I like using the lubridate
library2 as it has a lot of additional functionality that we’ll play with a bit later.
library( lubridate )
x <- parse_date_time( record, orders=format, tz = "EST" )
x
[1] "2014-01-01 EST"
class(x)
[1] "POSIXct" "POSIXt"
Very cool.
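For reference, base R can do the same conversion without any extra packages via `as.POSIXct()` and the same format string (one caveat: `%p` is locale-dependent, so parsing AM/PM assumes an English-style locale):

```r
# Base-R equivalent of the lubridate conversion above.
x <- as.POSIXct( "1/1/2014 12:00:00 AM",
                 format = "%m/%d/%Y %I:%M:%S %p",
                 tz = "EST" )
x            # "2014-01-01 EST" (12:00:00 AM parses as midnight)
class( x )   # "POSIXct" "POSIXt"
```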
So, now let’s apply that and mutate the original data, replacing the date and time column with a Date object.
rice$Date <- parse_date_time( rice$DateTime,
orders=format,
tz = "EST")
summary( rice$Date )
Min. 1st Qu. Median
"2014-01-01 00:00:00" "2014-01-22 08:22:30" "2014-02-12 16:45:00"
Mean 3rd Qu. Max.
"2014-02-12 16:45:00" "2014-03-06 01:07:30" "2014-03-27 09:30:00"
Now, we can ask Date-like questions about the data such as what day of the week was the first sample taken?
weekdays( rice$Date[1] )
[1] "Wednesday"
What is the range of dates?
range( rice$Date )
[1] "2014-01-01 00:00:00 EST" "2014-03-27 09:30:00 EST"
What is the median sample date?
median( rice$Date )
[1] "2014-02-12 16:45:00 EST"
and what Julian ordinal day (e.g., how many days since the start of the year) is the last record?
yday( rice$Date[8199] )
[1] 86
Just for fun, I’ll add a column to the data that holds the weekday.
rice$Weekday <- weekdays( rice$Date )
summary( rice$Weekday )
Length Class Mode
8199 character character
class( rice$Weekday )
[1] "character"
However, we should probably turn it into a factor (e.g., a data type with pre-defined levels—and for us here—an intrinsic order of the levels).
rice$Weekday <- factor( rice$Weekday,
ordered = TRUE,
levels = c("Monday","Tuesday","Wednesday",
"Thursday", "Friday",
"Saturday", "Sunday")
)
summary( rice$Weekday )
Monday Tuesday Wednesday Thursday Friday Saturday Sunday
1152 1152 1248 1191 1152 1152 1152
To arrange data in a `data.frame`, we `order()` it. Let’s say we want to sort in decreasing order of `PAR`.
df <- rice[ order( rice$PAR, decreasing=TRUE), ]
head( df )
For sorting with more than one column, just add it to the order()
function.
df <- rice[ order( rice$PAR, rice$WindSpeed_mph, decreasing = TRUE), ]
head( df )
(compare the last two rows between these two outputs).
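The tie-breaking behavior is easier to see on a tiny, made-up example: when the first sort key has ties, `order()` falls back to the second key:

```r
primary   <- c( 2, 1, 2 )
secondary <- c( 3, 9, 1 )

# Positions 1 and 3 tie on `primary`, so `secondary` breaks the tie.
order( primary, secondary, decreasing = TRUE )   # 1 3 2
```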
As we showed in the lecture on basic graphics on the slide discussing making data for the barplot, we can group data by existing data columns (Species in that case or Weekday in this one) by filtering the rows under some condition.
monday <- rice[ rice$Weekday == "Monday",]
max( monday$PAR )
[1] 1733
tuesday <- rice[ rice$Weekday == "Tuesday",]
max( tuesday$PAR )
[1] 1622
To summarize some data, we did have a shortcut back on those barplot slides: the `by()` function.
So for PAR, we can look at maximum values by day of the week
maxPAR <- by( rice$PAR, rice$Weekday, max )
barplot( maxPAR, las=2 , ylab="PAR", main="Maximum PAR at the Rice Center")
If we wanted to make a `data.frame` from the summary values, say to get the mean of the non-zero PAR values and the standard deviation for records taken at noon each day, we’d have to combine some of the methods above.
noon <- seq( rice$Date[49], rice$Date[7729], length.out = 81 )
noon[1:10]
[1] "2014-01-01 12:00:00 EST" "2014-01-02 12:00:00 EST"
[3] "2014-01-03 12:00:00 EST" "2014-01-04 12:00:00 EST"
[5] "2014-01-05 12:00:00 EST" "2014-01-06 12:00:00 EST"
[7] "2014-01-07 12:00:00 EST" "2014-01-08 12:00:00 EST"
[9] "2014-01-09 12:00:00 EST" "2014-01-10 12:00:00 EST"
Notice what I did here. The `Date` variable we made above knows how to make sequences out of date objects because it knows what constitutes a day. I took the noon entry for 1/1/2014 and the noon entry for 3/22/2014 and made a sequence equal in length to the number of days between them. Next, I can pull out only the entries in the full data set whose values are in that vector (using the `%in%` operator, which is a set operator) and make a new data set from just the noon entries.
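As a quick, self-contained illustration of `%in%` before applying it: the operator returns `TRUE` for each left-hand element that appears anywhere in the right-hand vector:

```r
c( 1, 2, 3, 4, 5 ) %in% c( 2, 4, 6 )
# FALSE  TRUE FALSE  TRUE FALSE
```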
rice_noon <- rice[ rice$Date %in% noon, ]
And then make a new data set of both the mean and standard deviation of those values.
df <- data.frame( Weekday = levels( rice_noon$Weekday ) )
df$PAR <- by( rice_noon$PAR, rice_noon$Weekday, mean )
df$sd <- by( rice_noon$PAR, rice_noon$Weekday, sd )
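A small aside: `by()` returns a `by` object rather than a plain vector, which mostly behaves like one here but can surprise you downstream (wrapping it in `as.numeric()` flattens it). Base R’s `aggregate()` reaches a similar summary in one call; here is a sketch using a made-up toy data set rather than the rice data:

```r
# Made-up grouped data, just to show the pattern.
toy <- data.frame( day = rep( c("Mon", "Tue"), each = 3 ),
                   PAR = c( 10, 20, 30, 40, 50, 60 ) )

# Mean PAR for each day, returned as a plain data.frame.
aggregate( PAR ~ day, data = toy, FUN = mean )
#   day PAR
# 1 Mon  20
# 2 Tue  50
```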
Then we could use it to make a table:
library( knitr )
kable( df, output="html",digits = 2,caption = "Table 1: Mean and standard deviation of PAR at the Rice Center in 2014." )
Weekday | PAR | sd |
---|---|---|
Monday | 690.5000 | 443.2262 |
Tuesday | 680.8709 | 495.1324 |
Wednesday | 569.5000 | 285.6899 |
Thursday | 845.8917 | 500.8599 |
Friday | 980.6083 | 436.2707 |
Saturday | 845.9667 | 513.4987 |
Sunday | 937.5727 | 315.4537 |
or plot them using `ggplot`
library( ggplot2 )
ggplot( df, aes(x=Weekday, y=PAR) ) +
geom_col() +
geom_errorbar( aes(ymin=PAR, ymax=PAR+sd ) ) +
theme_classic()
HOLD THE BOAT!
Notice how those values on the x-axis do not conform to the ordering of the week. That is because we made this new data frame with just the levels, but did not make them into an ordered factor. Let’s clean that up first.
df$Weekday <- factor( df$Weekday,
ordered=TRUE,
levels = levels( rice$Weekday) )
df$Workdays <- c( rep("Work Day", 5), "Weekend","Weekend")
summary( df )
Weekday PAR sd Workdays
Monday :1 Min. :569.5 Min. :285.7 Length:7
Tuesday :1 1st Qu.:685.7 1st Qu.:375.9 Class :character
Wednesday:1 Median :845.9 Median :443.2 Mode :character
Thursday :1 Mean :793.0 Mean :427.2
Friday :1 3rd Qu.:891.8 3rd Qu.:498.0
Saturday :1 Max. :980.6 Max. :513.5
Sunday :1
Now when we plot it, the weekdays appear in the correct order (similar plot code as before, but using the modified data frame, drawing the error bars from PAR-sd to PAR+sd, and coloring the bars by work day).
library( RColorBrewer )
ggplot( df, aes(x=Weekday, y=PAR) ) +
geom_errorbar( aes(ymin=PAR-sd, ymax=PAR+sd ) ) +
geom_col( aes( fill=Workdays ) ) +
theme_classic() +
scale_fill_brewer( type="qual",palette = 4)
OK, so the main workflows we adopt in R require some fundamental data manipulations: select, filter, mutate, arrange, group, and summarize. These basic tasks can be combined in various ways to yield informative insights from the data, such as that awesome noon PAR plot above, which incidentally shows that there is more sun at noon on Fridays than on any other day, implying that we should just take Fridays off to have more epic sunny weekends in winter!
While this is all fine and dandy, the process here entails a fair amount of complexity:

- Overuse of logical statements to do boolean math on indices.
- Lots of `$` operators to reference specific columns of data within the `data.frame` objects.
- Overuse of reassignment for derivative data frames. Above, we made `rice`, `rice_noon`, and `df` just to get that last figure. In each of these cases, we are allocating a new `data.frame` variable each time. We could reassign these back onto themselves (e.g., reuse variable names), but each time we do, we have to allocate and reallocate large blocks of memory, and for data sets larger than these (which are commonly used) it becomes a bit of a pain.
- The overall syntax is hard to read.
For these reasons, the `tidyverse` was made…
Remember that if you leave either the row or column index empty in the square brackets, it will include all observations (e.g., if you leave row empty all rows will be returned whereas if you leave column empty all columns will be returned).↩︎
If you get an error saying something like “there is no package called ‘lubridate’”, then run `install.packages("lubridate")` to install it. You only need to do this once.↩︎