It is the flexible reed that survives the storm.
So, in the last section, we discussed the verbs of data analysis. Let's start by loading the Rice Rivers Center data again.
library( readr )
url <- "https://docs.google.com/spreadsheets/d/1Mk1YGH9LqjF7drJE-td1G_JkdADOU0eMlrP01WFBT8s/pub?gid=0&single=true&output=csv"
rice <- read_csv( url )
Parsed with column specification:
cols(
.default = col_double(),
DateTime = col_character()
)
See spec(...) for full column specifications.
What was the daytime air temperature profile for each day during the first week of February?
library( lubridate )
Attaching package: 'lubridate'
The following objects are masked from 'package:base':
date, intersect, setdiff, union
format <- "%m/%d/%Y %I:%M:%S %p"
rice$Date <- parse_date_time( rice$DateTime,
orders=format,
tz="EST")
Make weekdays an ordered factor.
days <- c("Monday","Tuesday","Wednesday","Thursday","Friday","Saturday","Sunday")
rice$Weekday <- weekdays( rice$Date )
rice$Weekday <- factor( rice$Weekday,
ordered=TRUE,
levels=days)
summary( rice$Weekday)
Monday Tuesday Wednesday Thursday Friday Saturday Sunday
1152 1152 1248 1191 1152 1152 1152
Convert the air temperature to Celsius.
rice$AirTemp <- (rice$AirTempF - 32) * 5/9
hist( rice$AirTemp,
xlab="Air Temperature (°C)",
main="")
Figure 1: Air temperature (°C) measured at the Rice Rivers Center in Charles City County, Virginia during the first three months of 2014.
Which set of variables are we going to actually need?
df <- rice[, c("Date", "Weekday", "AirTemp", "PAR")]
summary( df )
Date Weekday AirTemp
Min. :2014-01-01 00:00:00 Monday :1152 Min. :-15.6950
1st Qu.:2014-01-22 08:22:30 Tuesday :1152 1st Qu.: -0.2528
Median :2014-02-12 16:45:00 Wednesday:1248 Median : 3.0222
Mean :2014-02-12 16:45:00 Thursday :1191 Mean : 3.7751
3rd Qu.:2014-03-06 01:07:30 Friday :1152 3rd Qu.: 8.0056
Max. :2014-03-27 09:30:00 Saturday :1152 Max. : 23.8167
Sunday :1152
PAR
Min. : 0.000
1st Qu.: 0.000
Median : 0.046
Mean : 241.984
3rd Qu.: 337.900
Max. :1957.000
Which set of rows are we going to operate on?
- "First Week of February"
- "Daytime"
Day Range
rice$DateTime[25]
[1] "1/1/2014 6:00:00 AM"
start_DateTime <- "2/1/2014 12:00:00 AM"
end_DateTime <- "2/7/2014 11:45:00 PM"
start <- parse_date_time( start_DateTime,
orders=format,
tz="EST")
end <- parse_date_time( end_DateTime,
orders=format,
tz="EST")
c( start, end )
[1] "2014-02-01 00:00:00 EST" "2014-02-07 23:45:00 EST"
df1 <- df[ df$Date >= start & df$Date <= end, ]
summary( df1 )
Date Weekday AirTemp
Min. :2014-02-01 00:00:00 Monday :96 Min. :-3.594
1st Qu.:2014-02-02 17:56:15 Tuesday :96 1st Qu.: 1.106
Median :2014-02-04 11:52:30 Wednesday:96 Median : 3.778
Mean :2014-02-04 11:52:30 Thursday :96 Mean : 4.370
3rd Qu.:2014-02-06 05:48:45 Friday :96 3rd Qu.: 6.639
Max. :2014-02-07 23:45:00 Saturday :96 Max. :16.550
Sunday :96
PAR
Min. : 0.000
1st Qu.: 0.000
Median : 0.044
Mean : 198.283
3rd Qu.: 277.000
Max. :1365.000
Daytime Filtering
hist( df1$PAR )
df2 <- df1[ df1$PAR > 100,]
summary( df2 )
Date Weekday AirTemp PAR
Min. :2014-02-01 09:00:00 Monday :17 Min. :-2.544 Min. : 104.4
1st Qu.:2014-02-02 14:07:30 Tuesday :34 1st Qu.: 2.500 1st Qu.: 272.9
Median :2014-02-04 14:45:00 Wednesday:30 Median : 5.356 Median : 486.2
Mean :2014-02-04 15:01:43 Thursday :36 Mean : 5.732 Mean : 573.0
3rd Qu.:2014-02-06 12:52:30 Friday :37 3rd Qu.: 7.900 3rd Qu.: 879.5
Max. :2014-02-07 18:00:00 Saturday :36 Max. :16.550 Max. :1365.0
Sunday :37
range( df2$Date[ df2$Weekday == "Monday"])
[1] "2014-02-03 11:15:00 EST" "2014-02-03 16:45:00 EST"
df2 <- df1[ df1$PAR > 25,]
summary( df2 )
Date Weekday AirTemp
Min. :2014-02-01 08:30:00 Monday :36 Min. :-3.228
1st Qu.:2014-02-02 15:37:30 Tuesday :39 1st Qu.: 2.431
Median :2014-02-04 13:30:00 Wednesday:38 Median : 5.306
Mean :2014-02-04 13:47:27 Thursday :40 Mean : 5.470
3rd Qu.:2014-02-06 11:22:30 Friday :41 3rd Qu.: 7.381
Max. :2014-02-07 18:30:00 Saturday :40 Max. :16.550
Sunday :41
PAR
Min. : 25.96
1st Qu.: 154.80
Median : 378.70
Mean : 483.61
3rd Qu.: 775.35
Max. :1365.00
range( df2$Date[ df2$Weekday == "Monday"])
[1] "2014-02-03 09:30:00 EST" "2014-02-03 18:15:00 EST"
Maybe?
Maybe As Sunrise/Sunset
Look at only hours and minutes
test <- df1[ df1$Weekday == "Monday",]
test$hour <- hour( test$Date )
test$minute <- minute( test$Date )
test
df3 <- df1
df3$Hour <- hour( df3$Date )
df3$Minute <- minute( df3$Date )
summary( df3 )
Date Weekday AirTemp
Min. :2014-02-01 00:00:00 Monday :96 Min. :-3.594
1st Qu.:2014-02-02 17:56:15 Tuesday :96 1st Qu.: 1.106
Median :2014-02-04 11:52:30 Wednesday:96 Median : 3.778
Mean :2014-02-04 11:52:30 Thursday :96 Mean : 4.370
3rd Qu.:2014-02-06 05:48:45 Friday :96 3rd Qu.: 6.639
Max. :2014-02-07 23:45:00 Saturday :96 Max. :16.550
Sunday :96
PAR Hour Minute
Min. : 0.000 Min. : 0.00 Min. : 0.00
1st Qu.: 0.000 1st Qu.: 5.75 1st Qu.:11.25
Median : 0.044 Median :11.50 Median :22.50
Mean : 198.283 Mean :11.50 Mean :22.50
3rd Qu.: 277.000 3rd Qu.:17.25 3rd Qu.:33.75
Max. :1365.000 Max. :23.00 Max. :45.00
df4 <- df3[ df3$Hour >= 7 & df3$Minute >= 15,]
summary( df4 )
Date Weekday AirTemp
Min. :2014-02-01 07:15:00 Monday :51 Min. :-3.594
1st Qu.:2014-02-02 19:45:00 Tuesday :51 1st Qu.: 1.606
Median :2014-02-04 15:30:00 Wednesday:51 Median : 4.811
Mean :2014-02-04 15:30:00 Thursday :51 Mean : 5.026
3rd Qu.:2014-02-06 11:15:00 Friday :51 3rd Qu.: 6.944
Max. :2014-02-07 23:45:00 Saturday :51 Max. :16.550
Sunday :51
PAR Hour Minute
Min. : 0.000 Min. : 7 Min. :15
1st Qu.: 0.007 1st Qu.:11 1st Qu.:15
Median : 82.400 Median :15 Median :30
Mean : 279.134 Mean :15 Mean :30
3rd Qu.: 449.500 3rd Qu.:19 3rd Qu.:45
Max. :1297.000 Max. :23 Max. :45
df5 <- df4[ df4$Hour <= 17 & df4$Minute <=30, ]
summary( df5 )
Date Weekday AirTemp PAR
Min. :2014-02-01 07:15:00 Monday :22 Min. :-3.211 Min. : 0.0
1st Qu.:2014-02-02 15:18:45 Tuesday :22 1st Qu.: 1.431 1st Qu.: 89.8
Median :2014-02-04 12:22:30 Wednesday:22 Median : 4.850 Median : 325.1
Mean :2014-02-04 12:22:30 Thursday :22 Mean : 4.775 Mean : 427.3
3rd Qu.:2014-02-06 09:26:15 Friday :22 3rd Qu.: 6.808 3rd Qu.: 731.9
Max. :2014-02-07 17:30:00 Saturday :22 Max. :16.550 Max. :1297.0
Sunday :22
Hour Minute
Min. : 7 Min. :15.0
1st Qu.: 9 1st Qu.:15.0
Median :12 Median :22.5
Mean :12 Mean :22.5
3rd Qu.:15 3rd Qu.:30.0
Max. :17 Max. :30.0
df5[21:30,]
Select - Again
df6 <- df5[ , c("Date","Weekday","AirTemp")]
Create a table that has:
- Date as row
- Minimum, Mean, & Maximum air temperature as columns
day( df6$Date )
[1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
[38] 2 2 2 2 2 2 2 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 4 4 4 4 4 4 4 4
[75] 4 4 4 4 4 4 4 4 4 4 4 4 4 4 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 6
[112] 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7
[149] 7 7 7 7 7 7
minTemp <- by( df6$AirTemp, day( df6$Date ), min )
meanTemp <- by( df6$AirTemp, day( df6$Date ), mean )
maxTemp <- by( df6$AirTemp, day( df6$Date ), max )
df.table <- data.frame( Minimum = as.numeric( minTemp ),
Average = as.numeric( meanTemp),
Maximum = as.numeric( maxTemp ) )
df.table
df.table$Date <- mdy( paste( "2",1:7,"2014", sep="/") )
df.table$Weekday <- weekdays( df.table$Date )
df.table
Select to Rearrange
df.table1 <- df.table[ , c(5,1,2,3)]
Table Output
library( knitr )
library( kableExtra )
t <- kable( df.table1,
caption="Table 1: Temperature Ranges for daytime air temperature for the first week of February, 2014 at the Rice Rivers Center in Charles City County, Virginia.")
kable_styling( t )
Weekday | Minimum | Average | Maximum |
---|---|---|---|
Saturday | -3.2111111 | 5.143182 | 11.383333 |
Sunday | 5.9722222 | 11.197222 | 16.550000 |
Monday | 4.4833333 | 5.601010 | 7.244444 |
Tuesday | -0.5055556 | 3.268939 | 5.550000 |
Wednesday | 0.7777778 | 3.425000 | 8.644444 |
Thursday | -0.6166667 | 1.162374 | 3.061111 |
Friday | -0.8000000 | 3.629293 | 7.677778 |
library( ggplot2 )
ggplot( df6, aes(x=Date, y=AirTemp, color=Weekday) ) +
geom_line() +
geom_point() +
theme_minimal()
Challenges associated with this approach
ls()
[1] "days" "df" "df.table" "df.table1"
[5] "df1" "df2" "df3" "df4"
[9] "df5" "df6" "end" "end_DateTime"
[13] "format" "maxTemp" "meanTemp" "minTemp"
[17] "rice" "start" "start_DateTime" "t"
[21] "test" "url"
OK, so let’s jump in.
GGPlot is to built-in graphics as ____________ is to data workflows.
- Tidyverse
- Tidyverse
- Tidyverse, or
- Tidyverse
Tidyverse is a collection of libraries.
library( tidyverse )
── Attaching packages ─────────────────────────────────────── tidyverse 1.3.0 ──
✓ tibble 3.0.4 ✓ dplyr 1.0.2
✓ tidyr 1.1.2 ✓ stringr 1.4.0
✓ purrr 0.3.4 ✓ forcats 0.5.0
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
x lubridate::as.difftime() masks base::as.difftime()
x lubridate::date() masks base::date()
x dplyr::filter() masks stats::filter()
x dplyr::group_rows() masks kableExtra::group_rows()
x lubridate::intersect() masks base::intersect()
x dplyr::lag() masks stats::lag()
x lubridate::setdiff() masks base::setdiff()
x lubridate::union() masks base::union()
And one of the most important things that it includes is the notion of Pipes.
df2 <- SOME_OPERATION( df1 )
df3 <- SOME_OTHER_OPERATION( df2 )
df4 <- A_THIRD_OPERATION( df3 )
ggplot( df4, aes(x=...,y=...) ) + geom_point()
This pattern of operations is very common.
The output of one function becomes the input of another one
Instead of re-assigning all of these intermediate variables, we can take a shortcut using the pipe operator, denoted %>% (it is defined in the magrittr library, which is part of the tidyverse).
df1 %>%
SOME_OPERATION() %>%
SOME_OTHER_OPERATION() %>%
A_THIRD_OPERATION() %>%
ggplot( aes(x=...,y=...) ) + geom_point()
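As a quick orientation (this little example is mine, not part of the original lesson), the value on the left of %>% simply becomes the first argument of the function on the right, so the calls below should return identical results.
sqrt( 16 )          # classic function call
16 %>% sqrt()       # same result, written as a pipe
rice %>% nrow()     # same as nrow( rice )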
Here is a pipe of the df.table1 data.frame into the kable and kableExtra functions.
df.table1 %>%
kable( format="html", digits = 2 ) %>%
kable_paper() %>%
column_spec( 2, color=ifelse( df.table1$Minimum < 0, "blue", ""))
Weekday | Minimum | Average | Maximum |
---|---|---|---|
Saturday | -3.21 | 5.14 | 11.38 |
Sunday | 5.97 | 11.20 | 16.55 |
Monday | 4.48 | 5.60 | 7.24 |
Tuesday | -0.51 | 3.27 | 5.55 |
Wednesday | 0.78 | 3.43 | 8.64 |
Thursday | -0.62 | 1.16 | 3.06 |
Friday | -0.80 | 3.63 | 7.68 |
NOTE: The pipe implicitly passes the output of the previous item (on the left of the pipe) along as the input to the next item (on the right of the pipe). So there is no need to redefine data.frame objects or to put the names of the data.frames inside the function parentheses.
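A hedged aside of my own: when the piped object needs to land somewhere other than the first argument, magrittr lets you mark that spot with a lone dot (.).
df.table1 %>% head( n = 3 )                      # same as head( df.table1, n = 3 )
df.table1 %>% lm( Average ~ Minimum, data = . )  # the dot pipes it into the 'data' argument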
Same for ggplot()
df.table1 %>%
ggplot( aes(x=Weekday,y=Average) ) +
geom_col() +
ylab("Average Air Temperature (°C)") +
theme_classic()
OK, so now let's get back into the verbs of data analysis. The reason I used these particular keywords is that they are identical to the function names used in dplyr (the data-pliers library in the tidyverse).
- select()
- filter()
- mutate()
- arrange()
- group_by()
- summarize()
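To connect these to what we did above (a sketch of my own, not the author's), here is one verb paired with its rough base-R equivalent; assuming no missing temperatures, both lines should return the same rows and columns.
head( rice[ rice$AirTempF < 32, c("DateTime", "AirTempF") ] )                    # base R subsetting
rice %>% filter( AirTempF < 32 ) %>% select( DateTime, AirTempF ) %>% head()     # dplyr verbs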
In what follows, I will redo the stuff from above showing how we can incorporate both the pipe function as well as these verb functions. Let’s first reload the data from scratch.
rice <- read_csv( url )
Parsed with column specification:
cols(
.default = col_double(),
DateTime = col_character()
)
See spec(...) for full column specifications.
Select allows us to grab columns by name from the data.frame.
rice %>%
select( DateTime, AirTempF ) %>%
head()
To drop columns, you can use the name of the column with a negative sign prepended to it.
rice %>%
select( -RecordID, -SpCond_mScm, -PH_mv, -Depth_ft, -SurfaceWaterElev_m_levelNad83m ) %>%
names()
[1] "DateTime" "PAR" "WindSpeed_mph" "WindDir"
[5] "AirTempF" "RelHumidity" "BP_HG" "Rain_in"
[9] "H2O_TempC" "Salinity_ppt" "PH" "Turbidity_ntu"
[13] "Chla_ugl" "BGAPC_CML" "BGAPC_rfu" "ODO_sat"
[17] "ODO_mgl" "Depth_m"
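A small aside I am adding here: select() also understands the tidyselect helper functions, such as contains() and starts_with(), which match columns by name pattern rather than by listing each one.
rice %>%
  select( contains( "Temp" ) ) %>%
  names()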
You can also use it to rearrange the column order (and, because we are lazy, we have the everything() function to say 'well, everything else that I haven't already identified').
rice %>%
select( AirTempF, WindDir, Rain_in, everything() ) %>%
names()
[1] "AirTempF" "WindDir"
[3] "Rain_in" "DateTime"
[5] "RecordID" "PAR"
[7] "WindSpeed_mph" "RelHumidity"
[9] "BP_HG" "H2O_TempC"
[11] "SpCond_mScm" "Salinity_ppt"
[13] "PH" "PH_mv"
[15] "Turbidity_ntu" "Chla_ugl"
[17] "BGAPC_CML" "BGAPC_rfu"
[19] "ODO_sat" "ODO_mgl"
[21] "Depth_ft" "Depth_m"
[23] "SurfaceWaterElev_m_levelNad83m"
Filter allows us to select rows based on attributes of the data within the table itself.
rice %>%
filter( AirTempF < 32 ) %>%
head()
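Worth noting (my own hedged sketch): filter() accepts several comma-separated conditions and combines them with a logical AND, so compound row selections read quite naturally.
rice %>%
  filter( AirTempF < 32, Rain_in > 0 ) %>%      # below freezing AND some rain recorded
  select( DateTime, AirTempF, Rain_in ) %>%
  head()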
Mutate allows us to change the columns of the data, either in-place (e.g., replacing the original column) or by adding columns to it.
rice %>%
mutate( Date = parse_date_time( DateTime,
orders=format,
tz="EST") ) %>%
mutate( Weekday = factor( weekdays( Date ),
ordered=TRUE,
levels=days) ) %>%
mutate( AirTemp = (AirTempF - 32) * 5/9 ) %>%
select( Date, Weekday, AirTemp) %>%
summary()
Date Weekday AirTemp
Min. :2014-01-01 00:00:00 Monday :1152 Min. :-15.6950
1st Qu.:2014-01-22 08:22:30 Tuesday :1152 1st Qu.: -0.2528
Median :2014-02-12 16:45:00 Wednesday:1248 Median : 3.0222
Mean :2014-02-12 16:45:00 Thursday :1191 Mean : 3.7751
3rd Qu.:2014-03-06 01:07:30 Friday :1152 3rd Qu.: 8.0056
Max. :2014-03-27 09:30:00 Saturday :1152 Max. : 23.8167
Sunday :1152
Note: I had to do separate mutate() calls here to get the weekday: the first makes the Date column, and the second uses that column to make another one for the weekdays.
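Whether they truly must be separate depends on your dplyr version; in the versions I have used, a later expression inside a single mutate() can reference a column created just above it, so the two calls could probably be collapsed. Treat this sketch as an assumption to verify on your own installation.
rice %>%
  mutate( Date    = parse_date_time( DateTime, orders = format, tz = "EST" ),
          Weekday = factor( weekdays( Date ), ordered = TRUE, levels = days ) ) %>%   # Weekday uses the Date made just above
  select( Date, Weekday ) %>%
  head()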
It is also possible to use this to make more readable column names ("Look ma! No ylab needed!"). You just have to surround the new column name with backtick characters.
rice %>%
mutate( Date = parse_date_time( DateTime,
orders=format,
tz="EST") ) %>%
mutate( `Air Temperature (°C)` = (AirTempF - 32) * 5/9 ) %>%
select( Date, `Air Temperature (°C)`) %>%
ggplot( aes( x = Date, y = `Air Temperature (°C)`) ) +
geom_line() +
theme_classic()
Arrange is used to sort the data.
rice %>%
arrange( AirTempF ) %>%
select( DateTime, AirTempF ) %>%
head()
Reversing the sort (i.e., putting it in descending order) is done by prepending a negative sign.
rice %>%
arrange( -AirTempF ) %>%
select( DateTime, AirTempF ) %>%
head()
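As an alternative (my addition, not from the original), dplyr also provides desc(), which reads a bit more clearly than negating the column and also works for non-numeric columns.
rice %>%
  arrange( desc( AirTempF ) ) %>%   # same descending sort without the unary minus
  select( DateTime, AirTempF ) %>%
  head()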
So here is where we start getting to have some fun. The group_by() function partitions the data and sets up the content for the subsequent steps. Think about the various ways we have used by() thus far. For those, we had to:
- identify the data column to operate on,
- identify the variable to group it by, and
- name the function to apply to each group.
Same things here. It is just that being grouped gives the data.frame an extra attribute. Compare the class of the rice data.frame.
class( rice )
[1] "spec_tbl_df" "tbl_df" "tbl" "data.frame"
to what it is after I make weekdays and then group_by() that column.
rice %>%
mutate( Date = parse_date_time( DateTime,
orders=format,
tz="EST") ) %>%
mutate( Weekday = factor( weekdays( Date ),
ordered=TRUE,
levels=days) ) %>%
group_by( Weekday ) %>%
class()
[1] "grouped_df" "tbl_df" "tbl" "data.frame"
It is this grouped_df that is used by things like summarize() when it does its operations. It will make more sense in a minute.
Summarize allows you to take a bit of the original data and then perform operations on it to create a new data.frame.
Here is how we could get the total amount of rain for each weekday in the entire data set, from the raw data, as a single inquiry.
rice %>%
mutate( Date = parse_date_time( DateTime,
orders=format,
tz="EST") ) %>%
mutate( Weekday = factor( weekdays( Date ),
ordered=TRUE,
levels=days) ) %>%
group_by( Weekday ) %>%
summarize( Rain = sum( Rain_in ) , .groups = 'drop')
Only the columns in the group_by and summarize statements will be kept and provided as output.
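To close the loop, here is a sketch of my own (not the author's code) that rebuilds something like the earlier minimum/average/maximum table in a single pipeline, assuming the PAR > 100 cutoff stands in for "daytime"; because the earlier table used the hour-and-minute filter instead, the numbers will differ slightly.
rice %>%
  mutate( Date = parse_date_time( DateTime, orders = format, tz = "EST" ) ) %>%
  mutate( Weekday = factor( weekdays( Date ), ordered = TRUE, levels = days ) ) %>%
  mutate( AirTemp = (AirTempF - 32) * 5/9 ) %>%
  filter( Date >= start, Date <= end, PAR > 100 ) %>%   # first week of February, "daytime" by light level
  group_by( Weekday ) %>%
  summarize( Minimum = min( AirTemp ),
             Average = mean( AirTemp ),
             Maximum = max( AirTemp ),
             .groups = 'drop' )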