class: left, middle, inverse
background-image: url("https://live.staticflickr.com/65535/50362989122_a8ee154fea_k_d.jpg")
background-size: cover

# .orange[Workflow Judo!🥋]

### Environmental Data Literacy

---

# R Data Workflow

> Describe the daytime air temperatures at the Rice Rivers Center for the first week of February, 2014.

--

To do this, we need to perform the following sequence of general *verb* actions on the data.
--

.blue[Use this as an example]

---

# Load Data

```r
library( readr )
url <- "https://docs.google.com/spreadsheets/d/1Mk1YGH9LqjF7drJE-td1G_JkdADOU0eMlrP01WFBT8s/pub?gid=0&single=true&output=csv"
rice <- read_csv( url )
names( rice )
```

```
##  [1] "DateTime"                       "RecordID"
##  [3] "PAR"                            "WindSpeed_mph"
##  [5] "WindDir"                        "AirTempF"
##  [7] "RelHumidity"                    "BP_HG"
##  [9] "Rain_in"                        "H2O_TempC"
## [11] "SpCond_mScm"                    "Salinity_ppt"
## [13] "PH"                             "PH_mv"
## [15] "Turbidity_ntu"                  "Chla_ugl"
## [17] "BGAPC_CML"                      "BGAPC_rfu"
## [19] "ODO_sat"                        "ODO_mgl"
## [21] "Depth_ft"                       "Depth_m"
## [23] "SurfaceWaterElev_m_levelNad83m"
```

---

# Make Date Data Type 🗓

.greeninline[Mutate] the data by adding a new `Date` column that holds a date-time object.

```r
library( lubridate )
format <- "%m/%d/%Y %I:%M:%S %p"
rice$Date <- parse_date_time( rice$DateTime, orders=format, tz="EST")
class( rice$Date )
```

```
## [1] "POSIXct" "POSIXt"
```

```r
summary( rice$Date )
```

```
##                  Min.               1st Qu.                Median
## "2014-01-01 00:00:00" "2014-01-22 08:22:30" "2014-02-12 16:45:00"
##                  Mean               3rd Qu.                  Max.
## "2014-02-12 16:45:00" "2014-03-06 01:07:30" "2014-03-27 09:30:00"
```

---

# Make Date Data Type 🗓

We should make the weekday a factor so that the ordering is known.

```r
days <- c("Monday","Tuesday","Wednesday","Thursday","Friday","Saturday","Sunday")
rice$Weekday <- weekdays( rice$Date )
rice$Weekday <- factor( rice$Weekday, ordered=TRUE, levels=days)
summary( rice$Weekday )
```

```
##    Monday   Tuesday Wednesday  Thursday    Friday  Saturday    Sunday
##      1152      1152      1248      1191      1152      1152      1152
```

```r
class( rice$Weekday )
```

```
## [1] "ordered" "factor"
```

---

# 🌡 Fahrenheit to Celsius

.pull-left[
.greeninline[Mutate] the data in-line to create a new column.

```r
library( ggplot2 )
rice$AirTemp <- (rice$AirTempF - 32 ) * 5 / 9
# Examine the data.
ggplot( rice, aes(x=AirTemp ) ) +
  geom_histogram( binwidth=1.0, colour = "#333333" ) +
  xlab("Air Temperature (°C)") +
  ylab("Frequency")
```
]

.pull-right[
<img src="slides_files/figure-html/unnamed-chunk-5-1.png" width="504" style="display: block; margin: auto;" />
]

---

# Grab Columns

.orangeinline[Select] the columns of data we will be working with.

.redinline[And] let's not overwrite our old stuff in case we need to come back to it.

```r
df <- rice[ , c("Date","Weekday","AirTemp", "PAR") ]
# Look at the result
head( df )
```

```
## # A tibble: 6 x 4
##   Date                Weekday   AirTemp   PAR
##   <dttm>              <ord>       <dbl> <dbl>
## 1 2014-01-01 00:00:00 Wednesday  -0.561     0
## 2 2014-01-01 00:15:00 Wednesday  -0.711     0
## 3 2014-01-01 00:30:00 Wednesday  -0.433     0
## 4 2014-01-01 00:45:00 Wednesday  -0.811     0
## 5 2014-01-01 01:00:00 Wednesday  -0.594     0
## 6 2014-01-01 01:15:00 Wednesday  -0.772     0
```

---

# Filtering Rows

Two temporal filters are in play here:

- First week in February
- Daytime

--

```r
rice$DateTime[ 25 ]
```

```
## [1] "1/1/2014 6:00:00 AM"
```

--

```r
start_DateTime <- "2/1/2014 12:00:00 AM"
end_DateTime <- "2/7/2014 11:45:00 PM"
start <- parse_date_time( start_DateTime, orders=format, tz="EST")
end <- parse_date_time( end_DateTime, orders=format, tz="EST")
c( start, end )
```

```
## [1] "2014-02-01 00:00:00 EST" "2014-02-07 23:45:00 EST"
```

---

# Filtering on "First Week of February"

```r
df1 <- df[ df$Date >= start & df$Date <= end, ]
# Check the Date Range
summary( df1 )
```

```
##       Date                          Weekday      AirTemp
##  Min.   :2014-02-01 00:00:00   Monday   :96   Min.   :-3.594
##  1st Qu.:2014-02-02 17:56:15   Tuesday  :96   1st Qu.: 1.106
##  Median :2014-02-04 11:52:30   Wednesday:96   Median : 3.778
##  Mean   :2014-02-04 11:52:30   Thursday :96   Mean   : 4.370
##  3rd Qu.:2014-02-06 05:48:45   Friday   :96   3rd Qu.: 6.639
##  Max.   :2014-02-07 23:45:00   Saturday :96   Max.   :16.550
##                                Sunday   :96
##       PAR
##  Min.   :   0.000
##  1st Qu.:   0.000
##  Median :   0.044
##  Mean   : 198.283
##  3rd Qu.: 277.000
##  Max.   :1365.000
##
```

---

class: middle

# Filtering on "Daytime"

.pull-left[
Maybe we can use PAR as a measure of "daytime-ness" here.

```r
hist( df1$PAR, xlab="Photosynthetically Active Radiation", main="" )
```
]

.pull-right[
<img src="slides_files/figure-html/unnamed-chunk-10-1.png" width="504" style="display: block; margin: auto;" />
]

---

# First Pass: PAR > 100 ?

```r
df2 <- df1[ df1$PAR > 100, ]
summary( df2 )
```

```
##       Date                          Weekday      AirTemp            PAR
##  Min.   :2014-02-01 09:00:00   Monday   :17   Min.   :-2.544   Min.   : 104.4
##  1st Qu.:2014-02-02 14:07:30   Tuesday  :34   1st Qu.: 2.500   1st Qu.: 272.9
##  Median :2014-02-04 14:45:00   Wednesday:30   Median : 5.356   Median : 486.2
##  Mean   :2014-02-04 15:01:43   Thursday :36   Mean   : 5.732   Mean   : 573.0
##  3rd Qu.:2014-02-06 12:52:30   Friday   :37   3rd Qu.: 7.900   3rd Qu.: 879.5
##  Max.   :2014-02-07 18:00:00   Saturday :36   Max.   :16.550   Max.   :1365.0
##                                Sunday   :37
```

--

```r
range( df2$Date[ df2$Weekday == "Monday"])
```

```
## [1] "2014-02-03 11:15:00 EST" "2014-02-03 16:45:00 EST"
```

---

# Second Pass: PAR > 25

```r
df2 <- df1[ df1$PAR > 25, ]
summary( df2 )
```

```
##       Date                          Weekday      AirTemp
##  Min.   :2014-02-01 08:30:00   Monday   :36   Min.   :-3.228
##  1st Qu.:2014-02-02 15:37:30   Tuesday  :39   1st Qu.: 2.431
##  Median :2014-02-04 13:30:00   Wednesday:38   Median : 5.306
##  Mean   :2014-02-04 13:47:27   Thursday :40   Mean   : 5.470
##  3rd Qu.:2014-02-06 11:22:30   Friday   :41   3rd Qu.: 7.381
##  Max.   :2014-02-07 18:30:00   Saturday :40   Max.   :16.550
##                                Sunday   :41
##       PAR
##  Min.   :  25.96
##  1st Qu.: 154.80
##  Median : 378.70
##  Mean   : 483.61
##  3rd Qu.: 775.35
##  Max.   :1365.00
##
```

```r
range( df2$Date[ df2$Weekday == "Monday"])
```

```
## [1] "2014-02-03 09:30:00 EST" "2014-02-03 18:15:00 EST"
```

---

# Sunrise🌅 and Sunset 🌆?

It is amazing that someone records these data and makes them available to all of us through a simple internet search.
.pull-left[
![Sunrise 2/1/2014](https://live.staticflickr.com/65535/50381378793_b6517b10fe_w_d.jpg)

![Sunset 2/1/2014](https://live.staticflickr.com/65535/50382255642_a9399a736a_w_d.jpg)
]

.pull-right[
![Sunrise 2/7/2014](https://live.staticflickr.com/65535/50382077786_e59560305e_w_d.jpg)

![Sunset 2/7/2014](https://live.staticflickr.com/65535/50382077716_872bf519a5_w_d.jpg)
]

---

# Hours & Minutes

```r
test <- df1[ df1$Weekday == "Monday",]
test$hour <- hour( test$Date )
test$minute <- minute( test$Date )
test
```

```
## # A tibble: 96 x 6
##    Date                Weekday AirTemp   PAR  hour minute
##    <dttm>              <ord>     <dbl> <dbl> <int>  <int>
##  1 2014-02-03 00:00:00 Monday     9.24 0.007     0      0
##  2 2014-02-03 00:15:00 Monday     8.04 0.01      0     15
##  3 2014-02-03 00:30:00 Monday     6.78 0         0     30
##  4 2014-02-03 00:45:00 Monday     7.05 0.007     0     45
##  5 2014-02-03 01:00:00 Monday     7.4  0.029     1      0
##  6 2014-02-03 01:15:00 Monday     7.74 0.013     1     15
##  7 2014-02-03 01:30:00 Monday     7.84 0.003     1     30
##  8 2014-02-03 01:45:00 Monday     9.15 0.01      1     45
##  9 2014-02-03 02:00:00 Monday     9.93 0.052     2      0
## 10 2014-02-03 02:15:00 Monday     9.63 0.01      2     15
## # … with 86 more rows
```

OK!

---

# Add Hours & Minutes to Filter

```r
df3 <- df1
df3$Hour <- hour( df3$Date )
df3$Minute <- minute( df3$Date )
head( df3 )
```

```
## # A tibble: 6 x 6
##   Date                Weekday  AirTemp   PAR  Hour Minute
##   <dttm>              <ord>      <dbl> <dbl> <int>  <int>
## 1 2014-02-01 00:00:00 Saturday -0.411      0     0      0
## 2 2014-02-01 00:15:00 Saturday -0.967      0     0     15
## 3 2014-02-01 00:30:00 Saturday -0.594      0     0     30
## 4 2014-02-01 00:45:00 Saturday  0.0833     0     0     45
## 5 2014-02-01 01:00:00 Saturday -0.211      0     1      0
## 6 2014-02-01 01:15:00 Saturday -0.0278     0     1     15
```

---

# Filter out Pre-Dawn

```r
df4 <- df3[ df3$Hour >= 7 & df3$Minute >= 15,]
# Check
summary( df4 )
```

```
##       Date                          Weekday      AirTemp
##  Min.   :2014-02-01 07:15:00   Monday   :51   Min.   :-3.594
##  1st Qu.:2014-02-02 19:45:00   Tuesday  :51   1st Qu.: 1.606
##  Median :2014-02-04 15:30:00   Wednesday:51   Median : 4.811
##  Mean   :2014-02-04 15:30:00   Thursday :51   Mean   : 5.026
##  3rd Qu.:2014-02-06 11:15:00   Friday   :51   3rd Qu.: 6.944
##  Max.   :2014-02-07 23:45:00   Saturday :51   Max.   :16.550
##                                Sunday   :51
##       PAR                Hour        Minute
##  Min.   :   0.000   Min.   : 7   Min.   :15
##  1st Qu.:   0.007   1st Qu.:11   1st Qu.:15
##  Median :  82.400   Median :15   Median :30
##  Mean   : 279.134   Mean   :15   Mean   :30
##  3rd Qu.: 449.500   3rd Qu.:19   3rd Qu.:45
##  Max.   :1297.000   Max.   :23   Max.   :45
##
```

---

# Filter Out Post-Sundown

Notice that the `hour()` function returns values from 0-23, so `5:30 PM` is denoted as `17:30`.

```r
df5 <- df4[ df4$Hour <= 17 & df4$Minute <=30, ]
# Check
summary( df5 )
```

```
##       Date                          Weekday      AirTemp            PAR
##  Min.   :2014-02-01 07:15:00   Monday   :22   Min.   :-3.211   Min.   :   0.0
##  1st Qu.:2014-02-02 15:18:45   Tuesday  :22   1st Qu.: 1.431   1st Qu.:  89.8
##  Median :2014-02-04 12:22:30   Wednesday:22   Median : 4.850   Median : 325.1
##  Mean   :2014-02-04 12:22:30   Thursday :22   Mean   : 4.775   Mean   : 427.3
##  3rd Qu.:2014-02-06 09:26:15   Friday   :22   3rd Qu.: 6.808   3rd Qu.: 731.9
##  Max.   :2014-02-07 17:30:00   Saturday :22   Max.   :16.550   Max.   :1297.0
##                                Sunday   :22
##       Hour        Minute
##  Min.   : 7   Min.   :15.0
##  1st Qu.: 9   1st Qu.:15.0
##  Median :12   Median :22.5
##  Mean   :12   Mean   :22.5
##  3rd Qu.:15   3rd Qu.:30.0
##  Max.   :17   Max.   :30.0
##
```

---

# Just to Make Sure

```r
df5[18:24,]
```

```
## # A tibble: 7 x 6
##   Date                Weekday  AirTemp   PAR  Hour Minute
##   <dttm>              <ord>      <dbl> <dbl> <int>  <int>
## 1 2014-02-01 15:30:00 Saturday   11.3   827     15     30
## 2 2014-02-01 16:15:00 Saturday   11.1   399     16     15
## 3 2014-02-01 16:30:00 Saturday   11.0   341.    16     30
## 4 2014-02-01 17:15:00 Saturday   10.7   124.    17     15
## 5 2014-02-01 17:30:00 Saturday   10.4   133.    17     30
## 6 2014-02-02 07:15:00 Sunday      6.62    0      7     15
## 7 2014-02-02 07:30:00 Sunday      5.97    0      7     30
```

.center[
Every row falls between sunrise and sunset! Notice, though, that the separate `Minute` conditions also dropped the :00 and :45 readings from every hour.
![Sunrise 2/1/2014](https://live.staticflickr.com/65535/50381378793_b6517b10fe_w_d.jpg)
![Sunset 2/1/2014](https://live.staticflickr.com/65535/50382255642_a9399a736a_w_d.jpg)
]

---

# Select To Remove Extraneous

```r
df6 <- df5[ , c("Date","Weekday", "AirTemp")]
head( df6 )
```

```
## # A tibble: 6 x 3
##   Date                Weekday  AirTemp
##   <dttm>              <ord>      <dbl>
## 1 2014-02-01 07:15:00 Saturday -3.17
## 2 2014-02-01 07:30:00 Saturday -3.2
## 3 2014-02-01 08:15:00 Saturday -3.21
## 4 2014-02-01 08:30:00 Saturday -3.16
## 5 2014-02-01 09:15:00 Saturday -1.12
## 6 2014-02-01 09:30:00 Saturday -0.0444
```

---

# Summarize In Tabular Form

From these raw data, we can create another `data.frame` that has one row per day and summarizes the temperatures as, say, `Minimum`, `Average`, and `Maximum`.

```r
minTemp <- by( df6$AirTemp, day( df6$Date ), min )
meanTemp <- by( df6$AirTemp, day( df6$Date ), mean )
maxTemp <- by( df6$AirTemp, day( df6$Date ), max )
df.table <- data.frame( Minimum = as.numeric( minTemp ),
                        Average = as.numeric( meanTemp),
                        Maximum = as.numeric( maxTemp ) )
df.table
```

```
##      Minimum   Average   Maximum
## 1 -3.2111111  5.143182 11.383333
## 2  5.9722222 11.197222 16.550000
## 3  4.4833333  5.601010  7.244444
## 4 -0.5055556  3.268939  5.550000
## 5  0.7777778  3.425000  8.644444
## 6 -0.6166667  1.162374  3.061111
## 7 -0.8000000  3.629293  7.677778
```

---

# Set Dates for Each Row

This is kind of a shortcut here.
```r
raw_dates <- mdy( paste( "2", 1:7, "2014", sep="/") )
df.table$Weekday <- weekdays( raw_dates )
df.table
```

```
##      Minimum   Average   Maximum   Weekday
## 1 -3.2111111  5.143182 11.383333  Saturday
## 2  5.9722222 11.197222 16.550000    Sunday
## 3  4.4833333  5.601010  7.244444    Monday
## 4 -0.5055556  3.268939  5.550000   Tuesday
## 5  0.7777778  3.425000  8.644444 Wednesday
## 6 -0.6166667  1.162374  3.061111  Thursday
## 7 -0.8000000  3.629293  7.677778    Friday
```

---

# Select to Reorder Columns

```r
df.table1 <- df.table[ , c(4,1,2,3)]
df.table1
```

```
##     Weekday    Minimum   Average   Maximum
## 1  Saturday -3.2111111  5.143182 11.383333
## 2    Sunday  5.9722222 11.197222 16.550000
## 3    Monday  4.4833333  5.601010  7.244444
## 4   Tuesday -0.5055556  3.268939  5.550000
## 5 Wednesday  0.7777778  3.425000  8.644444
## 6  Thursday -0.6166667  1.162374  3.061111
## 7    Friday -0.8000000  3.629293  7.677778
```

---

# Tabular Output

```r
library( knitr )
library( kableExtra )
t <- kable( df.table1, caption="Table 1: Temperature Ranges for daytime air temperature for the first week of February, 2014 at the Rice Rivers Center in Charles City County, Virginia.")
kable_styling( t )
```

<table class="table" style="margin-left: auto; margin-right: auto;">
<caption>Table 1: Temperature Ranges for daytime air temperature for the first week of February, 2014 at the Rice Rivers Center in Charles City County, Virginia.</caption>
 <thead>
  <tr>
   <th style="text-align:left;"> Weekday </th>
   <th style="text-align:right;"> Minimum </th>
   <th style="text-align:right;"> Average </th>
   <th style="text-align:right;"> Maximum </th>
  </tr>
 </thead>
<tbody>
  <tr>
   <td style="text-align:left;"> Saturday </td>
   <td style="text-align:right;"> -3.2111111 </td>
   <td style="text-align:right;"> 5.143182 </td>
   <td style="text-align:right;"> 11.383333 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> Sunday </td>
   <td style="text-align:right;"> 5.9722222 </td>
   <td style="text-align:right;"> 11.197222 </td>
   <td style="text-align:right;"> 16.550000 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> Monday </td>
   <td style="text-align:right;"> 4.4833333 </td>
   <td style="text-align:right;"> 5.601010 </td>
   <td style="text-align:right;"> 7.244444 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> Tuesday </td>
   <td style="text-align:right;"> -0.5055556 </td>
   <td style="text-align:right;"> 3.268939 </td>
   <td style="text-align:right;"> 5.550000 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> Wednesday </td>
   <td style="text-align:right;"> 0.7777778 </td>
   <td style="text-align:right;"> 3.425000 </td>
   <td style="text-align:right;"> 8.644444 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> Thursday </td>
   <td style="text-align:right;"> -0.6166667 </td>
   <td style="text-align:right;"> 1.162374 </td>
   <td style="text-align:right;"> 3.061111 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> Friday </td>
   <td style="text-align:right;"> -0.8000000 </td>
   <td style="text-align:right;"> 3.629293 </td>
   <td style="text-align:right;"> 7.677778 </td>
  </tr>
</tbody>
</table>

---

# Summarize Graphically

```r
ggplot( df6, aes(x=Date, y=AirTemp, color=Weekday) ) +
  geom_line() +
  geom_point( size = 3 ) +
  theme( legend.position = "none" )
```

<img src="slides_files/figure-html/unnamed-chunk-24-1.png" width="100%" style="display: block; margin: auto;" />

---

# Challenges to Normal R Workflows

The index-based data workflow has several drawbacks, including:

- Lots of individual steps, each step divided into many chunks (21 chunks to get the data from Google Drive to the Tabular Output).
- Uses lots of data frames to hold intermediate results. We created 10 data frames in the process of going from `rice` to `df.table`.

--

If you are working with moderately large data sets, this is not a good strategy.
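
---

# A Hidden Cost of Many Small Steps

The `df4` and `df5` filters test `Hour` and `Minute` in separate conditions, so the minute test applies to every hour rather than only to the boundary hour. Here is a minimal sketch of that effect, assuming a synthetic day of 15-minute timestamps (not the Rice data) and base R only. The name `tod` (time of day in decimal hours) is made up for illustration.

```r
# Synthetic 15-minute timestamps for one day (illustrative only, not the Rice data).
times <- seq( as.POSIXct( "2014-02-01 00:00:00", tz="UTC" ),
              as.POSIXct( "2014-02-01 23:45:00", tz="UTC" ),
              by = "15 min" )
parts  <- as.POSIXlt( times )
Hour   <- parts$hour
Minute <- parts$min

# Separate tests, as in df4 and df5: the minute conditions hit every hour.
separate <- times[ Hour >= 7 & Minute >= 15 & Hour <= 17 & Minute <= 30 ]

# One combined test: convert to decimal hours (hypothetical helper column),
# then compare against 7:15 (7.25) and 17:30 (17.5).
tod <- Hour + Minute / 60
combined <- times[ tod >= 7.25 & tod <= 17.5 ]

length( separate )
length( combined )
```

On this synthetic day, the separate conditions keep 22 readings (only the :15 and :30 rows), while the single decimal-hour comparison keeps all 42 readings from 7:15 through 17:30.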
---

class: inverse
background-image: url("https://live.staticflickr.com/65535/50351963133_cffc707725_c_d.jpg")
background-size: contain
background-position: right

# .green[Tidyverse]

.left-column[
.greeninline[
GGPlot is to built-in graphics as `\(\_\_\_\_\_\_\)` is to built-in R data workflows.

A) Tidyverse
B) Tidyverse
C) Tidyverse, or
D) Tidyverse
]
]

---

# Tidyverse

.pull-left[
![tidy](https://live.staticflickr.com/65535/50295284047_ebb5dec2e8_w_d.jpg)
]

.pull-right[A constellation of Libraries:

- `dplyr`
- `ggplot2`
- `purrr`
- `tibble`
- several more.
]

All of these libraries have been designed to help you be more effective at data analysis.

---

# Load in the Constellation of Libraries

To get the libraries, first load them in<sup>1</sup>.

```r
library( tidyverse)
```

<div class="my-footer"><span><sup>1</sup>If you get an error here saying something like <font class="orangeinline">there is no package called ‘tidyverse’</font> then do <tt>install.packages("tidyverse")</tt> and that should fix it</span></div>

---

# Common Workflow

.middle[
The following general pattern is .fancy[so] common that someone developed a whole package (called `magrittr`, and it is part of the `tidyverse`) just to make sure we never have to do it the hard way.

.large[
.fancy[👉 The output of one function becomes the input of another one]
]
]

---

background-image: url("https://live.staticflickr.com/65535/50382456508_bbb16c248d_c_d.jpg")
background-size: fit

---

# Pipes In Action

Pipes remove the need for a ton of code writing.

.pull-left[
Instead of doing something like this:

```r
df2 <- SOME_OPERATION( df1 )
df3 <- SOME_OTHER_OPERATION( df2 )
df4 <- A_THIRD_OPERATION( df3 )
ggplot( df4, aes(x=...,y=...) ) + geom_point()
```
]

--

.pull-right[
We can instead replace it with the pipe operator (`%>%`) and clean it up considerably.

```r
df1 %>%
  SOME_OPERATION() %>%
  SOME_OTHER_OPERATION() %>%
  A_THIRD_OPERATION() %>%
  ggplot( aes(x=...,y=...)
  ) + geom_point()
```

Notice:

- .redinline[No] reassigning a bunch of intermediate `data.frame` objects, and
- .redinline[No] need to pass a `data.frame` to the next function; by default it is passed in as the first argument.
]

---

# Example - Tabular Summary

```r
df.table1 %>%
  kable( format="html", digits = 2) %>%
  kable_paper( full_width = FALSE ) %>%
  column_spec( 2, color=ifelse( df.table1$Minimum < 0, "blue", ""))
```

<table class=" lightable-paper" style='font-family: "Arial Narrow", arial, helvetica, sans-serif; width: auto !important; margin-left: auto; margin-right: auto;'>
 <thead>
  <tr>
   <th style="text-align:left;"> Weekday </th>
   <th style="text-align:right;"> Minimum </th>
   <th style="text-align:right;"> Average </th>
   <th style="text-align:right;"> Maximum </th>
  </tr>
 </thead>
<tbody>
  <tr>
   <td style="text-align:left;"> Saturday </td>
   <td style="text-align:right;color: blue !important;"> -3.21 </td>
   <td style="text-align:right;"> 5.14 </td>
   <td style="text-align:right;"> 11.38 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> Sunday </td>
   <td style="text-align:right;color: !important;"> 5.97 </td>
   <td style="text-align:right;"> 11.20 </td>
   <td style="text-align:right;"> 16.55 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> Monday </td>
   <td style="text-align:right;color: !important;"> 4.48 </td>
   <td style="text-align:right;"> 5.60 </td>
   <td style="text-align:right;"> 7.24 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> Tuesday </td>
   <td style="text-align:right;color: blue !important;"> -0.51 </td>
   <td style="text-align:right;"> 3.27 </td>
   <td style="text-align:right;"> 5.55 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> Wednesday </td>
   <td style="text-align:right;color: !important;"> 0.78 </td>
   <td style="text-align:right;"> 3.43 </td>
   <td style="text-align:right;"> 8.64 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> Thursday </td>
   <td style="text-align:right;color: blue !important;"> -0.62 </td>
   <td style="text-align:right;"> 1.16 </td>
   <td style="text-align:right;"> 3.06
   </td>
  </tr>
  <tr>
   <td style="text-align:left;"> Friday </td>
   <td style="text-align:right;color: blue !important;"> -0.80 </td>
   <td style="text-align:right;"> 3.63 </td>
   <td style="text-align:right;"> 7.68 </td>
  </tr>
</tbody>
</table>

---

# Example - Graphical Output

.pull-left[
We can pipe right into a `ggplot()` chain (n.b., the plot elements are still added (+) together and not piped).

```r
df.table1 %>%
  ggplot( aes(x=Weekday,y=Average) ) +
  geom_col() +
  ylab("Average Air Temperature (°C)")
```
]

.pull-right[
<img src="slides_files/figure-html/unnamed-chunk-29-1.png" width="504" style="display: block; margin: auto;" />
]

---

# The `dplyr` Library

.pull-left[
.center[
![DPlyr](https://live.staticflickr.com/65535/50382551848_ee84ba4b78_o_d.png)

.fancy[The Grammar of Data Manipulation]
]
]

.pull-right[
The *verbs* are actually functions in `dplyr`:

- Select is done using function `select()`
- Filter is done using function `filter()`
- Mutate is done using function `mutate()`
- Arrange is done using function `arrange()`
- Group is done using function `group_by()`
- Summarize is done using function `summarize()`
]

When combined with `%>%` ... data magic!
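
---

# The Verbs in One Chain

Before turning to the Rice data, here is a small sketch of all six verbs in a single pipe. It assumes R's built-in `airquality` data set (daily New York readings with `Temp` in °F), not the Rice Center data, and the name `may_summary` is made up for illustration.

```r
library( dplyr )

may_summary <- airquality %>%
  select( Month, Day, Temp ) %>%               # keep only the columns we need
  filter( Month == 5 ) %>%                     # keep only the rows for May
  mutate( TempC = (Temp - 32) * 5 / 9 ) %>%    # add a Celsius column
  arrange( desc( TempC ) ) %>%                 # sort warmest first
  group_by( Month ) %>%                        # partition the rows by month
  summarize( Minimum = min( TempC ),
             Average = mean( TempC ),
             Maximum = max( TempC ) )

may_summary
```

Each verb takes a data frame as its first argument and returns a data frame, which is what lets the pipe hand the result from one step to the next.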
---

# <svg style="height:0.8em;top:.04em;position:relative;fill:steelblue;" viewBox="0 0 581 512"><path d="M581 226.6C581 119.1 450.9 32 290.5 32S0 119.1 0 226.6C0 322.4 103.3 402 239.4 418.1V480h99.1v-61.5c24.3-2.7 47.6-7.4 69.4-13.9L448 480h112l-67.4-113.7c54.5-35.4 88.4-84.9 88.4-139.7zm-466.8 14.5c0-73.5 98.9-133 220.8-133s211.9 40.7 211.9 133c0 50.1-26.5 85-70.3 106.4-2.4-1.6-4.7-2.9-6.4-3.7-10.2-5.2-27.8-10.5-27.8-10.5s86.6-6.4 86.6-92.7-90.6-87.9-90.6-87.9h-199V361c-74.1-21.5-125.2-67.1-125.2-119.9zm225.1 38.3v-55.6c57.8 0 87.8-6.8 87.8 27.3 0 36.5-38.2 28.3-87.8 28.3zm-.9 72.5H365c10.8 0 18.9 11.7 24 19.2-16.1 1.9-33 2.8-50.6 2.9v-22.1z"/></svg> Real Example

### Rice Center Monitoring Data

So let's grab the Rice Center Data and step through the process of answering that question:

> Describe the daytime air temperatures at the Rice Rivers Center for the first week of February, 2014.

```r
rice <- read_csv( url )
names( rice )
```

```
##  [1] "DateTime"                       "RecordID"
##  [3] "PAR"                            "WindSpeed_mph"
##  [5] "WindDir"                        "AirTempF"
##  [7] "RelHumidity"                    "BP_HG"
##  [9] "Rain_in"                        "H2O_TempC"
## [11] "SpCond_mScm"                    "Salinity_ppt"
## [13] "PH"                             "PH_mv"
## [15] "Turbidity_ntu"                  "Chla_ugl"
## [17] "BGAPC_CML"                      "BGAPC_rfu"
## [19] "ODO_sat"                        "ODO_mgl"
## [21] "Depth_ft"                       "Depth_m"
## [23] "SurfaceWaterElev_m_levelNad83m"
```

---

# Select

Select allows us to grab columns by name from the `data.frame`.

```r
rice %>%
  select( DateTime, AirTempF ) %>%
  head()
```

```
## # A tibble: 6 x 2
##   DateTime             AirTempF
##   <chr>                   <dbl>
## 1 1/1/2014 12:00:00 AM     31.0
## 2 1/1/2014 12:15:00 AM     30.7
## 3 1/1/2014 12:30:00 AM     31.2
## 4 1/1/2014 12:45:00 AM     30.5
## 5 1/1/2014 1:00:00 AM      30.9
## 6 1/1/2014 1:15:00 AM      30.6
```

---

# Selecting to Drop

To drop columns, you can use the name of the column with a negative sign prepended to it.
```r
rice %>%
  select( -RecordID, -SpCond_mScm, -PH_mv, -Depth_ft, -SurfaceWaterElev_m_levelNad83m ) %>%
  names()
```

```
##  [1] "DateTime"      "PAR"           "WindSpeed_mph" "WindDir"
##  [5] "AirTempF"      "RelHumidity"   "BP_HG"         "Rain_in"
##  [9] "H2O_TempC"     "Salinity_ppt"  "PH"            "Turbidity_ntu"
## [13] "Chla_ugl"      "BGAPC_CML"     "BGAPC_rfu"     "ODO_sat"
## [17] "ODO_mgl"       "Depth_m"
```

---

# Selecting to Rearrange

You can also use it to re-arrange the column order (and because we are lazy, we have the `everything()` function to say 'well, everything else that I haven't already identified').

```r
rice %>%
  select( AirTempF, WindDir, Rain_in, everything() ) %>%
  names()
```

```
##  [1] "AirTempF"                       "WindDir"
##  [3] "Rain_in"                        "DateTime"
##  [5] "RecordID"                       "PAR"
##  [7] "WindSpeed_mph"                  "RelHumidity"
##  [9] "BP_HG"                          "H2O_TempC"
## [11] "SpCond_mScm"                    "Salinity_ppt"
## [13] "PH"                             "PH_mv"
## [15] "Turbidity_ntu"                  "Chla_ugl"
## [17] "BGAPC_CML"                      "BGAPC_rfu"
## [19] "ODO_sat"                        "ODO_mgl"
## [21] "Depth_ft"                       "Depth_m"
## [23] "SurfaceWaterElev_m_levelNad83m"
```

---

# Filter

Filter allows us to select the rows in the data by attributes of the data *within* the table itself.
```r
rice %>%
  filter( AirTempF < 32 ) %>%
  head()
```

```
## # A tibble: 6 x 23
##   DateTime RecordID   PAR WindSpeed_mph WindDir AirTempF RelHumidity BP_HG
##   <chr>       <dbl> <dbl>         <dbl>   <dbl>    <dbl>       <dbl> <dbl>
## 1 1/1/201…    43816     0          3.87    14.6     31.0        80.5  30.3
## 2 1/1/201…    43817     0          4.79    18.5     30.7        82.1  30.3
## 3 1/1/201…    43818     0          3.61    16.2     31.2        81.9  30.3
## 4 1/1/201…    43819     0          2.99    11.5     30.5        83    30.3
## 5 1/1/201…    43820     0          3.52    11.3     30.9        81.8  30.3
## 6 1/1/201…    43821     0          3.83    20.0     30.6        82.8  30.3
## # … with 15 more variables: Rain_in <dbl>, H2O_TempC <dbl>, SpCond_mScm <dbl>,
## #   Salinity_ppt <dbl>, PH <dbl>, PH_mv <dbl>, Turbidity_ntu <dbl>,
## #   Chla_ugl <dbl>, BGAPC_CML <dbl>, BGAPC_rfu <dbl>, ODO_sat <dbl>,
## #   ODO_mgl <dbl>, Depth_ft <dbl>, Depth_m <dbl>,
## #   SurfaceWaterElev_m_levelNad83m <dbl>
```

---

# Mutate

Mutate allows us to change the columns of the data:

```r
rice %>%
  mutate( Date = parse_date_time( DateTime, orders=format, tz="EST") ) %>%
  mutate( Weekday = factor( weekdays( Date ), ordered=TRUE, levels=days) ) %>%
  mutate( AirTemp = (AirTempF - 32) * 5/9 ) %>%
  select( Date, Weekday, AirTemp) %>%
  summary()
```

```
##       Date                          Weekday        AirTemp
##  Min.   :2014-01-01 00:00:00   Monday   :1152   Min.   :-15.6950
##  1st Qu.:2014-01-22 08:22:30   Tuesday  :1152   1st Qu.: -0.2528
##  Median :2014-02-12 16:45:00   Wednesday:1248   Median :  3.0222
##  Mean   :2014-02-12 16:45:00   Thursday :1191   Mean   :  3.7751
##  3rd Qu.:2014-03-06 01:07:30   Friday   :1152   3rd Qu.:  8.0056
##  Max.   :2014-03-27 09:30:00   Saturday :1152   Max.   : 23.8167
##                                Sunday   :1152
```

---

# Naming Columns Nicely

.pull-left[
It is also possible to use this to make more readable column names ("Look ma! No `ylab` needed!"). You just have to surround the new column name with backtick characters.
```r
rice %>%
  mutate( Date = parse_date_time( DateTime, orders=format, tz="EST") ) %>%
  mutate( `Air Temperature (°C)` = (AirTempF - 32) * 5/9 ) %>%
  select( Date, `Air Temperature (°C)`) %>%
  ggplot( aes( x = Date, y = `Air Temperature (°C)`) ) +
  geom_line()
```
]

.pull-right[
<img src="slides_files/figure-html/unnamed-chunk-37-1.png" width="504" style="display: block; margin: auto;" />
]

---

# Arrange

Arrange is used to sort the data.

```r
rice %>%
  arrange( AirTempF ) %>%
  select( DateTime, AirTempF ) %>%
  head()
```

```
## # A tibble: 6 x 2
##   DateTime             AirTempF
##   <chr>                   <dbl>
## 1 1/30/2014 8:45:00 AM     3.75
## 2 1/30/2014 9:00:00 AM     3.82
## 3 1/30/2014 6:45:00 AM     4.43
## 4 1/30/2014 7:00:00 AM     4.66
## 5 1/30/2014 8:30:00 AM     4.93
## 6 1/30/2014 6:30:00 AM     5.02
```

---

# Reverse Arranging (Deranged perhaps?)

Reversing it (i.e., sorting in descending order) is done by prepending a negative sign.

```r
rice %>%
  arrange( -AirTempF ) %>%
  select( DateTime, AirTempF ) %>%
  head()
```

```
## # A tibble: 6 x 2
##   DateTime             AirTempF
##   <chr>                   <dbl>
## 1 3/11/2014 5:45:00 PM     74.9
## 2 3/11/2014 5:30:00 PM     74.6
## 3 3/11/2014 3:45:00 PM     74.4
## 4 3/11/2014 6:00:00 PM     74.1
## 5 3/11/2014 4:00:00 PM     73.4
## 6 3/11/2014 4:45:00 PM     73.0
```

---

# Grouping By A Feature

So here is where we start getting to have some fun. The `group_by` function partitions the data and is used to create content for the subsequent steps.

Think about the various ways we have used `by()` thus far. For these, we had to:

1. Identify a column to use as a grouping.
2. Apply some function to those individual groups.

--

```r
class( rice )
```

```
## [1] "spec_tbl_df" "tbl_df"      "tbl"         "data.frame"
```

---

# Grouping By A Feature

After we make a grouping column and then `group_by()` that column, the data gains an additional class (`grouped_df`).
```r
rice %>%
  mutate( Date = parse_date_time( DateTime, orders=format, tz="EST") ) %>%
  mutate( Weekday = factor( weekdays( Date ), ordered=TRUE, levels=days) ) %>%
  group_by( Weekday ) %>%
  class()
```

```
## [1] "grouped_df" "tbl_df"     "tbl"        "data.frame"
```

The overall 'look' of `rice` does not change, but it can now do cool stuff with `summarize()`.

---

# Summarize

Summarize allows you to take a bit of the original data and then perform operations on it to create a new `data.frame`.

```r
rice %>%
  mutate( Date = parse_date_time( DateTime, orders=format, tz="EST") ) %>%
  mutate( Weekday = factor( weekdays( Date ), ordered=TRUE, levels=days) ) %>%
  group_by( Weekday ) %>%
  summarize( Rain = sum( Rain_in ) )
```

```
## # A tibble: 7 x 2
##   Weekday    Rain
## * <ord>     <dbl>
## 1 Monday    1.96
## 2 Tuesday   1.31
## 3 Wednesday 0.327
## 4 Thursday  1.21
## 5 Friday    0.80
## 6 Saturday  1.03
## 7 Sunday    0.256
```

Only the columns named in the `group_by()` and `summarize()` statements are kept in the output.

---

# Workflow Judo!🥋

.pull-left[
```r
daytime <- rice %>%
  mutate( Date = parse_date_time( DateTime, orders="%m/%d/%Y %I:%M:%S %p", tz="EST") ) %>%
  mutate( Weekday = factor( weekdays( Date ), ordered=TRUE,
                            levels = c("Monday", "Tuesday", "Wednesday",
                                       "Thursday", "Friday", "Saturday", "Sunday") ) ) %>%
  mutate( `Temperature (°C)` = (AirTempF - 32) * 5/9 ) %>%
  select( Date, Weekday, `Temperature (°C)`) %>%
  # As in the indexed version, these conditions keep only the
  # :15 and :30 readings from hours 7 through 17.
  filter( hour( Date ) >= 7 & minute( Date ) >= 15,
          hour( Date ) <= 17 & minute( Date ) <= 30 ) %>%
  filter( Date >= mdy("2/1/2014") & Date < mdy("2/8/2014") ) %>%
  group_by( Weekday ) %>%
  summarize( Minimum = min( `Temperature (°C)` ),
             Average = mean( `Temperature (°C)`),
             Maximum = max( `Temperature (°C)` ) )

daytime %>%
  kable( format="html", digits = 2 ) %>%
  kable_paper( full_width = FALSE ) %>%
  # Color from the summarized table itself (not df.table1, which is in
  # Saturday-first order) so the blue cells line up with the rows shown.
  column_spec( 2, color=ifelse( daytime$Minimum < 0, "blue", ""))
```
]

--

.pull-right[
The output table is:

<table class=" lightable-paper" style='font-family: "Arial Narrow", arial, helvetica, sans-serif; width: auto !important; margin-left: auto; margin-right: auto;'>
 <thead>
  <tr>
   <th style="text-align:left;"> Weekday </th>
   <th style="text-align:right;"> Minimum </th>
   <th style="text-align:right;"> Average </th>
   <th style="text-align:right;"> Maximum </th>
  </tr>
 </thead>
<tbody>
  <tr>
   <td style="text-align:left;"> Monday </td>
   <td style="text-align:right;color: !important;"> 4.48 </td>
   <td style="text-align:right;"> 5.60 </td>
   <td style="text-align:right;"> 7.24 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> Tuesday </td>
   <td style="text-align:right;color: blue !important;"> -0.51 </td>
   <td style="text-align:right;"> 3.27 </td>
   <td style="text-align:right;"> 5.55 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> Wednesday </td>
   <td style="text-align:right;color: !important;"> 0.78 </td>
   <td style="text-align:right;"> 3.43 </td>
   <td style="text-align:right;"> 8.64 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> Thursday </td>
   <td style="text-align:right;color: blue !important;"> -0.62 </td>
   <td style="text-align:right;"> 1.16 </td>
   <td style="text-align:right;"> 3.06 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> Friday </td>
   <td style="text-align:right;color: blue !important;"> -0.80 </td>
   <td style="text-align:right;"> 3.63 </td>
   <td style="text-align:right;"> 7.68 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> Saturday </td>
   <td style="text-align:right;color: blue !important;"> -3.21 </td>
   <td style="text-align:right;"> 5.14 </td>
   <td style="text-align:right;"> 11.38 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> Sunday </td>
   <td style="text-align:right;color: !important;"> 5.97 </td>
   <td style="text-align:right;"> 11.20 </td>
   <td style="text-align:right;"> 16.55 </td>
  </tr>
</tbody>
</table>
]

---

class: middle
background-image: url("images/contour.png")
background-position: right
background-size: auto

.center[

# 🙋🏻‍♀️ Questions?
![Peter Sellers](https://live.staticflickr.com/65535/50382906427_2845eb1861_o_d.gif+)
]

<p> </p>

.bottom[ If you have any questions about the content presented herein, please feel free to [submit them to me](mailto://rjdyer@vcu.edu) and I'll get back to you as soon as possible.]