class: left, bottom
background-image: url("images/contour.png")
background-position: right
background-size: auto

# The Workflow of <br>Data Analysis

### The mechanics of data wizardry 🧙

<p> </p>

<p> </p>

<img src="images/logo1.svg" width="400px">

---
class: sectionTitle

# .green[Actions]

## The *verbs* of data analysis

---
background-image: url("https://live.staticflickr.com/65535/50362129663_0d640ad239_k_d.jpg")
background-size: cover

# Data Operators

.pull-left[
There are a finite number of actions (verbs) that we can use on the raw data we work with. They can be combined to yield meaningful (or whimsical) inferences from our data:

.red[ Is there more sun on Fridays than on the weekend?]

.orange[ What is the distribution of high-tide depths for each <br> day in January?]

.blue[ Is there a visible relationship between water salinity &<br> measured pH?]
]

---
background-image: url("https://live.staticflickr.com/65535/50362827791_a32934b310_k_d.jpg")
background-size: cover

# Select

.pull-left[ Identify only the subset of data columns that you are interested in using.]

---
background-image: url("https://live.staticflickr.com/65535/50362989322_6aa00c8398_k_d.jpg")
background-size: cover

# Filter

.pull-left[ Use only some subset of rows in the data, based upon qualities within the columns themselves. ]

---
background-image: url("https://live.staticflickr.com/65535/50362827946_d8d5508dfd_k_d.jpg")
background-size: cover

# Mutate

.pull-left[ Create new columns from existing ones: convert one data type to another, scale, combine, or make any other derived component. ]

---
background-image: url("https://live.staticflickr.com/65535/50362129893_61851436c8_k_d.jpg")
background-size: cover

# Arrange

.pull-left[ Reorder the data, using the values in one or more columns to sort. ]

---
background-image: url("https://live.staticflickr.com/65535/50362869456_c869b2a0a9_k_d.jpg")
background-size: cover

# Group

.pull-left[ Partition the data set into groups based upon some taxonomy of categorization.
]

---
background-image: url("https://live.staticflickr.com/65535/50362989492_d4e281b741_k_d.jpg")
background-size: cover

# Summarize

.pull-left[ Perform operations on the data to characterize trends in the raw data as summary statistics. ]

---

# Combinations Yield Inference

Combining these actions together is how we perform the analyses.

--
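
A minimal sketch of such a combination, using a small made-up table (the `toy` tibble, its column names, and its values are hypothetical and are not part of the field data). It chains the verbs with the `%>%` pipe operator, which is explained in the slides that follow:

```r
library(dplyr)

# Hypothetical toy data, invented for illustration only
toy <- tibble(
  station = c("A", "A", "A", "B", "B", "C"),
  temp_c  = c(20, 22, 24, 30, 32, 25)
)

result <- toy %>%
  select(station, temp_c) %>%               # keep only the columns of interest
  group_by(station) %>%                     # partition the rows by station
  summarize(n = n(),                        # number of measurements per station
            mean_c = mean(temp_c)) %>%      # average temperature per station
  filter(n >= 2) %>%                        # keep stations with at least 2 measurements
  mutate(mean_f = mean_c * 9 / 5 + 32) %>%  # convert Celsius to Fahrenheit
  arrange(desc(mean_f))                     # sort from hot to cold

result
```

Here station `C` drops out because it has a single measurement, and `B` is listed before `A` because it is warmer.
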
---
class: sectionTitle

# Data Judo 🥋

### Do your thing *well* then pass it along

---

# Let's Load the Data

```r
library( tidyverse )  # provides read_csv() and the verbs used below
field_data <- read_csv("Field_Data.csv")  # read a local copy of the file
```

```r
# or read the same file directly from the course repository
url <- "https://raw.githubusercontent.com/dyerlab/ENVS-Lectures/master/data/deq_data/Field_Data.csv"
field_data <- read_csv(url)
```

```r
summary( field_data )
```

```
##      Fdt_Id         Fdt_Sta_Id        Fdt_Date_Time        Fdt_Depth
##  Min.   :2964715   Length:116         Length:116         Min.   : 0.100
##  1st Qu.:2965259   Class :character   Class :character   1st Qu.: 1.000
##  Median :2965716   Mode  :character   Mode  :character   Median : 4.000
##  Mean   :2965970                                         Mean   : 5.103
##  3rd Qu.:2966263                                         3rd Qu.: 7.000
##  Max.   :2967321                                         Max.   :27.000
##   Fdt_Salinity   Fdt_Temp_Celcius Fdt_Do_Optical
##  Min.   :16.48   Min.   :22.97    Min.   : 0.630
##  1st Qu.:17.45   1st Qu.:27.20    1st Qu.: 4.970
##  Median :18.58   Median :28.55    Median : 6.845
##  Mean   :19.38   Mean   :28.12    Mean   : 6.101
##  3rd Qu.:20.84   3rd Qu.:28.95    3rd Qu.: 7.532
##  Max.   :26.27   Max.   :29.94    Max.   :10.000
```

---

# Example Problem

Let's load in the Field Data again and produce a table that reports the average temperature for all stations that have 10 or more measurements, arranged from hot to cold in degrees Fahrenheit (sorry, I just had to make up a reason to *mutate* a column of data).

```r
library( tidyverse )
```

For this, we'll have to *select* columns, *group*, *summarize*, *filter*, *mutate*, and *arrange*. Let's take these one at a time to see how it all works.

---

# `dplyr::select` First Look

Let's look at `select()` as a function.

```r
?select
```

In the next set of examples, I'm just going to dump the output to show you, but in reality we'd probably either plot it, make a table, or assign it to a new variable for subsequent analysis.

```r
select( field_data, Fdt_Sta_Id, Fdt_Temp_Celcius)
```

```
## # A tibble: 116 × 2
##    Fdt_Sta_Id  Fdt_Temp_Celcius
##    <chr>                  <dbl>
##  1 7ACHE055.97             28.9
##  2 7ACHE040.39             29.4
##  3 7ACHE040.39             29.3
##  4 7ACHE040.39             29.3
##  5 7ACHE040.39             29.3
##  6 7ACHE040.39             29.2
##  7 7ACHE040.39             29.2
##  8 7ACHE040.39             28.8
##  9 7ACHE040.39             28.4
## 10 7ACHE040.39             28.4
## # … with 106 more rows
```

---

# Piping

We can encapsulate the flow diagram in actual code by connecting individual *verbs* (functions) in a workflow with an operator that

> passes the output of this function to the input of that function

.center[ ![](https://upload.wikimedia.org/wikipedia/en/b/b9/MagrittePipe.jpg) ]

---

# Piping to Select

The `%>%` (yes, that is percent, greater-than, percent) is the pipe operator that can chain together many operations.

```r
field_data %>%
  select( Fdt_Sta_Id, Fdt_Temp_Celcius )
```

```
## # A tibble: 116 × 2
##    Fdt_Sta_Id  Fdt_Temp_Celcius
##    <chr>                  <dbl>
##  1 7ACHE055.97             28.9
##  2 7ACHE040.39             29.4
##  3 7ACHE040.39             29.3
##  4 7ACHE040.39             29.3
##  5 7ACHE040.39             29.3
##  6 7ACHE040.39             29.2
##  7 7ACHE040.39             29.2
##  8 7ACHE040.39             28.8
##  9 7ACHE040.39             28.4
## 10 7ACHE040.39             28.4
## # … with 106 more rows
```

Notice the following:

- Just putting `field_data` on a line by itself would print it out,
- The `%>%` takes that as input and passes it as the first argument to the next function,
- We *did not* have to quote the column names (as long as there are no spaces in them).

---

# Example - Arrange Temperature Descending

RStudio will automatically indent subsequent lines for you as a visual reminder that you are continuing on the same analysis pipeline.

```r
field_data %>%
  arrange( -Fdt_Temp_Celcius)
```

```
## # A tibble: 116 × 7
##     Fdt_Id Fdt_Sta_Id  Fdt_Date_Time   Fdt_Depth Fdt_Salinity Fdt_Temp_Celcius
##      <dbl> <chr>       <chr>               <dbl>        <dbl>            <dbl>
##  1 2966258 7ACHE055.97 8/12/2020 13:00       5           17.8             29.9
##  2 2966242 7ACHE055.97 8/12/2020 13:00       0.1         17.6             29.9
##  3 2966245 7ACHE055.97 8/12/2020 13:00       0.5         17.6             29.8
##  4 2965635 7ACHE044.14 8/11/2020 15:30       0.1         16.6             29.6
##  5 2965643 7ACHE044.14 8/11/2020 15:30       2           16.6             29.6
##  6 2966246 7ACHE055.97 8/12/2020 13:00       1           17.7             29.6
##  7 2965646 7ACHE044.14 8/11/2020 15:30       3           16.6             29.6
##  8 2965704 7ACHE047.42 8/11/2020 14:45       0.1         16.9             29.4
##  9 2965707 7ACHE047.42 8/11/2020 14:45       0.5         16.9             29.4
## 10 2965699 7ACHE040.39 8/11/2020 13:30       9.5         18.8             29.4
## # … with 106 more rows, and 1 more variable: Fdt_Do_Optical <dbl>
```

---

# Filtering Data - choosing rows to use

```r
field_data %>%
  filter( Fdt_Depth > 8)
```

```
## # A tibble: 19 × 7
##     Fdt_Id Fdt_Sta_Id  Fdt_Date_Time   Fdt_Depth Fdt_Salinity Fdt_Temp_Celcius
##      <dbl> <chr>       <chr>               <dbl>        <dbl>            <dbl>
##  1 2966267 7ACHE055.97 8/12/2020 13:00      15           18.4             28.9
##  2 2965698 7ACHE040.39 8/11/2020 13:30       9           18.8             28.4
##  3 2965699 7ACHE040.39 8/11/2020 13:30       9.5         18.8             29.4
##  4 2966239 7ACHE055.60 8/12/2020 10:45      27           20.4             28.3
##  5 2967318 7ACHE040.04 8/18/2020 12:00       9           25.2             25.6
##  6 2967319 7ACHE040.04 8/18/2020 12:00      10           25.2             25.6
##  7 2967321 7ACHE040.04 8/18/2020 12:00      11.5         25.3             25.6
##  8 2966236 7ACHE055.60 8/12/2020 10:45      15           19.3             28.4
##  9 2966238 7ACHE055.60 8/12/2020 10:45      25           20.3             28.3
## 10 2966265 7ACHE055.97 8/12/2020 13:00       9           17.8             28.9
## 11 2966266 7ACHE055.97 8/12/2020 13:00      10           17.9             28.9
## 12 2966269 7ACHE055.97 8/12/2020 13:00      20           18.6             29.0
## 13 2965273 7ACHE023.47 8/11/2020 8:30       11           22.0             28.3
## 14 2967320 7ACHE040.04 8/18/2020 12:00      11           25.3             25.6
## 15 2966237 7ACHE055.60 8/12/2020 10:45      20           19.7             28.3
## 16 2965258 7ACHE023.47 8/11/2020 8:30        9           21.6             28.5
## 17 2965259 7ACHE023.47 8/11/2020 8:30       10           21.8             28.4
## 18 2966234 7ACHE055.60 8/12/2020 10:45       9           18.7             28.5
## 19 2966235 7ACHE055.60 8/12/2020 10:45      10           19.0             28.4
## # … with 1 more variable: Fdt_Do_Optical <dbl>
```

---

# Group & Summarize

These two **always** come as a pair. We use a column of data to group records and then perform some operation on those records *independently* for each level of that grouping variable.

--

.green[Example: Average temperature for each station]

--

```r
field_data %>%
  group_by( Fdt_Sta_Id ) %>%
  summarize( `Temperature (°C)` = mean( Fdt_Temp_Celcius))
```

```
## # A tibble: 10 × 2
##    Fdt_Sta_Id  `Temperature (°C)`
##    <chr>                    <dbl>
##  1 7ACHE004.29               26.6
##  2 7ACHE013.48               28.0
##  3 7ACHE023.47               28.7
##  4 7ACHE026.06               26.8
##  5 7ACHE040.04               26.5
##  6 7ACHE040.39               29.0
##  7 7ACHE044.14               28.6
##  8 7ACHE047.42               28.5
##  9 7ACHE055.60               28.7
## 10 7ACHE055.97               29.2
```

---

# Summarize Changes DataFrame

The result of a `summarize()` call has **only** the columns designated by `group_by()` or made *de novo* in `summarize()`.

```r
field_data %>%
  group_by( Fdt_Sta_Id ) %>%
  summarize( N = n(),
             Minimum = min( Fdt_Temp_Celcius),
             Mean = mean( Fdt_Temp_Celcius),
             Max = mean( Fdt_Temp_Celcius ) )
```

```
## # A tibble: 10 × 5
##    Fdt_Sta_Id      N Minimum  Mean   Max
##    <chr>       <int>   <dbl> <dbl> <dbl>
##  1 7ACHE004.29     9    23.0  26.6  26.6
##  2 7ACHE013.48     9    26.5  28.0  28.0
##  3 7ACHE023.47    13    28.3  28.7  28.7
##  4 7ACHE026.06    10    25.7  26.8  26.8
##  5 7ACHE040.04    14    25.6  26.5  26.5
##  6 7ACHE040.39    12    28.4  29.0  29.0
##  7 7ACHE044.14    10    25.7  28.6  28.6
##  8 7ACHE047.42     9    23.4  28.5  28.5
##  9 7ACHE055.60    16    28.3  28.7  28.7
## 10 7ACHE055.97    14    28.9  29.2  29.2
```

---

# A Realistic Problem

What is the average temperature for stations whose range of depths is greater than 10 (units, anyone?), arranged by decreasing temperature?

```r
# load the data
# select stations, temperature, and depth
# group by station
# summarize: number of samples, range of depth, and mean of temperature
# filter on sample size
# arrange by temperature
# select out the number of samples and range of depths
```

---

# 15 Minute Activity - Your Turn

Create an R script in the project folder named `tidyverse_examples.R` and answer the following questions using `Field_Data.csv` as a data source.

1. Load in `library(tidyverse)` at the top of the file.
2. Load the field data in and assign it to a suitably named variable.
3. Which station has the largest variation in DO?
4. Make a new `data.frame` that has min, mean, and max temperature and salinity by `Fdt_Sta_Id`.

---
class: middle
background-image: url("images/contour.png")
background-position: right
background-size: auto

.center[ ![## Any Questions](https://media.giphy.com/media/3o6MbhEsVnMOkWul44/giphy.gif) ]