class: left, middle, inverse background-image: url("https://live.staticflickr.com/65535/50362989122_a8ee154fea_k_d.jpg") background-size: cover # .orange[Relational Data!] ## .fancy[Data Within Many Tables] --- # Multi-Table Data Rarely do we keep our data within a single data table. .pull-left[ #### Table Structure Consider the two tables to the right. - Common column labeled *Key* - Other data columns (only 1 in each for brevity) ] .pull-right[![Example data table structure](https://live.staticflickr.com/65535/50427672632_24e45139a8_c_d.jpg)] --- # Varieties of Keys 🗝 .pull-left[### Primary Key A *primary key* is a column in a table that uniquely identifies a single row. There MUST be a unique identifier to be a *Primary Key*. ```r people <- data.frame( Key = c("A","B","C"), Name = c("Bob","Alice","Mary"), Major = c("ENVS", "ENVS", "BIOL") ) people %>% count( Key ) ``` ``` ## Key n ## 1 A 1 ## 2 B 1 ## 3 C 1 ``` ] -- .pull-right[ ### Foreign Key A foreign key is one that references a primary key in *another* table. ```r homework <- data.frame( Key = c("A","B","A","D","B"), Grade = rpois(5,lambda = 100) ) homework ``` ``` ## Key Grade ## 1 A 115 ## 2 B 114 ## 3 A 93 ## 4 D 107 ## 5 B 80 ``` Here the `Key` column is referencing a unique row of data in the `people` data table. ] --- class: sectionTitle, inverse # .green[Joins] ### .fancy[Merging data tables] --- # Taxonomy of Joins We can combine the data in these tables in several different ways based upon what we are looking for. When thinking of joins, we must think about the how we want to select the overlapping sets of keys in both data.frames. - *Full Join* (aka *outer join*) - *Left Join* - *Right Join* - *Inner Join* The .redinline[*position*] adjective relates to which of the rows are selected for the join and end up in the resutling data table. --- # 'Left' & 'Right' Tables .center[ ![Example data table structure](https://live.staticflickr.com/65535/50427672632_24e45139a8_c_d.jpg) ] --- # Full/Outer Join The outer join has .redinline[all the data from both left & right tables]. All keys are present in the result. ![Outer Join](https://live.staticflickr.com/65535/50427993992_4ccede1574_c_d.jpg) --- # Outer Join All homework and people data. ```r people %>% full_join( homework, by="Key" ) ``` ``` ## Key Name Major Grade ## 1 A Bob ENVS 115 ## 2 A Bob ENVS 93 ## 3 B Alice ENVS 114 ## 4 B Alice ENVS 80 ## 5 C Mary BIOL NA ## 6 D <NA> <NA> 107 ``` --- # Left Join The left join is one where the result has all the keys from the left but only those in the right one that are in the left. ![left join](https://live.staticflickr.com/65535/50427817371_678f0f64c7_c_d.jpg) -- ```r people %>% left_join( homework, by="Key") ``` ``` ## Key Name Major Grade ## 1 A Bob ENVS 115 ## 2 A Bob ENVS 93 ## 3 B Alice ENVS 114 ## 4 B Alice ENVS 80 ## 5 C Mary BIOL NA ``` --- # Right Join The Right join results in all the keys from the right data table and the matching ones from the left. .center[ ![Right Join](https://live.staticflickr.com/65535/50427125528_0de6281475_c_d.jpg) ] -- ```r people %>% right_join( homework, by="Key") ``` ``` ## Key Name Major Grade ## 1 A Bob ENVS 115 ## 2 A Bob ENVS 93 ## 3 B Alice ENVS 114 ## 4 B Alice ENVS 80 ## 5 D <NA> <NA> 107 ``` --- # Inner Joins Inner joins result in the intersection of keys. ![Inner Join](https://live.staticflickr.com/65535/50427125683_ac44eb1500_c_d.jpg) -- ```r people %>% inner_join( homework, by="Key") ``` ``` ## Key Name Major Grade ## 1 A Bob ENVS 115 ## 2 A Bob ENVS 93 ## 3 B Alice ENVS 114 ## 4 B Alice ENVS 80 ``` --- class: sectionTitle, inverse # .blue[Filtering Joins] ## .fancy[Not combining but refining...] --- # The Semi Join We can also use joins to filter values within one `data.frame`. Here the `semi_join()` keeps everything in the left data that has a key in the right one, but **importantly** it does not import the right data columns into the result. -- ```r people %>% semi_join( homework, by="Key") ``` ``` ## Key Name Major ## 1 A Bob ENVS ## 2 B Alice ENVS ``` --- # The Anti Join The opposite of the `seim_join()` is the `anti_join()` which drops everything in the left table that has a key in the right one, leaving only the ones that are unique. -- ```r people %>% anti_join( homework, by = "Key") ``` ``` ## Key Name Major ## 1 C Mary BIOL ``` --- class: sectionTitle, inverse # .green[The NYC Airport Data] ### .fancy[Let's dial up the complexity] --- # The Library .pull-left[ Four `data.frames` are contained within the library representing all the flights from NYC airports in 2013. ```r library( nycflights13 ) ``` - `airlines` - `airports` - `planes` - `weather` - `flights` ] .pull-right[
] --- # Airlines ```r head(airlines) ``` ``` ## # A tibble: 6 x 2 ## carrier name ## <chr> <chr> ## 1 9E Endeavor Air Inc. ## 2 AA American Airlines Inc. ## 3 AS Alaska Airlines Inc. ## 4 B6 JetBlue Airways ## 5 DL Delta Air Lines Inc. ## 6 EV ExpressJet Airlines Inc. ``` --- # Airports ```r head( airports ) ``` ``` ## # A tibble: 6 x 8 ## faa name lat lon alt tz dst tzone ## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <chr> <chr> ## 1 04G Lansdowne Airport 41.1 -80.6 1044 -5 A America/New_York ## 2 06A Moton Field Municipal Airport 32.5 -85.7 264 -6 A America/Chicago ## 3 06C Schaumburg Regional 42.0 -88.1 801 -6 A America/Chicago ## 4 06N Randall Airport 41.4 -74.4 523 -5 A America/New_York ## 5 09J Jekyll Island Airport 31.1 -81.4 11 -5 A America/New_York ## 6 0A9 Elizabethton Municipal Airport 36.4 -82.2 1593 -5 A America/New_York ``` --- # Planes ```r head( planes ) ``` ``` ## # A tibble: 6 x 9 ## tailnum year type manufacturer model engines seats speed engine ## <chr> <int> <chr> <chr> <chr> <int> <int> <int> <chr> ## 1 N10156 2004 Fixed wing multi engine EMBRAER EMB-145XR 2 55 NA Turbo-fan ## 2 N102UW 1998 Fixed wing multi engine AIRBUS INDUSTRIE A320-214 2 182 NA Turbo-fan ## 3 N103US 1999 Fixed wing multi engine AIRBUS INDUSTRIE A320-214 2 182 NA Turbo-fan ## 4 N104UW 1999 Fixed wing multi engine AIRBUS INDUSTRIE A320-214 2 182 NA Turbo-fan ## 5 N10575 2002 Fixed wing multi engine EMBRAER EMB-145LR 2 55 NA Turbo-fan ## 6 N105UW 1999 Fixed wing multi engine AIRBUS INDUSTRIE A320-214 2 182 NA Turbo-fan ``` --- # Weather ```r head(weather) ``` ``` ## # A tibble: 6 x 15 ## origin year month day hour temp dewp humid wind_dir wind_speed wind_gust precip pressure visib ## <chr> <int> <int> <int> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> ## 1 EWR 2013 1 1 1 39.0 26.1 59.4 270 10.4 NA 0 1012 10 ## 2 EWR 2013 1 1 2 39.0 27.0 61.6 250 8.06 NA 0 1012. 10 ## 3 EWR 2013 1 1 3 39.0 28.0 64.4 240 11.5 NA 0 1012. 10 ## 4 EWR 2013 1 1 4 39.9 28.0 62.2 250 12.7 NA 0 1012. 10 ## 5 EWR 2013 1 1 5 39.0 28.0 64.4 260 12.7 NA 0 1012. 10 ## 6 EWR 2013 1 1 6 37.9 28.0 67.2 240 11.5 NA 0 1012. 10 ## # … with 1 more variable: time_hour <dttm> ``` --- # Flights This is the main `data.frame`. ```r names( flights ) ``` ``` ## [1] "year" "month" "day" "dep_time" "sched_dep_time" "dep_delay" ## [7] "arr_time" "sched_arr_time" "arr_delay" "carrier" "flight" "tailnum" ## [13] "origin" "dest" "air_time" "distance" "hour" "minute" ## [19] "time_hour" ``` --- class: middle background-image: url("images/contour.png") background-position: right background-size: auto .center[ # ️ Questions? ![Peter Sellers](https://live.staticflickr.com/65535/50382906427_2845eb1861_o_d.gif+) ] <p> </p> .bottom[ If you have any questions for about the content presented herein, please feel free to [submit them to me](mailto://rjdyer@vcu.edu) and I'll get back to you as soon as possible.]