class: left, bottom
background-image: url("images/contour.png")
background-position: right
background-size: auto

# Numerical Data<br/> & Data Frames

### This is where we leave ![Excel](images/excel.png) behind

<p> </p>

<p> </p>

<img src="images/logo1.svg" width="400px">

---
class: sectionTitle, inverse

# Numerical Data

---

# Numerical Data

In `R`, numerical data is largely represented by a data type called `numeric`.

--

- For most purposes, this is the only data type we will need (though `integer` types and specialized libraries exist).

--

- The range of magnitudes is determined by your computer (my MacBook can handle 2.225074e-308 to 1.797693e+308).

---

# Operators

In many ways, `R` can act just like an interactive calculator. *Arithmetic operators* are just like normal.

```r
x <- 10
y <- 23
x + y
x - y
x * y
x / y
```

---

# Exponential Operators

*Exponents* use the caret on the keyboard (on US English keyboards it is above the 6 key). So the value of `\(2^{16}\)` is

```r
2^16
```

```
## [1] 65536
```

--

Roots are found by inverting the exponent. For example, `\(\sqrt[3]{27}\)` (the cube root of 27) is

```r
27^(1/3)
```

```
## [1] 3
```

---

# Logarithms

Logarithms are provided by the function `log()`, which defaults to the natural log.

```r
log( 10 )
```

```
## [1] 2.302585
```

--

You can change the base by passing the function the optional argument `base` (make sure you separate the value from the optional argument with a comma).

```r
log( 10, base=10 )
```

```
## [1] 1
```

---

# Additional Operators

.center[
*Potential Operations >>> Symbols on Keyboard*
]

*Modulus Operator*

```r
23 %% 10
```

```
## [1] 3
```

---

# Order of Operations

The order of precedence for operations is just like you learned in math class.
```r
x1 <- 23
y1 <- 55
x2 <- 56
y2 <- 63
distance <- sqrt( (x1-x2)^2 + (y1-y2)^2 )
distance
```

```
## [1] 33.95585
```

---

# `?Syntax`

Operator        | Description
----------------|-------------------------------------------
:: :::          | access variables in a namespace
$ @             | component / slot extraction
[ [[            | indexing
^               | exponentiation (right to left)
- +             | unary minus and plus
:               | sequence operator
%any%           | special operators (including %% and %/%)
* /             | multiply, divide
+ -             | (binary) add, subtract
< > <= >= == != | ordering and comparison
!               | negation
& &&            | and
\| \|\|         | or
~               | as in formulae
-> ->>          | rightwards assignment
<- <<-          | assignment (right to left)
=               | assignment (right to left)
?               | help (unary and binary)

---
class: sectionTitle

# Introspection & Coercion

---

# Introspection

In `R`, each variable can be queried about its `class` (what kind of data that particular variable holds).

```r
x <- 42
class( x )
```

```
## [1] "numeric"
```

--

You can also ask if it is a particular type using the `is.numeric()` function.

```r
is.numeric( x )
```

```
## [1] TRUE
```

---

# Coercion

We can also turn *one representation* of our data into a different type, though there are limitations. For example, if we just read in a text file and a value is represented as text (a [Character Data Type](../character_data/slides.html) in `R`) but we need it to function as a `numeric` type, we can use the following approach.

```r
x <- "42"
class( x )
```

```
## [1] "character"
```

--

Then create a new variable that (if possible) contains the numeric representation of the character string `"42"`.

```r
y <- as.numeric( x )
class(y)
```

```
## [1] "numeric"
```

---

# Coercion Fail

When coercion fails, it returns a warning and a missing data value.
```r
as.numeric( "Bob" )
```

```
## Warning: NAs introduced by coercion
```

```
## [1] NA
```

--

<div class="box-red">It is acknowledged that many error messages in R may not be "comprehensible" to the user and it is not clear if this is a *feature* or a *bug*.</div>

---
class: sectionTitle

# Caveats

---

# Order of Operations

There are times that the order of operations will really come back to .red[bite you]. Consider this example where I create a sequence of numbers using the sequence operator (`:`)

```r
n <- 4
1:n
```

```
## [1] 1 2 3 4
```

--

So if we wanted to make a sequence from 1 to `\(n-1\)`, we *could* type this:

```r
1:n-1
```

--

```
## [1] 0 1 2 3
```

---

The sequence operator binds more tightly than subtraction, so `1:n-1` is evaluated as `(1:n) - 1`; writing `1:(n-1)` gives the intended result. To *fix* this, feel free to be *verbose* in your use of parentheses. If you are intending to get `\(10^2\)`, `\(10^3\)`, `\(\ldots\)`, `\(10^6\)` and type it as:

```r
10^2:6
```

```
##  [1] 100  99  98  97  96  95  94  93  92  91  90  89  88  87  86  85  84  83  82
## [20]  81  80  79  78  77  76  75  74  73  72  71  70  69  68  67  66  65  64  63
## [39]  62  61  60  59  58  57  56  55  54  53  52  51  50  49  48  47  46  45  44
## [58]  43  42  41  40  39  38  37  36  35  34  33  32  31  30  29  28  27  26  25
## [77]  24  23  22  21  20  19  18  17  16  15  14  13  12  11  10   9   8   7   6
```

--

What you want is:

```r
10^(2:6)
```

```
## [1] 1e+02 1e+03 1e+04 1e+05 1e+06
```

<div class="box-yellow">Notice how the second (and intended) code is actually easier to read than the first.</div>

---

# Numerical Approximations

Computers use binary switches to represent numbers. For integers this works great, but for floating point numbers it .red[sucks], big time. Consider the following:

```r
x <- .1
y <- .3 / 3
```

But if we ask if they are equal, what do you expect?

--

```r
x == y
```

```
## [1] FALSE
```

--

```r
print(x, digits=20)
```

```
## [1] 0.10000000000000000555
```

```r
print(y, digits=20)
```

```
## [1] 0.099999999999999991673
```

---

## 15 Minute Activity - Numerical Operations

Create an R script named `numerical_operators.R` in the project folder and answer the following questions.
Copy each of these questions as commented text into your script.

1. Define a variable named `temp` and set it to the temperature of this room. Did you use degrees Fahrenheit? Write the code to convert this to Celsius (or the other way around if you started in Celsius).
2. The function `rnorm(500)` will give you 500 random numbers from the normal probability distribution. Use it to assign these values to a variable named `theData`. Find the mean, variance, and standard deviation of these data (hint: `mean()`, `var()`, and `sd()` are what you are looking for; use the help function for these to learn more about them). Also try `summary()`.
3. Consider Dr. Dyer’s need for fresh [charcuterie](https://en.wikipedia.org/wiki/Charcuterie) in his life. Luckily, Richmond has a spectacular butcher in Carytown, [Belmont Butchery](http://belmontbutchery.com/). Below are the coordinates for both Dyer’s office and the purveyor of fine meat products, given in meters in the Virginia State Plane coordinate system (4502). Use your old friend, the Pythagorean theorem (shown a few slides ago), to figure out the distance between these two points. Present your results in miles.

```r
office <- c(3592374.948, 1134930.213)
belmont <- c(3590195.540, 1136003.201)
```

---
class: sectionTitle, inverse

# Data Frames! ![Yes](https://media.giphy.com/media/f6VfCFyOL5KmiICskp/giphy.gif)

---

# Data Frames & Related Materials

> Data frames are a structure that can hold many different data types in one simple structure.

Data frames are the *lingua franca* for `R`, especially once we start getting into more complicated analysis and manipulation. For simplicity, one can consider a `data.frame` object much like a spreadsheet. Each row represents a record for some object, and each column (which may hold a different kind of data) is a measurement on that object.

---
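To make the spreadsheet analogy for a `data.frame` concrete, here is a minimal sketch. The `trees` object and its values are hypothetical, invented only for illustration (they are not part of the course data). It holds a character, a numeric, and a logical column side by side, and `sapply()` asks each column for its class.

```r
# Hypothetical example data, invented for illustration only
trees <- data.frame(
  Species   = c( "Oak", "Pine", "Maple" ),   # character column
  Height    = c( 18.2, 24.5, 12.9 ),         # numeric column
  Deciduous = c( TRUE, FALSE, TRUE ),        # logical column
  stringsAsFactors = FALSE                   # keep text as character in older R versions
)

# Each row is one record; each column is one kind of measurement
nrow( trees )
ncol( trees )

# Ask every column for its class
column_types <- sapply( trees, class )
column_types
```

Setting `stringsAsFactors = FALSE` explicitly is a defensive choice here: it is already the default in `R` 4.0 and later, but older versions would turn the text column into a factor.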
---

# Introspection

```r
class( iris )
```

```
## [1] "data.frame"
```

```r
dim( iris )
```

```
## [1] 150   5
```

```r
summary( iris )
```

```
##   Sepal.Length    Sepal.Width     Petal.Length    Petal.Width   
##  Min.   :4.300   Min.   :2.000   Min.   :1.000   Min.   :0.100  
##  1st Qu.:5.100   1st Qu.:2.800   1st Qu.:1.600   1st Qu.:0.300  
##  Median :5.800   Median :3.000   Median :4.350   Median :1.300  
##  Mean   :5.843   Mean   :3.057   Mean   :3.758   Mean   :1.199  
##  3rd Qu.:6.400   3rd Qu.:3.300   3rd Qu.:5.100   3rd Qu.:1.800  
##  Max.   :7.900   Max.   :4.400   Max.   :6.900   Max.   :2.500  
##        Species  
##  setosa    :50  
##  versicolor:50  
##  virginica :50  
##                 
##                 
## 
```

---

# Properties of Data Frame Objects

```r
nrow(iris)
```

```
## [1] 150
```

```r
ncol(iris)
```

```
## [1] 5
```

```r
names(iris)
```

```
## [1] "Sepal.Length" "Sepal.Width"  "Petal.Length" "Petal.Width"  "Species"
```

---

# Accessing Internal Elements

Accessing elements within a `data.frame` can be done by grid position (row, col) or by column name. Here is an example showing the third entry in the `Petal.Width` column using numerical coordinates:

```r
iris[3,4]
```

```
## [1] 0.2
```

--

And by column name:

```r
iris$Petal.Width[3]
```

```
## [1] 0.2
```

### ☝️ Column notation is much more readable

---

# Creating Raw Data Frames

Data frames can hold different kinds of data types in a grid-like format. *Rows* are records for observations and *Columns* represent individual measurements on each object.

```r
site <- c( "Const","ESan", "Aqu")
longitude <- c( -111.675, -110.3686, -110.1043)
latitude <- c(25.0247, 24.45879, 23.2855)
```

--

```r
sites <- data.frame( Site = site, Longitude = longitude, Latitude = latitude )
class( sites )
```

```
## [1] "data.frame"
```

```r
dim( sites )
```

```
## [1] 3 3
```

```r
names( sites ) # shorthand for colnames
```

```
## [1] "Site"      "Longitude" "Latitude"
```

---

# Viewing Data Frame Objects

If the data are small enough, we can view them all by printing out the elements. It is also possible to have each column of data summarize itself.
.pull-left[

```r
sites
```

```
##    Site Longitude Latitude
## 1 Const -111.6750 25.02470
## 2  ESan -110.3686 24.45879
## 3   Aqu -110.1043 23.28550
```

`RStudio` has a built-in spreadsheet-style viewer if you need to make quick observations.

```r
View(sites)
```

]

--

.pull-right[

```r
summary( sites )
```

```
##      Site             Longitude         Latitude    
##  Length:3           Min.   :-111.7   Min.   :23.29  
##  Class :character   1st Qu.:-111.0   1st Qu.:23.87  
##  Mode  :character   Median :-110.4   Median :24.46  
##                     Mean   :-110.7   Mean   :24.26  
##                     3rd Qu.:-110.2   3rd Qu.:24.74  
##                     Max.   :-110.1   Max.   :25.02
```

]

---

# Tibbles

The `tidyverse` extends a `data.frame` by giving it more functionality. This is *largely invisible* to us, because any time we use functions from the `tidyverse`, they do the conversions automatically.

```r
library( tidyverse )
```

```
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
```

```
## ✓ ggplot2 3.3.5     ✓ purrr   0.3.4
## ✓ tibble  3.1.6     ✓ dplyr   1.0.7
## ✓ tidyr   1.1.4     ✓ stringr 1.4.0
## ✓ readr   2.1.1     ✓ forcats 0.5.1
```

```
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
```

---

# Loading Data

We can load data from local files (on your computer), from databases (local or external), or from any location we can reach through a fully qualified domain name (e.g., a URL).

```r
url <- "https://raw.githubusercontent.com/dyerlab/ENVS-Lectures/master/data/arapat.csv"
```

--

```r
samples <- read_csv( url )
```

```
## Rows: 39 Columns: 3
```

```
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): Stratum
## dbl (2): Longitude, Latitude
```

```
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
```

---

# Showing the Data

```r
summary( samples )
```

```
##    Stratum            Longitude         Latitude    
##  Length:39          Min.   :-114.3   Min.   :23.08  
##  Class :character   1st Qu.:-112.9   1st Qu.:24.52  
##  Mode  :character   Median :-111.5   Median :26.21  
##                     Mean   :-111.7   Mean   :26.14  
##                     3rd Qu.:-110.4   3rd Qu.:27.47  
##                     Max.   :-109.1   Max.   :29.33
```

--

Since `read_csv()` produces a tibble itself as output (as do the `tidyverse` functions that return tabular data), there is no need to convert it from being a vanilla `data.frame`.

```r
class( samples )
```

```
## [1] "spec_tbl_df" "tbl_df"      "tbl"         "data.frame"
```

---

# Sizes of Data Objects

Both `data.frame` and `tibble` objects have a number of rows and columns that make up their dimensions.

```r
nrow( samples )
```

```
## [1] 39
```

```r
ncol( samples )
```

```
## [1] 3
```

```r
dim( samples )
```

```
## [1] 39  3
```

```r
names( samples )
```

```
## [1] "Stratum"   "Longitude" "Latitude"
```

---

.pull-left[

# Visualizing Data

One of the first things I like to do is to look at the data that are being imported and see if there are any obvious problems. These data have spatial coordinates for sites, so here I'll map them interactively (we'll get to this tomorrow).

]

.pull-right[
]

---

# Small Items - Skipping Metadata

Sometimes there are metadata rows at the top that must be skipped. Imagine a data file that has the following contents (not too uncommon among people who harbor a mild grudge against most data analysts...)

```
Collected on 7 September 2021
By RJ Dyer
Site, Longitude , Latitude
Const, -111.6750, 25.02470
ESan, -110.3686, 24.45879
Aqu, -110.1043, 23.28550
```

---

```r
read_csv( "Collected on 7 September 2021
By RJ Dyer
Site, Longitude , Latitude
Const, -111.6750, 25.02470
ESan, -110.3686, 24.45879
Aqu, -110.1043, 23.28550" , skip=2)
```

```
## Rows: 3 Columns: 3
```

```
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): Site
## dbl (2): Longitude, Latitude
```

```
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
```

```
## # A tibble: 3 × 3
##   Site  Longitude Latitude
##   <chr>     <dbl>    <dbl>
## 1 Const     -112.     25.0
## 2 ESan      -110.     24.5
## 3 Aqu       -110.     23.3
```

(Note: I'm just passing a multiline string to the `read_csv()` function; `skip=2` drops the two metadata lines before the header.)

---

# No Column Names 🛑

```r
read_csv( "Const, -111.6750, 25.02470
ESan, -110.3686, 24.45879
Aqu, -110.1043, 23.28550")
```

```
## Rows: 2 Columns: 3
```

```
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): Const
## dbl (2): -111.6750, 25.02470
```

```
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
```

```
## # A tibble: 2 × 3
##   Const `-111.6750` `25.02470`
##   <chr>       <dbl>      <dbl>
## 1 ESan        -110.       24.5
## 2 Aqu         -110.       23.3
```

---

# No Column Names 👍🏾

```r
read_csv( "Const, -111.6750, 25.02470
ESan, -110.3686, 24.45879
Aqu, -110.1043, 23.28550" , col_names = FALSE)
```

```
## Rows: 3 Columns: 3
```

```
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): X1
## dbl (2): X2, X3
```

```
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
```

```
## # A tibble: 3 × 3
##   X1       X2    X3
##   <chr> <dbl> <dbl>
## 1 Const -112.  25.0
## 2 ESan  -110.  24.5
## 3 Aqu   -110.  23.3
```

---

# Adding or Overriding Names

```r
read_csv( "Const, -111.6750, 25.02470
ESan, -110.3686, 24.45879
Aqu, -110.1043, 23.28550", col_names = c("Site","Longitude","Latitude") )
```

```
## Rows: 3 Columns: 3
```

```
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): Site
## dbl (2): Longitude, Latitude
```

```
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
```

```
## # A tibble: 3 × 3
##   Site  Longitude Latitude
##   <chr>     <dbl>    <dbl>
## 1 Const     -112.     25.0
## 2 ESan      -110.     24.5
## 3 Aqu       -110.     23.3
```

---

# Missing Data

```r
read_csv( "Site, Longitude , Latitude
Const, , 25.02470
ESan, -110.3686,
Aqu, -110.1043, 23.28550") -> df
```

```
## Rows: 3 Columns: 3
```

```
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): Site
## dbl (2): Longitude, Latitude
```

```
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
```

```r
df
```

```
## # A tibble: 3 × 3
##   Site  Longitude Latitude
##   <chr>     <dbl>    <dbl>
## 1 Const       NA      25.0
## 2 ESan      -110.     NA  
## 3 Aqu       -110.     23.3
```

`NA` is a valid data value!
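One way to see where the missing values sit is the base `R` function `is.na()`. The sketch below is only illustrative: it uses a small stand-in vector named `longitude` (a hypothetical copy of the `Longitude` values shown above) rather than the `df` object itself, so it runs without loading any packages.

```r
# Stand-in for the Longitude column above; the first entry is missing
longitude <- c( NA, -110.3686, -110.1043 )

# TRUE wherever a value is missing
missing <- is.na( longitude )
missing

# Count the missing values (TRUE counts as 1)
sum( missing )

# Keep only the values that are present
longitude[ !missing ]
```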
---

# Dealing with `NA` values

The absence of data, `NA`, is important, and `R` makes a big deal about warning you when you have missing data so you do not make improper inferences.

```r
df$Longitude
```

```
## [1]        NA -110.3686 -110.1043
```

```r
mean( df$Longitude )
```

```
## [1] NA
```

--

**By default**, `R` does not *assume* that you want to ignore the missing data; you **must** tell it to do so each time.

```r
mean( df$Longitude, na.rm=TRUE )
```

```
## [1] -110.2364
```

---

# Non-Traditional Missing Data

```r
read_csv( "Site, Longitude , Latitude
Const, -9 , 25.02470
ESan, -110.3686, -9
Aqu, -110.1043, 23.28550", na="-9")
```

```
## Rows: 3 Columns: 3
```

```
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): Site
## dbl (2): Longitude, Latitude
```

```
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
```

```
## # A tibble: 3 × 3
##   Site  Longitude Latitude
##   <chr>     <dbl>    <dbl>
## 1 Const       NA      25.0
## 2 ESan      -110.     NA  
## 3 Aqu       -110.     23.3
```

---

# Slicing and Dicing

When we take a slice of data from a `tibble` (or `data.frame`), how we ask for it may determine the nature of what is returned to us.

```r
summary( samples )
```

```
##    Stratum            Longitude         Latitude    
##  Length:39          Min.   :-114.3   Min.   :23.08  
##  Class :character   1st Qu.:-112.9   1st Qu.:24.52  
##  Mode  :character   Median :-111.5   Median :26.21  
##                     Mean   :-111.7   Mean   :26.14  
##                     3rd Qu.:-110.4   3rd Qu.:27.47  
##                     Max.   :-109.1   Max.   :29.33
```

```r
class( samples$Stratum )
```

```
## [1] "character"
```

```r
class( samples[,1])
```

```
## [1] "tbl_df"     "tbl"        "data.frame"
```

---

# Subsets by Position

```r
samples[1:10, ]
```

```
## # A tibble: 10 × 3
##    Stratum Longitude Latitude
##    <chr>       <dbl>    <dbl>
##  1 88          -114.     29.3
##  2 9           -114.     29.0
##  3 84          -114.     29.0
##  4 175         -113.     28.7
##  5 177         -114.     28.7
##  6 173         -113.     28.4
##  7 171         -113.     28.2
##  8 89          -113.     28.0
##  9 159         -113.     27.5
## 10 SFr         -113.     27.4
```

---

# Filtering By Logic

So far, we've used the actual row numbers to grab data from the tibble. We can also use logic based upon data within the table itself. Remember that a relational operator will return `TRUE` or `FALSE`, and we can use that to filter the whole thing. Here is how we'd find all data where the longitude was greater than -110.

```r
samples[ samples$Longitude > -110,]
```

```
## # A tibble: 5 × 3
##   Stratum Longitude Latitude
##   <chr>       <dbl>    <dbl>
## 1 156         -110.     24.0
## 2 73          -110.     24.0
## 3 98          -110.     23.1
## 4 32          -109.     26.6
## 5 102         -109.     26.4
```

Notice the column designation is left blank (so we are getting all of them).

---

# Filtering Individual Data

We can use some of the fancy string stuff we learned previously to pull out only the names of the sites that match a certain regular expression (here they must start with either `C`, `E`, or `S`). Using the `$` notation returns the results as a vector.

```r
samples$Stratum[ str_detect( samples$Stratum, "^[CES]") ]
```

```
## [1] "SFr"   "Const" "ESan"
```

--

But using the square bracket notation (giving the rows and indicating numerically which column) returns the result as a tibble.

```r
samples[str_detect( samples$Stratum, "^[CES]"),1]
```

```
## # A tibble: 3 × 1
##   Stratum
##   <chr>  
## 1 SFr    
## 2 Const  
## 3 ESan
```

---

# Adding New Data Columns

Adding new columns always appends them onto the right side of the tibble.

```r
samples$ID <- 1:39
samples
```

```
## # A tibble: 39 × 4
##    Stratum Longitude Latitude    ID
##    <chr>       <dbl>    <dbl> <int>
##  1 88          -114.     29.3     1
##  2 9           -114.     29.0     2
##  3 84          -114.     29.0     3
##  4 175         -113.     28.7     4
##  5 177         -114.     28.7     5
##  6 173         -113.     28.4     6
##  7 171         -113.     28.2     7
##  8 89          -113.     28.0     8
##  9 159         -113.     27.5     9
## 10 SFr         -113.     27.4    10
## # … with 29 more rows
```

---

# Changing Individual Values

.pull-left[

By column variable name.

```r
samples$ID[2] <- 42
samples
```

```
## # A tibble: 39 × 4
##    Stratum Longitude Latitude    ID
##    <chr>       <dbl>    <dbl> <dbl>
##  1 88          -114.     29.3     1
##  2 9           -114.     29.0    42
##  3 84          -114.     29.0     3
##  4 175         -113.     28.7     4
##  5 177         -114.     28.7     5
##  6 173         -113.     28.4     6
##  7 171         -113.     28.2     7
##  8 89          -113.     28.0     8
##  9 159         -113.     27.5     9
## 10 SFr         -113.     27.4    10
## # … with 29 more rows
```

]

--

.pull-right[

By index coordinate.

```r
samples[2,4] <- 24
samples
```

```
## # A tibble: 39 × 4
##    Stratum Longitude Latitude    ID
##    <chr>       <dbl>    <dbl> <dbl>
##  1 88          -114.     29.3     1
##  2 9           -114.     29.0    24
##  3 84          -114.     29.0     3
##  4 175         -113.     28.7     4
##  5 177         -114.     28.7     5
##  6 173         -113.     28.4     6
##  7 171         -113.     28.2     7
##  8 89          -113.     28.0     8
##  9 159         -113.     27.5     9
## 10 SFr         -113.     27.4    10
## # … with 29 more rows
```

]

---

# Forced Coercion

```r
samples$ID[2] <- "Bob"
samples
```

```
## # A tibble: 39 × 4
##    Stratum Longitude Latitude ID   
##    <chr>       <dbl>    <dbl> <chr>
##  1 88          -114.     29.3 1    
##  2 9           -114.     29.0 Bob  
##  3 84          -114.     29.0 3    
##  4 175         -113.     28.7 4    
##  5 177         -114.     28.7 5    
##  6 173         -113.     28.4 6    
##  7 171         -113.     28.2 7    
##  8 89          -113.     28.0 8    
##  9 159         -113.     27.5 9    
## 10 SFr         -113.     27.4 10   
## # … with 29 more rows
```

---

# Deleting Content

.pull-left[

Individual values in a column can be deleted by assigning them `NA`, a missing value. The *Recycle Rule* we saw above will repeat the `NA` throughout the whole column.

```r
samples$ID <- NA
samples
```

```
## # A tibble: 39 × 4
##    Stratum Longitude Latitude ID   
##    <chr>       <dbl>    <dbl> <lgl>
##  1 88          -114.     29.3 NA   
##  2 9           -114.     29.0 NA   
##  3 84          -114.     29.0 NA   
##  4 175         -113.     28.7 NA   
##  5 177         -114.     28.7 NA   
##  6 173         -113.     28.4 NA   
##  7 171         -113.     28.2 NA   
##  8 89          -113.     28.0 NA   
##  9 159         -113.     27.5 NA   
## 10 SFr         -113.     27.4 NA   
## # … with 29 more rows
```

]

.pull-right[

To delete the column entirely, instead of just setting all of its elements to missing, set the whole column equal to `NULL`.

```r
samples$ID <- NULL
samples
```

```
## # A tibble: 39 × 3
##    Stratum Longitude Latitude
##    <chr>       <dbl>    <dbl>
##  1 88          -114.     29.3
##  2 9           -114.     29.0
##  3 84          -114.     29.0
##  4 175         -113.     28.7
##  5 177         -114.     28.7
##  6 173         -113.     28.4
##  7 171         -113.     28.2
##  8 89          -113.     28.0
##  9 159         -113.     27.5
## 10 SFr         -113.     27.4
## # … with 29 more rows
```

]

---

# Adding Rows of Content

To add additional rows of content, we need to put the new data into their own `data.frame` or `tibble`.

```r
tibble( Stratum = c("Los Barriles","Comondu"),
        Longitude = c(-109.7026, -111.8442),
        Latitude = c(23.6811, 26.0708) ) -> newSites
newSites
```

```
## # A tibble: 2 × 3
##   Stratum      Longitude Latitude
##   <chr>            <dbl>    <dbl>
## 1 Los Barriles     -110.     23.7
## 2 Comondu          -112.     26.1
```

---

# Adding Rows of Content

And then bind it onto the existing data with `rbind()`.

```r
samples <- rbind( samples, newSites)
tail( samples )
```

```
## # A tibble: 6 × 3
##   Stratum      Longitude Latitude
##   <chr>            <dbl>    <dbl>
## 1 98               -110.     23.1
## 2 101              -111.     27.9
## 3 32               -109.     26.6
## 4 102              -109.     26.4
## 5 Los Barriles     -110.     23.7
## 6 Comondu          -112.     26.1
```

---

# Deleting Rows

To delete rows, you use negative row indices.

```r
dim(samples)
```

```
## [1] 41  3
```

```r
samples <- samples[-41:-39,]
dim(samples)
```

```
## [1] 38  3
```

Notice: For all of this "add on" and "delete" stuff, if we want it to **persist** we .red[must] reassign the values back onto the original variable.

---

# Real Names

While not quite critical here, we often need to use more descriptive names for our data columns, some of which need spaces to be fully descriptive. One of the last benefits of a `tibble` I'll discuss here is that it allows for spaces in the names of data columns.
```r
names(samples)
```

```
## [1] "Stratum"   "Longitude" "Latitude"
```

```r
names( samples )[1] <- "Population Name"
samples
```

```
## # A tibble: 38 × 3
##    `Population Name` Longitude Latitude
##    <chr>                 <dbl>    <dbl>
##  1 88                    -114.     29.3
##  2 9                     -114.     29.0
##  3 84                    -114.     29.0
##  4 175                   -113.     28.7
##  5 177                   -114.     28.7
##  6 173                   -113.     28.4
##  7 171                   -113.     28.2
##  8 89                    -113.     28.0
##  9 159                   -113.     27.5
## 10 SFr                   -113.     27.4
## # … with 28 more rows
```

---

# Accessing Spaced Out Columns

`RStudio` will autocomplete all valid column names for you if you hit the tab key. However, if you are doing it manually, surround the name of the data column in backticks (the backtick is the character on the upper left corner of your keyboard).

```r
samples$`Population Name`
```

```
##  [1] "88"    "9"     "84"    "175"   "177"   "173"   "171"   "89"    "159"  
## [10] "SFr"   "160"   "162"   "12"    "161"   "93"    "165"   "169"   "58"   
## [19] "166"   "64"    "168"   "51"    "Const" "77"    "164"   "75"    "163"  
## [28] "ESan"  "153"   "48"    "156"   "157"   "73"    "Aqu"   "Mat"   "98"   
## [37] "101"   "32"
```

---

## BIG ACTIVITY - Your DATA!

Create a new R script called `data_frames.R` in the project folder.

1. At the top of the file, insert the line `library(tidyverse)` so that the script will load in the proper libraries.
2. Use the function `read_csv()` to load in the file as a variable named something that is meaningful for you. The file *should be* in the same folder as the script (if you followed the instructions from the very first lecture), so you only have to give the name of the csv file (in quotes, since it is a character value).
3. What are the names of the columns of data in this object? What do they actually measure?
4. How many measurements are taken at each record and how many records are present?
5. How deep was the deepest measurement? What was the coldest temperature?
6. Assign the depth measurement to a variable named `depth` and do the same for the `Do_Optical` data (but of course name it something else appropriate).
   What are the mean values for each of these variables?
7. **BONUS** The function `plot(x,y)` will make a quick scatter plot of two variables with the variables in `x` on the (wait for it) x-axis and those in the variable `y` on the y-axis. Use the data from the last question to make the magnificent display of `Do_Optical` as a function of depth.

---
class: middle
background-image: url("images/contour.png")
background-position: right
background-size: auto

.center[

![## Any Questions](https://media.giphy.com/media/G0vYU697uKl0IiIJO2/giphy.gif)

## Ask away!

]