Vectors and data frames are the foundation of data analysis in R. Nearly everything we work with will be stored in one of these container types, so it is important to become comfortable with how to access and set data in these structures.
In this and most of the following homework and presentations, I will use the generic term “data frame” to indicate a suite of data that has several records and individual measurements on each record. These will typically be tibble objects rather than the older data.frame ones unless otherwise stated. As such, we need to load tidyverse at the beginning.
library( tidyverse )
── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
✓ ggplot2 3.3.5 ✓ purrr 0.3.4
✓ tibble 3.1.4 ✓ dplyr 1.0.7
✓ tidyr 1.1.3 ✓ stringr 1.4.0
✓ readr 2.0.1 ✓ forcats 0.5.1
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
x dplyr::filter() masks stats::filter()
x dplyr::lag() masks stats::lag()
In the chunk below, create three vectors of data: one for names, one for age, and another for grade.
# vectors of different types
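One possible way to fill in this chunk is sketched below. The values are made up for illustration, and the variable names name, age, and grade are my own assumptions; any three vectors of equal length would work.

```r
# Hypothetical, made-up values: three vectors of equal length.
name  <- c("Alice", "Bob", "Carla", "Dev")   # character vector
age   <- c(21, 23, 22, 25)                   # numeric vector
grade <- c("A", "B", "A", "C")               # character vector

# Check the type and length of each vector.
class(name)
class(age)
length(grade)
```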
Now, create a tibble from these vectors.
# tibble
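A minimal sketch of this step, assuming three made-up vectors called name, age, and grade (hypothetical names and values), passes them to tibble() as named columns:

```r
library(tibble)

# Hypothetical, made-up vectors of equal length.
name  <- c("Alice", "Bob", "Carla", "Dev")
age   <- c(21, 23, 22, 25)
grade <- c("A", "B", "A", "C")

# Each argument becomes one column; the argument name is the column name.
students <- tibble(name = name, age = age, grade = grade)
students
```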
Add a new column of data to this data frame (you can make it up, which is what I did with the data above…).
## Add new column
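One way to add a column is to assign a vector of the right length with $. The sketch below rebuilds a small made-up tibble first so that it runs on its own; the students object and the major column are hypothetical.

```r
library(tibble)

# Hypothetical, made-up data frame.
students <- tibble(
  name  = c("Alice", "Bob", "Carla", "Dev"),
  age   = c(21, 23, 22, 25),
  grade = c("A", "B", "A", "C")
)

# Assign a new column; its length must match the number of rows.
students$major <- c("Biology", "Math", "Ecology", "Chemistry")
students
```

The same result can be obtained with dplyr::mutate(), which is often more convenient when the new column is computed from existing ones.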
Add a new row of data to the data.frame and then summarize it.
## Add new row
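As a sketch, tibble's add_row() appends a record when given one value per column, and summary() then reports per-column summaries. The students object and its values are made up for illustration.

```r
library(tibble)

# Hypothetical, made-up data frame.
students <- tibble(
  name  = c("Alice", "Bob", "Carla", "Dev"),
  age   = c(21, 23, 22, 25),
  grade = c("A", "B", "A", "C")
)

# Append one record; the argument names must match existing columns.
students <- add_row(students, name = "Elena", age = 24, grade = "B")

# Summarize every column of the enlarged data frame.
summary(students)
```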
In reality, we spend very little time working with data as small as this, so let’s jump into a slightly larger data set. There is a built-in data set measuring air quality in New York entitled airquality (I know, tricky right?). It is available on every stock installation of R, and this is what it looks like.
summary( airquality )
Ozone Solar.R Wind Temp
Min. : 1.00 Min. : 7.0 Min. : 1.700 Min. :56.00
1st Qu.: 18.00 1st Qu.:115.8 1st Qu.: 7.400 1st Qu.:72.00
Median : 31.50 Median :205.0 Median : 9.700 Median :79.00
Mean : 42.13 Mean :185.9 Mean : 9.958 Mean :77.88
3rd Qu.: 63.25 3rd Qu.:258.8 3rd Qu.:11.500 3rd Qu.:85.00
Max. :168.00 Max. :334.0 Max. :20.700 Max. :97.00
NA's :37 NA's :7
Month Day
Min. :5.000 Min. : 1.0
1st Qu.:6.000 1st Qu.: 8.0
Median :7.000 Median :16.0
Mean :6.993 Mean :15.8
3rd Qu.:8.000 3rd Qu.:23.0
Max. :9.000 Max. :31.0
Let’s make a copy of this built-in data set and turn it into a tibble.
data <- as_tibble( airquality )
summary( data )
Ozone Solar.R Wind Temp
Min. : 1.00 Min. : 7.0 Min. : 1.700 Min. :56.00
1st Qu.: 18.00 1st Qu.:115.8 1st Qu.: 7.400 1st Qu.:72.00
Median : 31.50 Median :205.0 Median : 9.700 Median :79.00
Mean : 42.13 Mean :185.9 Mean : 9.958 Mean :77.88
3rd Qu.: 63.25 3rd Qu.:258.8 3rd Qu.:11.500 3rd Qu.:85.00
Max. :168.00 Max. :334.0 Max. :20.700 Max. :97.00
NA's :37 NA's :7
Month Day
Min. :5.000 Min. : 1.0
1st Qu.:6.000 1st Qu.: 8.0
Median :7.000 Median :16.0
Mean :6.993 Mean :15.8
3rd Qu.:8.000 3rd Qu.:23.0
Max. :9.000 Max. :31.0
Manipulate the data frame in the following ways:
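Add a new column that holds a compound date built from the Month and Day columns.
## New Column for compound dates
Convert the temperature in data from Fahrenheit to Celsius.
# Convert F -> C
Rename the Temp column to Temperature °C and the Solar.R column to Solar Radiation in an attempt to practice what is known as ‘Literate Programming’.
# set proper names for columns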
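The manipulations above could be sketched with dplyr as follows. This is one possible approach, not the only one. The year 1973 comes from the airquality help page (?airquality); the column name Date is my own choice.

```r
library(dplyr)
library(tibble)

data <- as_tibble(airquality)

data <- data %>%
  mutate(
    # Build a Date from Month and Day; the measurements were taken in 1973.
    Date = as.Date(sprintf("1973-%02d-%02d", Month, Day)),
    # Convert Fahrenheit to Celsius: C = (F - 32) * 5 / 9.
    Temp = (Temp - 32) * 5 / 9
  ) %>%
  # rename() takes new_name = old_name; quote names that contain spaces.
  rename("Temperature °C" = Temp, "Solar Radiation" = Solar.R)

names(data)
```

Note that column names containing spaces or symbols must be wrapped in backticks when they are used later in code.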
And show your creation by using head() to reveal the first 6 rows.
### Show the first few rows
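A short sketch of this step, using a fresh copy of airquality so that it runs on its own:

```r
library(tibble)

data <- as_tibble(airquality)

# head() returns the first 6 rows by default; n sets a different count.
first_rows <- head(data)
first_rows
```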
OK, now that we have some data to work with, use that data frame to extract the following information and answer the following questions.
What were the hottest and coldest dates recorded in this data set?
How many of the days in the data recorded higher than the average wind speed?
How many rows of data are there with no missing values for any of the recorded observations?
On what days was the solar radiation greater than 300 Langleys in Central Park?
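The four questions above can be approached with logical indexing and a few dplyr verbs. The sketch below is one possible approach and uses the original airquality column names; if you renamed columns earlier, adjust the names accordingly. The Date column and the result variable names are my own assumptions.

```r
library(dplyr)
library(tibble)

data <- as_tibble(airquality) %>%
  mutate(Date = as.Date(sprintf("1973-%02d-%02d", Month, Day)))

# 1. Hottest and coldest dates (all tied dates are returned).
hottest <- data %>% filter(Temp == max(Temp)) %>% pull(Date)
coldest <- data %>% filter(Temp == min(Temp)) %>% pull(Date)

# 2. Number of days with wind speed above the average.
#    A logical vector sums as TRUE = 1, FALSE = 0.
n_windy <- sum(data$Wind > mean(data$Wind))

# 3. Number of rows with no missing values in any column.
n_complete <- sum(complete.cases(data))

# 4. Dates with solar radiation above 300 Langleys.
#    filter() drops rows where the comparison is NA.
sunny <- data %>% filter(Solar.R > 300) %>% pull(Date)

hottest
coldest
n_windy
n_complete
sunny
```

Wind and Temp have no missing values in this data set, so max(), min(), and mean() work without na.rm = TRUE; for Ozone or Solar.R you would need to add it.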