---
title: "Data Frames"
output: 
  html_notebook:
    css: "envs543-styles.css"
---

> Data frames are a structure that can hold many different data types in one simple structure.

Data frames are the *lingua franca* for `R`, especially once we start getting into more complicated analysis and manipulation.  For simplicity, one can consider a `data.frame` object much like a spreadsheet.  Each row represents a record on some object, and each column holds one measurement on that object.  Different columns can hold different kinds of data.

By and large, you will load your data into these structures and pass them to functions for plotting, simulation, etc.  It is a good idea to get used to these.


## Creating `data.frame` Objects

To make a `data.frame` we need to either:

  1. Make it from scratch using vectors (the more complicated way).
  2. Load it from a file (perhaps the easiest way).

We shall start with the more complicated case.  

### Creating `data.frame` Objects *de novo*

```{r}
site <- c( "Const","ESan", "Aqu")
longitude <- c( -111.675, -110.3686, -110.1043)
latitude <- c(25.0247, 24.45879, 23.2855)

sites <- data.frame( Site = site,
                     Longitude = longitude,
                     Latitude = latitude )
sites
```

The thing about `data.frame` objects is that they know how to organize and summarize their component variables.

```{r}
summary(sites)
```

Notice that each variable is summarized in a way that gives as much information as possible about the contents of that variable. The `summary()` function is critical and one we will use over and over again.
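
As a complement to `summary()`, the sketch below checks the type of each column directly. The `toy` data frame and its column names are hypothetical and exist only for this illustration. `stringsAsFactors = FALSE` is set explicitly so the result is the same across `R` versions.

```{r}
# Hypothetical toy data, not part of the data used elsewhere in this document
toy <- data.frame( Site = c("A", "B", "C"),
                   Elevation = c(12.5, 40.0, 7.25),
                   Visited = c(TRUE, FALSE, TRUE),
                   stringsAsFactors = FALSE )

# str() gives a compact listing: one line per column with its type
str( toy )

# sapply() applies class() to each column and returns a named character vector
column_types <- sapply( toy, class )
column_types
```

The type of each column is what drives how `summary()` chooses to describe it: quantiles for numeric columns, length and class for character columns.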

### Loading from File and/or URL

You can also load a `data.frame` from a local file or from a URL to a file somewhere on the internet.  The easiest way to do this is to save your file in CSV (Comma-Separated-Value) format.  This is usually a $File \to Save\;As$ kind of routine.  It is important for reproducibility and collaboration to 1) have only one version of your data set, and 2) keep it in a format that is accessible, such as plain text.

Here is the full version of the data we built by hand above.  It is located on a [GitHub](https://github.com/dyerlab/ENVS-Lectures) website and is accessible to anyone with an internet connection.

```{r}
url <- "https://raw.githubusercontent.com/dyerlab/ENVS-Lectures/master/data/arapat.csv"
```

Once we have the url or file as a `character` object, we can use the built-in `read.csv()` function to load the data into a local variable.  Here is what that looks like.

```{r}
Samples <- read.csv(url)
summary( Samples )
```
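
The same call works for a file on your own machine.  The sketch below does not assume any particular file on disk: it writes the three sites from the *de novo* example to a temporary CSV and reads it back.  The names `local_file`, `example_sites`, and `from_disk` are made up for this illustration.

```{r}
# A temporary file standing in for a CSV you saved from a spreadsheet program
local_file <- tempfile( fileext = ".csv" )
example_sites <- data.frame( Site = c("Const", "ESan", "Aqu"),
                             Longitude = c(-111.675, -110.3686, -110.1043),
                             Latitude = c(25.0247, 24.45879, 23.2855),
                             stringsAsFactors = FALSE )

# row.names = FALSE keeps the row numbers from being written as an extra column
write.csv( example_sites, file = local_file, row.names = FALSE )

# read.csv() takes a local path in the same way it takes a URL
from_disk <- read.csv( local_file, stringsAsFactors = FALSE )
dim( from_disk )
```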



## Working with `data.frame` Objects

A variable representing a `data.frame` has properties and constituents just like a vector.  The size of the `data.frame` is defined as the `dim`ensions of it and represents the number of *rows* and *columns* it contains.

```{r}
nrow( Samples )
ncol( Samples )
```

Alternatively, use the `dim()` function, which returns both of them.

```{r}
dim( Samples )
```


Each variable within the `data.frame` is represented by the name we gave it when we added it to the `data.frame`.

```{r}
names( Samples )
```

The individual elements are available using the square bracket notation, just like for [vectors](../vectors/narriative.nb.html) *with the exception that* we have two coordinates instead of just one.  The first coordinate is for *row* and the second is for *column*.  So to grab the 10$^{th}$ `Stratum` name (row), which is the 1$^{st}$ variable (column) in the `data.frame`, we use:

```{r}
Samples[10,1]
```

While this is OK, it assumes we remember that the `Stratum` variable is the first variable in the `data.frame`.  If you have data sets with thousands of columns of data in them, it becomes less-than-optimal to use variable numbers to index, which is why we can use the variable name itself in the following format.

```{r}
Samples$Stratum[10]
```

Here, we use the name of the `data.frame` connected to the name of the variable using the dollar sign (\$) to hook them up.  

<div class="box-yellow">For readability, it is advised that you use the dollar-sign notation to get and set values within a `data.frame`.  It will always be more clear to you and anyone who reads your code if you are as verbose as you can be in your code.  Trust me, your *future self* will like you much better if you adopt this habit as soon as possible.</div>
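
To make the comparison concrete, here is a small sketch on a hypothetical data frame (`pts` and its values are invented for this example) that reaches the same element by position, by column name inside the brackets, and with the dollar sign.

```{r}
# Hypothetical three-row data frame used only for this illustration
pts <- data.frame( Stratum = c("A", "B", "C"),
                   Latitude = c(23.1, 24.2, 25.3),
                   stringsAsFactors = FALSE )

by_position <- pts[2, 2]           # row 2, column 2
by_name     <- pts[2, "Latitude"]  # row 2, column named "Latitude"
by_dollar   <- pts$Latitude[2]     # the Latitude vector, then element 2

# All three expressions pull out the same single value
c( by_position, by_name, by_dollar )
```

The two name-based forms keep working if the columns are later reordered, which is the main argument for preferring them.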


As with [vectors](../vectors/narriative.nb.html), we can grab the head and tail of the whole data frame.

```{r}
head( Samples )
```

```{r}
tail(Samples, n=4)
```


We can use the same approach for setting values within a `data.frame`.  Here I'll expand the *SFr* abbreviation to the full location name.

```{r}
Samples$Stratum[10] <- "San Francisquito"
head( Samples, n=10)
```

<div class="box-red">
**ALERT:**  If you received a warning message with something like 
```
Warning message:
In `[<-.factor`(`*tmp*`, 10, value = c(30L, 32L, 29L, 19L, 20L,  :
  invalid factor level, NA generated
```
It is because, in versions of `R` before 4.0.0, the `read.csv()` function has a default value that converts any column with a `character` variable type into a `factor` data type.  This is a **stupid** default option, and it was changed in `R` 4.0.0.  On older versions you can change the default value in your `R` options file (see [r environment](../environment/narriative.nb.html)), or remember that *every time you load in data* you must add the optional argument `stringsAsFactors=FALSE` to the `read.csv()` function call.

```{r eval=FALSE}
Samples <- read.csv(url, stringsAsFactors=FALSE)
```

Run that and try to replace the Stratum name as above.
</div>
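
If a column has already come in as a `factor`, one way around the problem is to convert it back to `character` before assigning.  This is a minimal sketch on made-up data: `old_style` is a hypothetical data frame built with `stringsAsFactors = TRUE` to mimic the older default.

```{r}
# Hypothetical data, built with stringsAsFactors = TRUE to mimic the old default
old_style <- data.frame( Stratum = c("SFr", "Aqu", "SFr"),
                         stringsAsFactors = TRUE )
class( old_style$Stratum )

# Converting the column to character lets us assign an arbitrary new name
old_style$Stratum <- as.character( old_style$Stratum )
old_style$Stratum[1] <- "San Francisquito"
old_style$Stratum
```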



### Slices

Just like for [vectors](../vectors/narriative.nb.html), we can also take slices of components from `data.frame` objects.  However, what we get in return depends on specifically what we ask for.  

For example, taking from a single column in the `data.frame` returns a slice of that particular variable type.

```{r}
x <- Samples[1:3, 1]
x
class(x)
```

This is because all the returned values are from a single data column, which by definition *must* be the same kind of data.  

You can ask `R` to return it as a `data.frame` rather than just a vector (in case you need to have a data.frame to work with) by passing an optional argument to the square brackets.  You will notice it retains the row numbers.

```{r}
x1 <- Samples[ 1:3, 1, drop=FALSE]
x1
class(x1)
```


However, if we take the same number of objects but from a row instead of a column, we can *only* get a `data.frame` object in return.

```{r}
y <- Samples[1, 1:3]
y
class(y)
```

This is because, this time, we are asking for the three variables representing the first record.  Because these values are being drawn from across different columns of data, `R` does not concatenate them into a single vector and coerce them to a common data type.  Rather, it makes a new `data.frame` object for you and returns that.
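
To see what that coercion would look like if you forced it, the sketch below flattens a row with `unlist()`.  The `mixed` data frame is hypothetical and only serves this illustration.

```{r}
# Hypothetical data frame with one character and one numeric column
mixed <- data.frame( Name = c("A", "B"),
                     Value = c(1.5, 2.5),
                     stringsAsFactors = FALSE )

one_row <- mixed[1, ]            # still a data.frame, column types preserved
class( one_row )
sapply( one_row, class )

# unlist() forces the row into one vector, so the number becomes text
flattened <- unlist( one_row )
class( flattened )
```

Keeping the row as a `data.frame` is what lets `Value` stay numeric.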


### Modifying Existing Data Frames

Once we have a `data.frame` in memory, we can also add to it.  If we are adding columns, we can append a single variable onto it and `R` will put it on the far right side of the `data.frame`.

```{r}
Samples$ID <- 1:nrow(Samples)
summary( Samples )
```


If we want to add rows to the `data.frame`, it is a bit more involved because we have to supply either a list object whose elements are in the same order as the variables of the original `data.frame`, or another entire `data.frame` (whose names are identical to the original).

```{r}
names(Samples)
```

To add a single row, we use `rbind()` and pass it the `data.frame` and a new [list](../lists/narriative.nb.html) object.

```{r}
Samples <- rbind( Samples, list("Los Cabos",-109.7124, 23.0799, 40))
tail(Samples)
```

You can also use another `data.frame`, which in this case must have the same named variables as the original.  

```{r}
moresites <- data.frame( ID = 41:42,
                         Stratum = c("Los Barriles","Comondu"),
                         Longitude = c(-109.7026, -111.8442),
                         Latitude = c(23.6811, 26.0708) 
                         )
names(moresites)
```

However, notice that the order of the variables in `moresites` is different from the order in `Samples`.  The `rbind()` function matches the variables by name and rearranges them when it concatenates the new rows.

```{r}
Samples <- rbind( Samples, moresites)
tail( Samples )
```
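
The name matching is easier to see on a small case.  In this sketch, `first` and `second` are hypothetical data frames with the same column names in a different order.

```{r}
# Two hypothetical data frames with the same column names in different order
first  <- data.frame( ID = 1:2, Site = c("A", "B"), stringsAsFactors = FALSE )
second <- data.frame( Site = "C", ID = 3L, stringsAsFactors = FALSE )

# rbind() lines columns up by name and keeps the order of the first argument
combined <- rbind( first, second )
names( combined )
combined$Site
```

Note that this matching applies when both arguments are data frames.  A plain unnamed list, as in the single-row example above, is matched by position instead.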

To delete a column of data, assign it the value of `NULL`.

```{r}
Samples$ID <- NULL
summary(Samples)
```

Alternatively, use negative indices to delete rows.

```{r}
Samples <- Samples[ -42:-40, ]
summary(Samples)
```
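
Both kinds of deletion can be tried safely on a throwaway object first.  The `scratch` data frame below is hypothetical and unrelated to `Samples`.

```{r}
# Hypothetical five-row data frame used only for this illustration
scratch <- data.frame( ID = 1:5, Value = c(10, 20, 30, 40, 50) )

# Negative row indices drop those rows; the empty column slot keeps all columns
trimmed <- scratch[ -(4:5), ]
dim( trimmed )

# Assigning NULL removes the named column
trimmed$Value <- NULL
names( trimmed )
```

Writing the rows as `-(4:5)` is equivalent to the `-5:-4` style used above; the parentheses make it clearer that the whole range is negated.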

---

The `data.frame` object is a fundamental component of your `R` workflow, and being able to manipulate data within it and extract data from it is a huge part of becoming data literate.
