Factors allow us to represent a type of data that is exclusive and categorical. These data may or may not be ordered, and in most cases we can think of them as representing things like treatment levels, sampling locations, etc.
I’m going to start with some days of the week because they are exclusive (e.g., you cannot be in both Monday and Wednesday at the same time). Factors are typically created from character strings. You could use numeric data instead, but descriptive strings are a much better choice because they tell you what each level means.
weekdays <- c("Monday","Tuesday","Wednesday","Thursday","Friday","Saturday", "Sunday")
class( weekdays )
[1] "character"
I’m going to take these days and randomly sample them to create a vector of 40 elements. This is something we do all the time, and there is a sample() function that allows us to draw random samples either with or without replacement (i.e., whether or not the same value can be selected more than once).
data <- sample( weekdays, size=40, replace=TRUE)
data
[1] "Saturday" "Saturday" "Thursday" "Monday" "Thursday" "Tuesday" "Monday" "Thursday" "Monday" "Saturday" "Wednesday"
[12] "Thursday" "Tuesday" "Friday" "Friday" "Wednesday" "Saturday" "Sunday" "Friday" "Wednesday" "Saturday" "Wednesday"
[23] "Monday" "Monday" "Wednesday" "Monday" "Tuesday" "Tuesday" "Saturday" "Wednesday" "Wednesday" "Tuesday" "Saturday"
[34] "Sunday" "Thursday" "Sunday" "Wednesday" "Saturday" "Saturday" "Sunday"
These data are still character strings.
class( data )
[1] "character"
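Because sample() draws at random, running the code above again will give a different set of 40 days than the one printed here. Below is a minimal sketch of how one might make the draw reproducible and compare sampling with and without replacement. The seed value and the names dayNames, shuffled, and resampled are arbitrary choices for illustration.

```r
# Fix the state of the random number generator so every run gives the same draw.
set.seed( 42 )

dayNames <- c("Monday","Tuesday","Wednesday","Thursday","Friday","Saturday","Sunday")

# Without replacement (the default): each day appears exactly once, in random order.
shuffled <- sample( dayNames )

# With replacement: values may repeat, so we can draw more than 7 elements.
resampled <- sample( dayNames, size=40, replace=TRUE )

length( shuffled )
length( resampled )
```

Asking for 40 elements without replacement would fail here, since there are only 7 days to draw from.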
To turn them into a factor, we use factor().
days <- factor( data )
is.factor( days )
[1] TRUE
class( days )
[1] "factor"
Now when we look at the data, it looks a lot like it did before except for the last line which shows you the unique levels for elements in the vector.
days
[1] Saturday Saturday Thursday Monday Thursday Tuesday Monday Thursday Monday Saturday Wednesday Thursday Tuesday
[14] Friday Friday Wednesday Saturday Sunday Friday Wednesday Saturday Wednesday Monday Monday Wednesday Monday
[27] Tuesday Tuesday Saturday Wednesday Wednesday Tuesday Saturday Sunday Thursday Sunday Wednesday Saturday Saturday
[40] Sunday
Levels: Friday Monday Saturday Sunday Thursday Tuesday Wednesday
summary( days )
Friday Monday Saturday Sunday Thursday Tuesday Wednesday
3 6 9 4 5 5 8
We can put them into data frames and they know how to summarize themselves properly by counting the number of occurrences of each level.
df <- data.frame( ID = 1:40, Weekdays = days )
summary( df )
ID Weekdays
Min. : 1.00 Friday :3
1st Qu.:10.75 Monday :6
Median :20.50 Saturday :9
Mean :20.50 Sunday :4
3rd Qu.:30.25 Thursday :5
Max. :40.00 Tuesday :5
Wednesday:8
And we can directly access the unique levels
levels( days )
[1] "Friday" "Monday" "Saturday" "Sunday" "Thursday" "Tuesday" "Wednesday"
So factors can be categorical (e.g., one is just different from the next) and compared via == and !=. Or they can be ordinal, such that > and < make sense.
By default, a factor is not ordered.
is.ordered( days )
[1] FALSE
days[1] < days[2]
‘<’ not meaningful for factors
[1] NA
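The equality operators, by contrast, work fine on an unordered factor. This is a small sketch using a made-up factor named shirts (not part of the data above) to show == and != between elements and against a label.

```r
# A small unordered factor, invented for this illustration.
shirts <- factor( c("red", "blue", "red") )

shirts[1] == shirts[3]    # both elements are "red"
shirts[1] != shirts[2]    # "red" versus "blue"
shirts == "red"           # compare every element against a label
```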
data <- factor( days, ordered=TRUE )
data
[1] Saturday Saturday Thursday Monday Thursday Tuesday Monday Thursday Monday Saturday Wednesday Thursday Tuesday
[14] Friday Friday Wednesday Saturday Sunday Friday Wednesday Saturday Wednesday Monday Monday Wednesday Monday
[27] Tuesday Tuesday Saturday Wednesday Wednesday Tuesday Saturday Sunday Thursday Sunday Wednesday Saturday Saturday
[40] Sunday
Levels: Friday < Monday < Saturday < Sunday < Thursday < Tuesday < Wednesday
If we do not specify the levels, they default to alphabetical order, so that is how the data end up being sorted.
sort( data )
[1] Friday Friday Friday Monday Monday Monday Monday Monday Monday Saturday Saturday Saturday Saturday
[14] Saturday Saturday Saturday Saturday Saturday Sunday Sunday Sunday Sunday Thursday Thursday Thursday Thursday
[27] Thursday Tuesday Tuesday Tuesday Tuesday Tuesday Wednesday Wednesday Wednesday Wednesday Wednesday Wednesday Wednesday
[40] Wednesday
Levels: Friday < Monday < Saturday < Sunday < Thursday < Tuesday < Wednesday
However, this does not make sense. Who in their right mind would like to have Friday followed immediately by Monday? That is just not right!
To establish an ordinal variable with a specified sequence of values that are not alphabetical we need to pass along the levels themselves.
data <- factor( days, ordered=TRUE, levels = weekdays )
data
[1] Saturday Saturday Thursday Monday Thursday Tuesday Monday Thursday Monday Saturday Wednesday Thursday Tuesday
[14] Friday Friday Wednesday Saturday Sunday Friday Wednesday Saturday Wednesday Monday Monday Wednesday Monday
[27] Tuesday Tuesday Saturday Wednesday Wednesday Tuesday Saturday Sunday Thursday Sunday Wednesday Saturday Saturday
[40] Sunday
Levels: Monday < Tuesday < Wednesday < Thursday < Friday < Saturday < Sunday
Now they’ll sort properly.
sort( data )
[1] Monday Monday Monday Monday Monday Monday Tuesday Tuesday Tuesday Tuesday Tuesday Wednesday Wednesday
[14] Wednesday Wednesday Wednesday Wednesday Wednesday Wednesday Thursday Thursday Thursday Thursday Thursday Friday Friday
[27] Friday Saturday Saturday Saturday Saturday Saturday Saturday Saturday Saturday Saturday Sunday Sunday Sunday
[40] Sunday
Levels: Monday < Tuesday < Wednesday < Thursday < Friday < Saturday < Sunday
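Once the levels carry a meaningful order, the inequality operators become useful. Here is a short sketch on a three-element ordered factor. The names dayNames and visits are made up for this example, so it does not depend on the random sample above.

```r
dayNames <- c("Monday","Tuesday","Wednesday","Thursday","Friday","Saturday","Sunday")

# An ordered factor whose levels follow the calendar week.
visits <- factor( c("Saturday","Monday","Thursday"), ordered=TRUE, levels=dayNames )

visits[1] > visits[2]     # is Saturday later in the week than Monday?
visits < "Friday"         # which observations fall before Friday?
max( visits )             # the latest day observed
```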
Once you establish a factor, you cannot set the values to anything that is outside of the pre-defined levels. If you do, R will warn you and just put in missing data (NA).
days[3] <- "Bob"
invalid factor level, NA generated
days
[1] Saturday Saturday <NA> Monday Thursday Tuesday Monday Thursday Monday Saturday Wednesday Thursday Tuesday
[14] Friday Friday Wednesday Saturday Sunday Friday Wednesday Saturday Wednesday Monday Monday Wednesday Monday
[27] Tuesday Tuesday Saturday Wednesday Wednesday Tuesday Saturday Sunday Thursday Sunday Wednesday Saturday Saturday
[40] Sunday
Levels: Friday Monday Saturday Sunday Thursday Tuesday Wednesday
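If a new value really does belong in the factor, one option is to extend the levels before assigning it. This is a minimal sketch on a small made-up factor named staff; "Holiday" is an invented level used only for illustration.

```r
staff <- factor( c("Monday", "Tuesday", "Monday") )

# Append the new level to the existing ones, then the assignment is valid.
levels( staff ) <- c( levels( staff ), "Holiday" )
staff[3] <- "Holiday"

staff
```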
That being said, we can have more levels in the factor than observed in the data. Here is an example of just grabbing the work days from the week but making the levels equal to all the potential weekdays.
workdays <- sample( weekdays[1:5], size=40, replace = TRUE )
workdays <- factor( workdays, ordered=TRUE, levels = weekdays )
And when we summarize it, we see that while it is possible that days may be named Saturday and Sunday, they are not recorded in the data we have for workdays.
summary( workdays )
Monday Tuesday Wednesday Thursday Friday Saturday Sunday
11 6 9 3 11 0 0
We can drop the levels that have no representation
workdays <- droplevels( workdays )
summary( workdays )
Monday Tuesday Wednesday Thursday Friday
11 6 9 3 11
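Unused levels also show up whenever we subset a factor, because the subset keeps every level of the original. A small sketch of that behavior, using a made-up factor named sites:

```r
sites <- factor( c("North", "South", "East", "North") )

# Remove the "East" observation; the level itself is retained.
noEast <- sites[ sites != "East" ]
levels( noEast )
summary( noEast )

# Two ways to get rid of the empty level.
levels( droplevels( noEast ) )
levels( sites[ sites != "East", drop=TRUE ] )
```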
forcats
The forcats library has a bunch of helper functions for working with factors. This is a relatively small library in tidyverse but a powerful one. I would recommend looking at the cheatsheet for it to get a broader understanding of what the functions in this library can do.
library( forcats )
Just like stringr had the str_ prefix, all the functions here have the fct_ prefix. Here are some examples.
Counting how many of each factor
fct_count( data )
Lumping Rare Factors
lumped <- fct_lump_min( data, min = 5 )
fct_count( lumped )
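To see what lumping does without depending on a random sample, here is a small sketch on a made-up factor named pets. Any level observed fewer than min times is folded into a single level, which forcats labels "Other" by default.

```r
library( forcats )

pets <- factor( c("dog", "dog", "dog", "cat", "cat", "fish", "newt") )

# "fish" and "newt" each occur once, so with min = 2 they are lumped together.
lumpedPets <- fct_lump_min( pets, min = 2 )

levels( lumpedPets )
fct_count( lumpedPets )
```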
Reordering Factor Levels by Frequency
freq <- fct_infreq( data )
levels( freq )
[1] "Saturday" "Wednesday" "Monday" "Tuesday" "Thursday" "Sunday" "Friday"
Reordering by Order of Appearance
ordered <- fct_inorder( data )
levels( ordered )
[1] "Saturday" "Thursday" "Monday" "Tuesday" "Wednesday" "Friday" "Sunday"
Reordering Specific Levels
newWeek <- fct_relevel( data, "Sunday")
levels( newWeek )
[1] "Sunday" "Monday" "Tuesday" "Wednesday" "Thursday" "Friday" "Saturday"
Dropping Unobserved Levels - just like droplevels()
dropped <- fct_drop( workdays )
summary( dropped )
Monday Tuesday Wednesday Thursday Friday
11 6 9 3 11
It is common to use factors as an organizing principle in our data. For example, let’s say we went out and sampled three different species of plants and measured the size of their flowers. The iris data set from R.A. Fisher is a classic data set that is included in R and it looks like this.
iris
By default it is a data.frame object.
class( iris )
[1] "data.frame"
One helpful function in base R is the by() function. It has the following form.
by( data, index, function)
The data is the raw data you are using, the index is a vector that we are using to differentiate among the species (the factor), and the function is the function we want to apply to each group.
So for example, if I were interested in the mean length of the sepal for each species, I could write:
meanSepalLength <- by( iris$Sepal.Length, iris$Species, mean )
class( meanSepalLength )
[1] "by"
meanSepalLength
iris$Species: setosa
[1] 5.006
-----------------------------------------------------------------------------------------------------------
iris$Species: versicolor
[1] 5.936
-----------------------------------------------------------------------------------------------------------
iris$Species: virginica
[1] 6.588
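The data passed to by() does not have to be a single column. As a sketch, handing it the four measurement columns of iris along with colMeans() gives the mean of every trait for each species in one call. The name traitMeans is just for illustration.

```r
# Apply colMeans() to the rows belonging to each species.
traitMeans <- by( iris[, 1:4], iris$Species, colMeans )

# Pull out the result for a single species by its level name.
traitMeans[["setosa"]]
```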
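I could also do the same thing with the variance, this time indexing the columns by position. Be careful when doing so: column 1 of iris is Sepal.Length and column 2 is Sepal.Width, so the call below, as written, returns the variance in sepal width even though the result is named varSepalLength. Using iris[,1] would give the variance in sepal length.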
by( iris[,2], iris[,5], var ) -> varSepalLength
varSepalLength
iris[, 5]: setosa
[1] 0.1436898
-----------------------------------------------------------------------------------------------------------
iris[, 5]: versicolor
[1] 0.09846939
-----------------------------------------------------------------------------------------------------------
iris[, 5]: virginica
[1] 0.1040041
Using these kinds of functions we can create a summary data frame.
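library( tibble )
# as.numeric() strips the "by" class so each column is a plain numeric vector
df <- tibble( Species = levels( iris$Species ),
Average = as.numeric( meanSepalLength ),
Variance = as.numeric( varSepalLength )
)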
df
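A similar summary can be sketched in base R alone by referring to the column by name rather than by position, which makes it harder to pick up the wrong measurement by accident. Here tapply() plays the role of by() and returns one value per level. The name sepalSummary is an assumption made for this example.

```r
sepalSummary <- data.frame(
  Species  = levels( iris$Species ),
  # tapply() applies the function to Sepal.Length within each species
  Average  = as.numeric( tapply( iris$Sepal.Length, iris$Species, mean ) ),
  Variance = as.numeric( tapply( iris$Sepal.Length, iris$Species, var ) )
)

sepalSummary
```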
Missing data is a fact of life and R is very opinionated about how it handles missing values. In general, missing data is encoded as NA and is a valid entry for any data type (character, numeric, logical, factor, etc.). Where this becomes tricky is when we are doing operations on data that has missing values. R
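could take two routes: quietly ignore the missing values and return an answer anyway, or return NA so that you know something is missing.
Fortunately, R took the second route.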
As an example, I’m going to take part of the iris data and add some missing data to it.
missingIris <- iris[, 4:5]
missingIris$Petal.Width[ c(2,6,12) ] <- NA
summary( missingIris )
Petal.Width Species
Min. :0.100 setosa :50
1st Qu.:0.300 versicolor:50
Median :1.300 virginica :50
Mean :1.218
3rd Qu.:1.800
Max. :2.500
NA's :3
Notice how the missing data is denoted in the summary.
When we perform a mathematical or statistical operation such as mean() on data that has missing elements, R will by default return NA as the result.
mean( missingIris$Petal.Width )
[1] NA
This warns you that at least one of the observations in the data is missing.
The same happens when using by(): it will put NA into each level that has at least one missing value.
by( missingIris$Petal.Width, missingIris$Species, mean )
missingIris$Species: setosa
[1] NA
-----------------------------------------------------------------------------------------------------------
missingIris$Species: versicolor
[1] 1.326
-----------------------------------------------------------------------------------------------------------
missingIris$Species: virginica
[1] 2.026
To acknowledge that there are missing data and still get a value, you need to tell the function you are using that you are OK with the missing data by passing the optional argument na.rm=TRUE (na = missing data and rm = remove).
mean( missingIris$Petal.Width, na.rm=TRUE)
[1] 1.218367
To pass this to the by() function, we add the optional argument na.rm=TRUE and by() passes it along to the mean function through its “…” argument.
by( missingIris$Petal.Width, missingIris$Species, mean, na.rm=TRUE )
missingIris$Species: setosa
[1] 0.2446809
-----------------------------------------------------------------------------------------------------------
missingIris$Species: versicolor
[1] 1.326
-----------------------------------------------------------------------------------------------------------
missingIris$Species: virginica
[1] 2.026
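Before removing missing values it is often worth checking how many there are and where they fall. This is a minimal sketch using is.na() on the same modified iris columns. The names nMissing, missingBySpecies, and completeIris are made up for this example.

```r
missingIris <- iris[, 4:5]
missingIris$Petal.Width[ c(2,6,12) ] <- NA

# is.na() returns TRUE/FALSE per element; summing counts the TRUE values.
nMissing <- sum( is.na( missingIris$Petal.Width ) )
nMissing

# Count the missing values within each species.
missingBySpecies <- tapply( is.na( missingIris$Petal.Width ), missingIris$Species, sum )
missingBySpecies

# Keep only the rows where Petal.Width was actually observed.
completeIris <- missingIris[ !is.na( missingIris$Petal.Width ), ]
nrow( completeIris )
```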
Making data frames like the one above is a classic maneuver in R, and I’m going to use this one to introduce the knitr library and show you how to take a set of data and turn it into a table for your manuscript.
library( knitr )
Now we can make a table as:
kable( df )
Species | Average | Variance |
---|---|---|
setosa | 5.006 | 0.14368980 |
versicolor | 5.936 | 0.09846939 |
virginica | 6.588 | 0.10400408 |
We can even add a caption to it.
irisTable <- kable( df, caption = "The mean and variance in measured sepal length (in cm) for three species of Iris.")
irisTable
Species | Average | Variance |
---|---|---|
setosa | 5.006 | 0.14368980 |
versicolor | 5.936 | 0.09846939 |
virginica | 6.588 | 0.10400408 |
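The kable() function also takes arguments that control how the numbers and headings are displayed. As a sketch, the digits argument rounds numeric columns and col.names replaces the column headings. The data frame sepalSummary and the heading text are assumptions made so the example runs on its own, and format="pipe" is used only to get plain-text output.

```r
library( knitr )

# A stand-alone summary of sepal length by species (names are illustrative).
sepalSummary <- data.frame(
  Species  = levels( iris$Species ),
  Average  = as.numeric( tapply( iris$Sepal.Length, iris$Species, mean ) ),
  Variance = as.numeric( tapply( iris$Sepal.Length, iris$Species, var ) )
)

# Round to three decimal places and give the columns friendlier headings.
roundedTable <- kable( sepalSummary,
                       digits = 3,
                       col.names = c("Species", "Mean (cm)", "Variance"),
                       format = "pipe" )
roundedTable
```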
In addition to this basic library, there is a kableExtra library that allows us to get even more fancy. Go check out this webpage (which is an RMarkdown page, by the way) to see all the other ways you can fancy up your tables.
library( kableExtra )
Here are some example themes.
kable_paper( irisTable )
Species | Average | Variance |
---|---|---|
setosa | 5.006 | 0.14368980 |
versicolor | 5.936 | 0.09846939 |
virginica | 6.588 | 0.10400408 |
kable_classic( irisTable )
Species | Average | Variance |
---|---|---|
setosa | 5.006 | 0.14368980 |
versicolor | 5.936 | 0.09846939 |
virginica | 6.588 | 0.10400408 |
kable_classic_2( irisTable )
Species | Average | Variance |
---|---|---|
setosa | 5.006 | 0.14368980 |
versicolor | 5.936 | 0.09846939 |
virginica | 6.588 | 0.10400408 |
kable_minimal( irisTable )
Species | Average | Variance |
---|---|---|
setosa | 5.006 | 0.14368980 |
versicolor | 5.936 | 0.09846939 |
virginica | 6.588 | 0.10400408 |
kable_material( irisTable,lightable_options = c("striped", "hover") )
Species | Average | Variance |
---|---|---|
setosa | 5.006 | 0.14368980 |
versicolor | 5.936 | 0.09846939 |
virginica | 6.588 | 0.10400408 |
kable_material_dark( irisTable )
Species | Average | Variance |
---|---|---|
setosa | 5.006 | 0.14368980 |
versicolor | 5.936 | 0.09846939 |
virginica | 6.588 | 0.10400408 |
We can be specific about the size and location of the whole table.
kable_paper(irisTable, full_width = FALSE )
Species | Average | Variance |
---|---|---|
setosa | 5.006 | 0.14368980 |
versicolor | 5.936 | 0.09846939 |
virginica | 6.588 | 0.10400408 |
kable_paper( irisTable, full_width=FALSE, position="right")
Species | Average | Variance |
---|---|---|
setosa | 5.006 | 0.14368980 |
versicolor | 5.936 | 0.09846939 |
virginica | 6.588 | 0.10400408 |
And even embed it in a bunch of text and float it to the left or right (I added echo=FALSE to the chunk header so it hides itself).
Lorem ipsum dolor sit amet, consectetur adipiscing elit. Ut blandit libero sit amet porta elementum. In imperdiet tellus non odio porttitor auctor ac sit amet diam. Suspendisse eleifend vel nisi nec efficitur. Ut varius urna lectus, ac iaculis velit bibendum eget. Curabitur dignissim magna eu odio sagittis blandit. Vivamus sed ipsum mi. Etiam est leo, mollis ultrices dolor eget, consectetur euismod augue. In hac habitasse platea dictumst. Integer blandit ante magna, quis volutpat velit varius hendrerit. Vestibulum sit amet lacinia magna. Sed at varius nisl. Donec eu porta tellus, vitae rhoncus velit.
Species | Average | Variance |
---|---|---|
setosa | 5.006 | 0.14368980 |
versicolor | 5.936 | 0.09846939 |
virginica | 6.588 | 0.10400408 |
Maecenas euismod mattis neque. Ut at sapien lacinia, vehicula felis vitae, laoreet odio. Cras ut magna sed sapien scelerisque auctor maximus tincidunt arcu. Praesent vel accumsan leo. Etiam tempor leo placerat, commodo ante eu, posuere ligula. Sed purus justo, feugiat vel volutpat in, faucibus quis sem. Vivamus enim lacus, ultrices id erat in, posuere fringilla est. Nulla porttitor ac nunc nec efficitur. Duis tincidunt metus leo, at lacinia orci tristique in.
Nulla nec elementum nibh, quis congue augue. Vivamus fermentum nec mauris nec vehicula. Proin laoreet sapien quis orci mollis, et condimentum ante tempor. Vivamus hendrerit ut sem a iaculis. Quisque mauris enim, accumsan sit amet fermentum quis, convallis a nisl. Donec elit orci, consectetur id vestibulum in, elementum nec magna. In lobortis erat velit. Nam sit amet finibus arcu.
We can do some really cool stuff on row and column headings. Here is an example where I add another row above the data columns for output.
classic <- kable_paper( irisTable )
add_header_above( classic, c(" " = 1, "Sepal Length (cm)" = 2))
 | Sepal Length (cm) | |
Species | Average | Variance |
---|---|---|
setosa | 5.006 | 0.14368980 |
versicolor | 5.936 | 0.09846939 |
virginica | 6.588 | 0.10400408 |