Ordination is a technique for taking high-dimensional data and representing it in a lower-dimensional space.
Ordination is a collective term for multivariate techniques which summarize a multidimensional dataset in such a way that when it is projected onto a low dimensional space, any intrinsic pattern the data may possess becomes apparent upon visual inspection [1].
One of the largest challenges in data analysis is understanding the data and drawing inferences from them. The challenge is compounded when many different kinds of data describe each observation. For example, at a particular vernal pool, we may have measured pool size, pool depth, elevation, rainfall, temperature, canopy cover, pH, aquatic vegetation, species1 density, species2 density, etc. To describe all of these variables we could either plot every pairwise combination of them or be a bit clever and use some ordination approaches.
For this activity, I am going to use a data set of beer styles to explain a couple of different types of ordination. It is available as the raw CSV file.
library(tidyverse)  # provides read_csv(), mutate(), %>%, and ggplot()
url <- "https://raw.githubusercontent.com/dyerlab/ENVS-Lectures/master/data/Beer_Styles.csv"
read_csv( url ) %>%
  mutate( Yeast = as.factor( Yeast ) ) -> data
summary(data)
     Styles          Yeast        ABV_Min         ABV_Max          IBU_Min         IBU_Max          SRM_Min
Length:100         Ale   :69   Min.   :2.400   Min.   : 3.200   Min.   : 0.00   Min.   :  8.00   Min.   : 2.00
Class :character   Either: 4   1st Qu.:4.200   1st Qu.: 5.475   1st Qu.:15.00   1st Qu.: 25.00   1st Qu.: 3.50
Mode  :character   Lager :27   Median :4.600   Median : 6.000   Median :20.00   Median : 35.00   Median : 8.00
                               Mean   :4.947   Mean   : 6.768   Mean   :21.97   Mean   : 38.98   Mean   : 9.82
                               3rd Qu.:5.500   3rd Qu.: 8.000   3rd Qu.:25.00   3rd Qu.: 45.00   3rd Qu.:14.00
                               Max.   :9.000   Max.   :14.000   Max.   :60.00   Max.   :120.00   Max.   :30.00
   SRM_Max         OG_Min          OG_Max          FG_Min          FG_Max
Min.   : 3.00   Min.   :1.026   Min.   :1.032   Min.   :0.998   Min.   :1.006
1st Qu.: 7.00   1st Qu.:1.040   1st Qu.:1.052   1st Qu.:1.008   1st Qu.:1.012
Median :17.00   Median :1.046   Median :1.060   Median :1.010   Median :1.015
Mean   :17.76   Mean   :1.049   Mean   :1.065   Mean   :1.009   Mean   :1.016
3rd Qu.:22.00   3rd Qu.:1.056   3rd Qu.:1.075   3rd Qu.:1.010   3rd Qu.:1.018
Max.   :40.00   Max.   :1.080   Max.   :1.130   Max.   :1.020   Max.   :1.040
These data give a range (minimum and maximum) for each characteristic, but it is probably easier to work with the midpoint of each range.
data %>%
  mutate( ABV=( ABV_Max+ABV_Min)/2,
          IBU=( IBU_Max+IBU_Min)/2,
          SRM=( SRM_Max+SRM_Min)/2,
          OG=( OG_Max+OG_Min)/2,
          FG=( FG_Max+FG_Min)/2 ) %>%
  select( Styles, Yeast, ABV, IBU, SRM, OG, FG) -> beers
summary( beers)
     Styles          Yeast          ABV              IBU             SRM              OG              FG
Length:100         Ale   :69   Min.   : 2.850   Min.   : 5.00   Min.   : 2.50   Min.   :1.030   Min.   :1.003
Class :character   Either: 4   1st Qu.: 4.900   1st Qu.:21.38   1st Qu.: 5.00   1st Qu.:1.047   1st Qu.:1.010
Mode  :character   Lager :27   Median : 5.300   Median :26.25   Median :12.75   Median :1.052   Median :1.012
                               Mean   : 5.857   Mean   :30.48   Mean   :13.79   Mean   :1.057   Mean   :1.013
                               3rd Qu.: 6.750   3rd Qu.:37.50   3rd Qu.:18.00   3rd Qu.:1.065   3rd Qu.:1.014
                               Max.   :11.500   Max.   :90.00   Max.   :35.00   Max.   :1.100   Max.   :1.029
Excellent. If we look at the data now, we can see that there is a moderate amount of correlation among the variables and that all of the characteristics are spread reasonably well across the Yeast types. Here is a pairwise plot of all the data using the GGally::ggpairs() function.
library(GGally)
beers %>%
  select( -Styles ) %>%
  ggpairs()
Principal component analysis (PCA) is a translation of the original data into a new coordinate space. The rotation does not alter the relationships among the data points themselves; it is simply a way to create new coordinates for each data point under the following criteria:
1. The number of axes in the translated data is the same as the number of axes in the original data.
2. Axes are chosen by taking all the data and finding transects through them that account for the broadest variation in the data.
3. Each axis is defined as a linear combination of the original axes.
4. Subsequent axes must be orthogonal to all previous ones (i.e., at 90° angles).
5. The total variation in the system can be partitioned among these new axes, and they are ordered from those that explain the most variation to those that explain the least.
An example of this rotation is given below.
A rotation of 2-dimensional data from the original coordinate space (represented by the x- and y-axes) onto synthetic principal components (the red axes). The rotation itself maximizes the distributional width of the data (depicted as density plots in grey for the original axes and red for the rotated axes).
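The criteria and the rotation described above can be sketched by hand on simulated data. This is a minimal illustration under stated assumptions, not the internals of any particular function: the data are simulated, the names `xy`, `centered`, `eig`, and `scores` are hypothetical, and the axes are obtained here from an eigen decomposition of the covariance matrix.

```r
# Simulated, correlated 2-dimensional data (hypothetical; not the beer data)
set.seed(42)
x <- rnorm(200)
y <- 0.8 * x + rnorm(200, sd = 0.5)
xy <- cbind(x = x, y = y)

# Center the data, then decompose its covariance matrix
centered <- scale(xy, center = TRUE, scale = FALSE)
eig <- eigen(cov(centered))

# Each new axis is a linear combination of the original axes
scores <- centered %*% eig$vectors

ncol(scores) == ncol(xy)               # same number of axes as the original data
crossprod(eig$vectors)                 # approximately the identity: axes are orthogonal
eig$values                             # variance along each new axis, largest first
sum(eig$values) - sum(diag(cov(xy)))   # approximately 0: total variation is partitioned
```

The columns of `scores` are the rotated coordinates; their variances match `eig$values`, which is what the ordering criterion refers to.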
To conduct this rotation on our data, we use the function prcomp(). It does the rotation and returns an analysis object that has all the information we need in it.
pc.fit <- prcomp(beers[,3:7])
names( pc.fit)
[1] "sdev" "rotation" "center" "scale" "x"
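One detail of `prcomp()` worth keeping in mind: by default it centers each variable but does not rescale it, so variables measured in large units can dominate the first axes. The sketch below uses simulated data (the column names `big` and `small` are hypothetical) to show the effect of the `scale.` argument. Whether scaling is appropriate for the beer data is a judgment call that this section does not make.

```r
# Two simulated variables on very different scales (hypothetical data)
set.seed(1)
sim <- data.frame(
  big   = rnorm(100, mean = 30, sd = 15),     # large units
  small = rnorm(100, mean = 1.05, sd = 0.01)  # small units
)

fit_raw    <- prcomp(sim)                 # centered only (the default)
fit_scaled <- prcomp(sim, scale. = TRUE)  # centered and scaled to unit variance

fit_raw$scale                       # FALSE: no scaling was applied
fit_scaled$scale                    # the standard deviations used for scaling
abs(fit_raw$rotation[, "PC1"])      # dominated by `big`
abs(fit_scaled$rotation[, "PC1"])   # both variables contribute equally
```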
If we look at the raw analysis output, we see the standard deviation along each of the new axes (a measure of how much variation each one captures) as well as the loadings (i.e., the linear combinations of the original data that translate the old coordinates into the new ones).
pc.fit
Standard deviations (1, .., p=5):
[1] 17.407474934  8.593744939  1.537237291  0.004779020  0.001954314

Rotation (n x k) = (5 x 5):
             PC1           PC2          PC3           PC4           PC5
ABV 0.0500642463  0.0005333582 -0.998701196 -8.619900e-03 -3.860675e-03
IBU 0.9773017876  0.2060831568  0.049101404  9.539617e-06  2.089354e-05
SRM 0.2058507906 -0.9785343146  0.009797993 -1.998978e-04  8.053623e-05
OG  0.0004724918 -0.0001184112 -0.009304602  8.330098e-01  5.531798e-01
FG  0.0001261491 -0.0001705335 -0.001548091  5.531911e-01 -8.330529e-01
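To make the role of the loadings concrete, the sketch below recomputes the new coordinates by hand: subtract each column's mean, then multiply by the rotation matrix. It uses simulated data with hypothetical column names rather than the beer data, so it can be run on its own.

```r
# Simulated data with three numeric columns (hypothetical; not the beer data)
set.seed(7)
sim <- data.frame(a = rnorm(50), b = rnorm(50, sd = 3), c = runif(50))
fit <- prcomp(sim)

# Subtract the column means stored in fit$center, then apply the loadings
centered <- sweep(as.matrix(sim), 2, fit$center)
by_hand  <- centered %*% fit$rotation

# Compare with the coordinates prcomp() stores in fit$x
all.equal(by_hand, fit$x, check.attributes = FALSE)
```

Each column of `fit$rotation` holds the weights of one linear combination, so each new coordinate is a weighted sum of the centered original variables.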
We can plot the fit object; by default, the plot shows the variance accounted for by each axis.
plot( pc.fit )
This rotation seems to be able to produce axes that account for a lot of the underlying variation. Because variance is the square of the standard deviation, the proportion explained by each axis is computed from the squared values. Here is a synopsis:
format( pc.fit$sdev^2 / sum( pc.fit$sdev^2 ), digits=3)
[1] "7.99e-01" "1.95e-01" "6.23e-03" "6.02e-08" "1.01e-08"
So, the first axis describes about 80% of the variation and the second about 19%, etc.
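As a cross-check, offered as a sketch: the share of variance carried by each axis is conventionally computed from the squared standard deviations, and `summary()` on a `prcomp` object reports the same quantity in its `importance` table. The example uses the built-in `USArrests` data rather than the beer data so that it runs without a download.

```r
fit <- prcomp(USArrests)

# Variance along each axis is the squared standard deviation
prop_var <- fit$sdev^2 / sum(fit$sdev^2)

# summary() stores the same proportions (rounded) in its importance table
reported <- summary(fit)$importance["Proportion of Variance", ]

round(prop_var, 3)
round(unname(reported), 3)
```

The two vectors should agree up to rounding, and the proportions sum to 1.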
We can plot the original data points, projected into this new coordinate space.
data.frame( predict( pc.fit )) %>%
  mutate( Yeast = beers$Yeast,
          Style = beers$Styles ) -> predicted
ggplot( predicted ) +
  geom_point( aes(PC1, PC2, color=Yeast), size=4 )
[1] Pielou EC (1984) The interpretation of ecological data: A primer on classification and ordination. 288 pp. ISBN: