In the previous section, we examined how to recast the data into new coordinate spaces for visualization. Another approach to finding structure in multivariate data is to cluster the observations. Hierarchical clustering is very helpful for understanding groupings in the data, particularly if there is a ‘nesting’ structure. While there are many ways to do it, they all generally proceed as follows:
1. Define a numeric metric that measures the distances between all \(K\) observations.
2. Find the two groups with the smallest distance and coalesce them into a pair.
3. Treating the coalesced pair as a single entity, estimate the numeric metric among the remaining \(K-1\) groups in the data set.
4. Go back to step 2 and repeat until you have coalesced all of the groups together (a rough code sketch of this loop follows the list).
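Before stepping through an example by hand, here is a minimal sketch of that loop in R. It assumes each group is a row of a coordinate matrix and that a merged group is represented by the mean (centroid) of the two rows it replaces; the function name coalesce_all is purely illustrative, not from any package.
coalesce_all <- function(coords) {
  while( nrow(coords) > 1 ) {
    d <- as.matrix( dist( coords ) )                  # step 1: pairwise distances
    diag( d ) <- Inf                                  # ignore self-distances
    ij <- which( d == min(d), arr.ind=TRUE )[1,]      # step 2: the closest pair
    centroid <- colMeans( coords[ij,] )               # step 3: merge into a single entity
    label <- paste0( "(", rownames(coords)[ij[1]], rownames(coords)[ij[2]], ")" )
    coords <- rbind( coords[-ij, , drop=FALSE], centroid )
    rownames(coords)[ nrow(coords) ] <- label
    cat( "groups:", rownames(coords), "\n" )          # step 4: repeat with one fewer group
  }
  invisible( coords )
}
Applied to the coords matrix built in the next chunk, it performs the same sequence of centroid merges we walk through manually below (though the two labels inside a pair may print in the opposite order).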
Here is an example of how this works. Consider the data below:
x <- c(1, 2, 0, 3, 6)
y <- c(0, 3, 4, 2, 3)
coords <- cbind(x,y)
rownames(coords) <- LETTERS[1:5]
coords
x y
A 1 0
B 2 3
C 0 4
D 3 2
E 6 3
If you assume a Euclidean distance, \(d_{ij} = \sqrt{(x_i - x_j)^2 + (y_i - y_j)^2}\), the pairwise distances are:
dist( coords )
A B C D
B 3.162278
C 4.123106 2.236068
D 2.828427 1.414214 3.605551
E 5.830952 4.000000 6.082763 3.162278
Notice how the rownames on the coords object are carried over into the distance matrix. Here, \(B\) and \(D\) are the closest objects. We can coalesce them by taking their mean coordinates. Below, I manually replace the 2nd and 4th entries with the mean values and label the new group \((BD)\).
x <- c(1, (2+3)/2, 0, 6)
y <- c(0, (3+2)/2, 4, 3)
coords <- cbind(x,y)
rownames(coords) <- c("A","(BD)","C","E")
coords
x y
A 1.0 0.0
(BD) 2.5 2.5
C 0.0 4.0
E 6.0 3.0
dist(coords)
A (BD) C
(BD) 2.915476
C 4.123106 2.915476
E 5.830952 3.535534 6.082763
Now, \(A\) and \((BD)\) are the closest pair (tied, in fact, with \(C\) and \((BD)\)), so I coalesce them into \((A(BD))\), take their mean coordinates, and repeat.
x <- c( (1 +(2+3)/2)/2, 0, 6)
y <- c( (0 +(3+2)/2)/2, 4, 3)
coords <- cbind(x,y)
rownames(coords) <- c("(A(BD))","C","E")
coords
x y
(A(BD)) 1.75 1.25
C 0.00 4.00
E 6.00 3.00
dist(coords)
(A(BD)) C
C 3.259601
E 4.596194 6.082763
Then, one more time:
x <- c( ((1 +(2+3)/2)/2 + 0)/2, 6)
y <- c( ((0 +(3+2)/2)/2+ 4)/2, 3)
coords <- cbind(x,y)
rownames(coords) <- c("((A(BD))C)","E")
coords
x y
((A(BD))C) 0.875 2.625
E 6.000 3.000
dist(coords)
((A(BD))C)
E 5.138701
This obviously results in \((((A(BD))C)E)\), since \(E\) is the only group left to join. So \(B\) and \(D\) group together first, then \(A\) attaches to that group, then \(C\), and finally \(E\). Boom! You just did a hierarchical clustering!
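Of course, R will do this bookkeeping for us with hclust(). As a quick sanity check, here is a sketch that rebuilds the original five points (the coords object was overwritten above) and clusters them. Note that hclust() defaults to complete linkage rather than the coordinate averaging we used by hand, but for these five points it happens to yield the same nesting.
x <- c(1, 2, 0, 3, 6)
y <- c(0, 3, 4, 2, 3)
coords <- cbind(x,y)
rownames(coords) <- LETTERS[1:5]
plot( hclust( dist( coords ) ) )    # dendrogram: B & D first, then A, then C, then E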
For a slightly more complicated example, we will use the beer styles data set, which can be loaded from the GitHub site.
library( tidyverse )   # provides read_csv(), mutate(), and %>% (if not already loaded)
url <- "https://raw.githubusercontent.com/dyerlab/ENVS-Lectures/master/data/Beer_Styles.csv"
read_csv( url ) %>%
  mutate( Yeast = as.factor( Yeast ) ) -> data
summary(data)
Styles Yeast ABV_Min ABV_Max
Length:100 Ale :69 Min. :2.400 Min. : 3.200
Class :character Either: 4 1st Qu.:4.200 1st Qu.: 5.475
Mode :character Lager :27 Median :4.600 Median : 6.000
Mean :4.947 Mean : 6.768
3rd Qu.:5.500 3rd Qu.: 8.000
Max. :9.000 Max. :14.000
IBU_Min IBU_Max SRM_Min SRM_Max
Min. : 0.00 Min. : 8.00 Min. : 2.00 Min. : 3.00
1st Qu.:15.00 1st Qu.: 25.00 1st Qu.: 3.50 1st Qu.: 7.00
Median :20.00 Median : 35.00 Median : 8.00 Median :17.00
Mean :21.97 Mean : 38.98 Mean : 9.82 Mean :17.76
3rd Qu.:25.00 3rd Qu.: 45.00 3rd Qu.:14.00 3rd Qu.:22.00
Max. :60.00 Max. :120.00 Max. :30.00 Max. :40.00
OG_Min OG_Max FG_Min FG_Max
Min. :1.026 Min. :1.032 Min. :0.998 Min. :1.006
1st Qu.:1.040 1st Qu.:1.052 1st Qu.:1.008 1st Qu.:1.012
Median :1.046 Median :1.060 Median :1.010 Median :1.015
Mean :1.049 Mean :1.065 Mean :1.009 Mean :1.016
3rd Qu.:1.056 3rd Qu.:1.075 3rd Qu.:1.010 3rd Qu.:1.018
Max. :1.080 Max. :1.130 Max. :1.020 Max. :1.040
These data give a range of values for each style, but for what we are doing here it is easier to work with the midpoint of each range. So let’s fix that.
data %>%
mutate( ABV=( ABV_Max+ABV_Min)/2,
IBU=( IBU_Max+IBU_Min)/2,
SRM=( SRM_Max+SRM_Min)/2,
OG=( OG_Max+OG_Min)/2,
FG=( FG_Max+FG_Min)/2 ) %>%
select( Styles, Yeast, ABV, IBU, SRM, OG, FG) -> beers
summary( beers)
Styles Yeast ABV IBU
Length:100 Ale :69 Min. : 2.850 Min. : 5.00
Class :character Either: 4 1st Qu.: 4.900 1st Qu.:21.38
Mode :character Lager :27 Median : 5.300 Median :26.25
Mean : 5.857 Mean :30.48
3rd Qu.: 6.750 3rd Qu.:37.50
Max. :11.500 Max. :90.00
SRM OG FG
Min. : 2.50 Min. :1.030 Min. :1.003
1st Qu.: 5.00 1st Qu.:1.047 1st Qu.:1.010
Median :12.75 Median :1.052 Median :1.012
Mean :13.79 Mean :1.057 Mean :1.013
3rd Qu.:18.00 3rd Qu.:1.065 3rd Qu.:1.014
Max. :35.00 Max. :1.100 Max. :1.029
m <- as.matrix( beers[, 3:7] )     # just the numeric columns (ABV, IBU, SRM, OG, FG)
rownames(m) <- beers$Styles        # label each row by its style
beer_dist <- dist( m )             # pairwise Euclidean distances among styles
h <- hclust( beer_dist )           # agglomerative clustering (complete linkage by default)
h
Call:
hclust(d = beer_dist)
Cluster method : complete
Distance : euclidean
Number of objects: 100
plot(h, cex=0.5)
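If, instead of the full tree, you just want to assign each style to a discrete group, you can cut the dendrogram with cutree(). A minimal sketch, assuming (arbitrarily) that we want four groups:
grps <- cutree( h, k=4 )        # a named vector giving each style's group membership
table( grps, beers$Yeast )      # how the groups line up with yeast type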
The dendrogram itself is a bit crowded, so let’s play around with a little interactive interface to the data. This will not show up in the PDF version; you’ll have to look at the web version.
library( networkD3 )
grp_colors <- c("red","green","blue")[as.numeric(as.factor(beers$Yeast))]   # one color per Yeast level (Ale, Either, Lager)
dendroNetwork( h, height=2400,
zoom=TRUE,
textColour=grp_colors)
There are other interactive ways to look at your data. A good set of examples can be found at https://christophergandrud.github.io/networkD3/