12 Summary Statistics

In this secion we close out the exploring data section by defining measures of central tendency, dispersion, and explore a few ways to ordinate data.

For these data, we will use the iris data set as an example. This data set is built into R and is has the quantitative measurements for Sepal Width, Sepal Length, Petal Length, and Petal Width for 150 samples taken from individuals in three different species of Iris.

summary(iris)

##   Sepal.Length    Sepal.Width     Petal.Length    Petal.Width   
##  Min.   :4.300   Min.   :2.000   Min.   :1.000   Min.   :0.100  
##  1st Qu.:5.100   1st Qu.:2.800   1st Qu.:1.600   1st Qu.:0.300  
##  Median :5.800   Median :3.000   Median :4.350   Median :1.300  
##  Mean   :5.843   Mean   :3.057   Mean   :3.758   Mean   :1.199  
##  3rd Qu.:6.400   3rd Qu.:3.300   3rd Qu.:5.100   3rd Qu.:1.800  
##  Max.   :7.900   Max.   :4.400   Max.   :6.900   Max.   :2.500  
##        Species  
##  setosa    :50  
##  versicolor:50  
##  virginica :50  
##                 
##                 
##

12.1 Central Tendency

Central tendency is a term we use to define the center of gravity of some data. There are several measures of central tendency that we commonly use. In R, only the arithmetic mean is included by default, to estimate the remaining ones, we can install the DescTools package.

install.packages("DescTools")

Once installed, we need to load it into memory.

library(DescTools)

and we are ready to go.

Arithmetic Mean - The arithmetic mean is the ‘average’ that we use on the common vernacular and is defined as:

\[ \mu = \frac{1}{N} \sum_{i=1}^N x_i \]

In R we estimate it using the mean() function.

mean( iris$Sepal.Width )

## [1] 3.057333

We can apply this across groupings in our data using the by() function (as we’ve done previously) to measure the mean of measurements across groups of observations.

by( iris$Sepal.Width, iris$Species, mean )

## iris$Species: setosa
## [1] 3.428
## -------------------------------------------------------- 
## iris$Species: versicolor
## [1] 2.77
## -------------------------------------------------------- 
## iris$Species: virginica
## [1] 2.974

Geometric Mean - The arithmetic mean is not the only way to measure the center of gravity of your data. The geometric mean is defined as the \(n^{th}\) root of the product of measurements.

\[ \mu_{g} = \left( \prod_{i=1}^N x_i \right)^{\frac{1}{N}} \]

library(DescTools)
Gmean(iris$Sepal.Width )

## [1] 3.026598

The geometric mean is often used to compare values that are multiples or exponential. For example, in the Iris data, it may be more appropriate to evaluate differences in sepal area using the geometric mean rather than the arithmetic one.

by( iris$Sepal.Width*iris$Sepal.Length, iris$Species, Gmean )

## iris$Species: setosa
## [1] 17.01442
## -------------------------------------------------------- 
## iris$Species: versicolor
## [1] 16.27452
## -------------------------------------------------------- 
## iris$Species: virginica
## [1] 19.39149

Harmonic Mean - The harmonic mean is an approach commonly applied to fractional data (percentages) or data that has outliers. It is estimated as the recipricol of the arithmetic mean of reciprocals…

\[ \mu_{h} = \frac{1}{\frac{1}{N} \left( \frac{1}{x_1} + \frac{1}{x_2} + \cdots + \frac{1}{x_N} \right)} \]

The effects of outliers can be seen in the example:

x <- c(2,3,5,6,100)
mean(x)

## [1] 23.2

Hmean(x)

## [1] 4.132231

In our iris data, we can apply the Hmean as before across species

by( iris$Sepal.Width, iris$Species, Hmean )

## iris$Species: setosa
## [1] 3.385537
## -------------------------------------------------------- 
## iris$Species: versicolor
## [1] 2.733011
## -------------------------------------------------------- 
## iris$Species: virginica
## [1] 2.940087

Median - The median is the center of the data, the value where half of the observations are larger and half are smaller—the 50\(^{th}\) quantile.

median( iris$Sepal.Width )

## [1] 3

by( iris$Sepal.Width, iris$Species, median )

## iris$Species: setosa
## [1] 3.4
## -------------------------------------------------------- 
## iris$Species: versicolor
## [1] 2.8
## -------------------------------------------------------- 
## iris$Species: virginica
## [1] 3

This is a rank-based measure of central tendency and one that we saw in the Normality discussion last week.

12.2 Dispersion

In addition to the central tendency, for us to describe the data, we also need to know a bit about the dispersion of values around the center of gravity.

Range - The range is the physical separation between the smallest and largest values.

range( iris$Sepal.Width )

## [1] 2.0 4.4

by( iris$Sepal.Width, iris$Species, range )

## iris$Species: setosa
## [1] 2.3 4.4
## -------------------------------------------------------- 
## iris$Species: versicolor
## [1] 2.0 3.4
## -------------------------------------------------------- 
## iris$Species: virginica
## [1] 2.2 3.8

It is identical to:

c( min(iris$Sepal.Width), max( iris$Sepal.Width))

## [1] 2.0 4.4

Sample Variance - The variance of the data, \(\sigma^2\), is defined as the average distance between the observations and the arithmetic mean of the observations. It is not quite the ‘average’ since we need to punish ourselves for estimation the mean as we’ve lost a degree of freedom. We will come back to this later in a more detailed way, just take it on faith right now. The idealized formula is:

\[ s^2 = \frac{1}{N-1}\sum_{i=1}^N (x_i - \mu)^2 \]

which you should never use when using a computer to estimate the sample variance. There are some significant problems with round-off error that cause this theoretical formula to produce incorrect results.

In R we estimate the variance as:

var( iris$Sepal.Width )

## [1] 0.1899794

and can be applied across groups as:

by( iris$Sepal.Width, iris$Species, var)

## iris$Species: setosa
## [1] 0.1436898
## -------------------------------------------------------- 
## iris$Species: versicolor
## [1] 0.09846939
## -------------------------------------------------------- 
## iris$Species: virginica
## [1] 0.1040041

where we can see that I. setosa has more variation in sepal width than the other species.

Standard Deviation - The units on the variance are not same as the units for the original data, since it is the deviance squared. To interpret the dispersion in a way that is comparable, we take the square root of the variance - a term called the standard deviation. And while it has its own formula

sd( iris$Sepal.Width)

## [1] 0.4358663

it is identical to

sqrt( var( iris$Sepal.Width))

## [1] 0.4358663

It is common to use the standard deviation to designate some spread of the data around the mean using error bars. Here is an example using the barplot() and arrows() functions (n.b. I return the value of the x-coordinates from the bar plot function to use in the arrows function).

mu <- by( iris$Sepal.Width, iris$Species, mean )
b <- barplot(mu, ylim=c(0,4))
sepal.sd <- by( iris$Sepal.Width, iris$Species, sd )
x <- b
y0 <- as.numeric( mu )
y1 <- y0 + as.numeric( sepal.sd)
arrows( x,y0,x,y1, angle = 90 )

12.3 Ordination

There is a group of routines that are used to visualize and reformat data called “ordination.” These approaches are not necessarily analyses, they are more commonly used to understand the underlying data or to reduce the dimensionality of multivariate data.

In this section, we are going to explore a bit about principal component (PC) rotation. A PC rotation is one that takes the original columns of data and performs a rotation on the values to align onto new ‘synthetic’ axes. Consider the example in the next figure. Here, some bivariate data is plot in 2-space, though this can be done for much higher dimensions of data as well—in fact it is more beneficial with more columns of data and this can be used as a way of reducing the dimensionality of the data while loosing very little (or no) content. The axes of a PC rotation are taken as linear combinations of the existing axes and define a new coordinate set onto which the points are plot. All points are rigidly constrained to keep the same relationship and there is no loss of information. The PC axes are defined by determining the most variable stretch through the data. In the figure, we see the raw data plot onto the X- and Y-axes. The axis of highest variance does not align with either of the original ones, and instead can be defined as a combination of both X- and Y- coordinates.

If we take the blue axis as the first PC axis, the coordinate of the points would be taken along that new synthetic axis. The next PC axis is defined as being perpendicular to the previous one(s) and is identified as covering the largest variance in the data as before. This process continues until there are no more axes. In our case, the second axis would be at a right angle from the blue line (above). You can, at maximum, have as many PC axes as there are columns of data. However, the later axes may not explain significant portions of the underlying data, the process of rotating based upon axes of maximal variation may be able to capture the complete data set with fewer axes than the total set. This is where a technique like this may be helpful in reducing the dimensionality of the data as well as finding the ‘big trends’ that may exist in the data set.

To perform this rotation on the iris data, we use the princomp() function. Here we focus only on the numerical data (of course) as the factor is not something we can do this kind of rotation with.

fit.pca <- princomp(iris[,1:4], cor = TRUE)

Here are the first 8 (out of 50 potential) axes for the arapat data set.

summary(fit.pca)

## Importance of components:
##                           Comp.1    Comp.2     Comp.3      Comp.4
## Standard deviation     1.7083611 0.9560494 0.38308860 0.143926497
## Proportion of Variance 0.7296245 0.2285076 0.03668922 0.005178709
## Cumulative Proportion  0.7296245 0.9581321 0.99482129 1.000000000

This output has two important components to it. First, it shows the axes, in decreasing order of importance and how much of the total variation they describe. The first Comp.1 axis explains 72.9% of the variance, the second explains 22.8%, etc. Second, it shows the cumulative proportion of the variation explained. From the 4 axes we started with, we can explain 95.8% of the variance by using just the first two PC axes.

Where this becomes meaningful for us is in how we can project our original data onto these new coordinate locations and look at the distribution to see if there are any obvious trends, partitions, gradients, etc.

library(ggplot2)
pred <- predict(fit.pca)
df <- data.frame(PC1 = pred[, 1], PC2 = pred[, 2])
df$Species <- iris$Species
ggplot(df) + geom_point(aes(x = PC1, y = PC2, color = Species), size = 3, alpha = 0.75)

We can see from the plot that the the samples are clustered in an obvious way. The designation of ‘Species’ as depicted by the color of the points, shows definite partitions between I. setosa and the remaining species (along the PC1 axis). However, projected onto the PC2 axis, there does not seem to be a partitioning of the samples differentiating the species in an obvious way.

12.4 Hierarchical Clustering

In the previous section, we defined a new coordinate space for all samples in the data set. The rotation of the data was able to describe over 95% of the observed variation using only the first two PC axes. In this section, we are going to use the rotated coordinates to evaluate species-level differences using a hierarchical clustering method. Hierarchical clustering are very helpful in understanding groupings in the data, particularly if there is a ‘nesting’ structure. While there are many ways to do it, they all generally proceed as follows:
1. Define a numeric metric that measured the distances between all K groups.
2. Find the two groups that have the smallest distance and coalesce them together into a pair.
3. Assume that the coalesced pair now constitutes a single entity, estimate the numeric metric among all K-1 groups in the data set.
4. Go to #2 and repeat until you have coalesced all the groups together.

Here again, it is the data that is telling us how it is structured rather than us imposing a model onto the data to see if it fits.

To perform a clustering, we first need to start with a distance matrix based upon the original data.

iris.dist <- dist( iris[,1:4] )

We can then perform the above algorithm using the hclust() function.

h <- hclust( iris.dist )
plot( h )

Which shows a deep separation between groups. Lets see if we can get a bit more interpretive power out of it by changing the labels to show species:

h$labels <- iris$Species 
plot( h , cex=0.5)

You may need to ‘zoom’ that one a bit to understand the labels because I reduced their font size (the cex bit in the plot statement). This is interesting but would be helpful if we could provide a bit of color to the plot to differentiate the Species. Unfortunately, we cannot just add col=Species (for reasons I don’t quite understand) to the plot but there is a way to change the h object into one that can be colored. However, you must first install a new library if you don’t already have it, which I am guessing if this is the first time you’ve done some work using trees in R you have not. Install it in the normal fashion

install.packages("dendextend")

Then we can use it like

library( dendextend )
d <- as.dendrogram( h )
labels_colors( d ) <- as.numeric( iris$Species )
plot( d )

There are some interesting things to notice here.
- The green setosa group seems to be nicely separated from the rest, just like we saw in the PC plot. - There are some individual samples that are a bit mixed, the I. virginica and I. versicolor samples are mixed together not forming a coherent group.

These plot suggest that morphology along may be a good indicator of species differences for I. setosa but not perhaps that good for telling the differences between the I. virginica and I. versicolor samples.