class: left, middle, inverse
background-image: url("https://live.staticflickr.com/65535/50559539697_1c35d0a56a_o_d.png")
background-size: cover

# .black[Models of Correlation]

### .yellow-inline[.fancy[How do we know? 🤔]]

---
class: middle, center

> ## Correlation is a test of whether two variables change in a coordinated fashion. However, this does not imply that they are functionally linked or causal in nature.

---

# Example Data

```r
df <- data.frame( Year = 1999:2009 )
df$`Nicolas Cage Movies` <- c( 2, 2, 2, 3, 1, 1, 2, 3, 4, 1, 4 )
df$`Drowning Deaths in Pools` <- c( 109, 102, 102, 98, 85, 95, 96, 98, 123, 94, 102 )
head( df )
```

```
##   Year Nicolas Cage Movies Drowning Deaths in Pools
## 1 1999                   2                      109
## 2 2000                   2                      102
## 3 2001                   2                      102
## 4 2002                   3                       98
## 5 2003                   1                       85
## 6 2004                   1                       95
```

---

```r
library( reshape2 )
df %>%
  melt( id = "Year" ) %>%
  ggplot( aes( Year, value, color = variable ) ) + 
  geom_line() + 
  geom_point() + 
  theme( legend.position = "top", legend.title = element_blank() )
```

<img src="slides_files/figure-html/unnamed-chunk-2-1.png" width="504" style="display: block; margin: auto;" />

---

```r
df %>%
  ggplot( aes( `Nicolas Cage Movies`, `Drowning Deaths in Pools` ) ) + 
  geom_point() + 
  stat_smooth( formula = y ~ x, method = 'lm', se = FALSE, color = "red", size = 0.5 ) + 
  geom_text( aes( x = 1.5, y = 115, 
                  label = paste( "Correlation = ", 
                                 format( cor( df[,2], df[,3] ), digits = 3 ) ) ), 
             size = 6 )
```

<img src="slides_files/figure-html/unnamed-chunk-3-1.png" width="504" style="display: block; margin: auto;" />

---

# The Correlation Test

```r
cor.test( df$`Nicolas Cage Movies`, df$`Drowning Deaths in Pools` )
```

```
## 
##  Pearson's product-moment correlation
## 
## data:  df$`Nicolas Cage Movies` and df$`Drowning Deaths in Pools`
## t = 2.6785, df = 9, p-value = 0.02527
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.1101273 0.9045101
## sample estimates:
##       cor 
## 0.6660043
```
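---

# Correlation, By Hand

The estimate reported above is nothing more than the covariance of the two variables standardized by the product of their standard deviations. As a minimal sketch (base R only, reusing the `df` defined above), we can reproduce it ourselves:

```r
x <- df$`Nicolas Cage Movies`
y <- df$`Drowning Deaths in Pools`

# the standardized covariance; this matches the estimate
# returned by cor() and cor.test()
cov( x, y ) / ( sd( x ) * sd( y ) )
```

Standardizing by the standard deviations makes the estimate unitless and bounds it to the range `\(-1 \le \rho \le +1\)`; the `\(N-1\)` terms in the covariance and standard deviations cancel, so sample size never appears in the estimate itself.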
---
class: inverse, middle
background-image: linear-gradient(to bottom, rgba(8,45,85,0.9) 0%, rgba(8,45,85,0.9) 100%),url("https://live.staticflickr.com/65535/50583295071_4caf054046_c_d.jpg")
background-size: cover

# .yellow[New Data Set]

## .yellow[<svg style="height:0.8em;top:.04em;position:relative;fill:#FDD545;" viewBox="0 0 448 512"><path d="M368 96h-48V56c0-13.255-10.745-24-24-24H24C10.745 32 0 42.745 0 56v400c0 13.255 10.745 24 24 24h272c13.255 0 24-10.745 24-24v-42.11l80.606-35.977C429.396 365.063 448 336.388 448 304.86V176c0-44.112-35.888-80-80-80zm16 208.86a16.018 16.018 0 0 1-9.479 14.611L320 343.805V160h48c8.822 0 16 7.178 16 16v128.86zM208 384c-8.836 0-16-7.164-16-16V144c0-8.836 7.164-16 16-16s16 7.164 16 16v224c0 8.836-7.164 16-16 16zm-96 0c-8.836 0-16-7.164-16-16V144c0-8.836 7.164-16 16-16s16 7.164 16 16v224c0 8.836-7.164 16-16 16z"/></svg>]

---

# Beer Judge Certification Program

.pull-left[

### Recognized Beer Styles

- 100 Distinct Styles (not just IPAs & that yellow American Corn Lager!)
- Global & Regional Styles
- Quantitative Characteristics
  - IBU, SRM, ABV, OG, FG
- Qualitative Characteristics
  - Overall Impression, Aroma, Appearance, Flavor, & Mouthfeel

]

.pull-right[.center[![](https://live.staticflickr.com/65535/50570218057_8a1887e597_w_d.jpg)]]

The [BJCP Style Guidelines](https://bjcp.org/stylecenter.php) exist for beer, mead, and cider.

---

# Basic Yeast Types

.center[ ![](https://live.staticflickr.com/65535/50569334528_f5152c320b_c_d.jpg) ]

--

Not including sour beers, which use yeast and bacteria mixtures.

---

# Dissolved Sugars - OG & FG

.pull-left[The more sugar in the wort, the more food for the yeast to work on, and the more alcohol that may be produced. The difference between the specific gravities before (OG, original gravity) and after (FG, final gravity) fermentation can be used to estimate ABV.]

.pull-right[ ![](https://live.staticflickr.com/65535/50570081896_82bacc922e_w_d.jpg) ]

---

# Bitterness

Bitterness is created by the addition of hops and is quantified on the International Bitterness Units (IBU) scale.

.center[![](https://live.staticflickr.com/65535/50569334583_3f3bbae387_w_d.jpg)]

---

# Color

The color of the beer is quantified using the Standard Reference Method (SRM) scale.

.center[![](https://live.staticflickr.com/65535/50570214697_6a57066d3b_w_d.jpg)]

---

# The Data

Here are the raw characteristic data for the different styles.

```r
beer_url <- "https://raw.githubusercontent.com/dyerlab/ENVS-Lectures/master/data/Beer_Styles.csv"
read_csv( beer_url ) %>%
  mutate( Yeast = factor( Yeast ) ) -> beer
summary( beer )
```

```
##     Styles             Yeast       ABV_Min         ABV_Max          IBU_Min         IBU_Max          SRM_Min          SRM_Max     
##  Length:100         Ale   :69   Min.   :2.400   Min.   : 3.200   Min.   : 0.00   Min.   :  8.00   Min.   : 2.00   Min.   : 3.00  
##  Class :character   Either: 4   1st Qu.:4.200   1st Qu.: 5.475   1st Qu.:15.00   1st Qu.: 25.00   1st Qu.: 3.50   1st Qu.: 7.00  
##  Mode  :character   Lager :27   Median :4.600   Median : 6.000   Median :20.00   Median : 35.00   Median : 8.00   Median :17.00  
##                                 Mean   :4.947   Mean   : 6.768   Mean   :21.97   Mean   : 38.98   Mean   : 9.82   Mean   :17.76  
##                                 3rd Qu.:5.500   3rd Qu.: 8.000   3rd Qu.:25.00   3rd Qu.: 45.00   3rd Qu.:14.00   3rd Qu.:22.00  
##                                 Max.   :9.000   Max.   :14.000   Max.   :60.00   Max.   :120.00   Max.   :30.00   Max.   :40.00  
##      OG_Min          OG_Max          FG_Min          FG_Max     
##  Min.   :1.026   Min.   :1.032   Min.   :0.998   Min.   :1.006  
##  1st Qu.:1.040   1st Qu.:1.052   1st Qu.:1.008   1st Qu.:1.012  
##  Median :1.046   Median :1.060   Median :1.010   Median :1.015  
##  Mean   :1.049   Mean   :1.065   Mean   :1.009   Mean   :1.016  
##  3rd Qu.:1.056   3rd Qu.:1.075   3rd Qu.:1.010   3rd Qu.:1.018  
##  Max.   :1.080   Max.   :1.130   Max.   :1.020   Max.   :1.040
```

---
class: middle, center

# .orange[Field Trip!]

---
class: inverse, sectionTitle

# .yellow[Specifics About Statistics]

---

# Parameters & Estimators

.pull-left[

### Parameters

The true values determined by the **entire** population of entities.

- Mean of the `real` population, `\(\mu\)`.
- Variance of the `real` population, `\(\sigma^2\)`.

]

--

.pull-right[

### Estimators

The values we obtain by **sampling** from the much larger population in order to draw inferences about it.

- Sample mean, `\(\bar{x}\)`.
- Sample variance, `\(s^2\)`.

]

---

# Parametric Assumptions

Much of the way we determine the significance of a model is based upon *assumptions* about the underlying data.

- Normality
- Independence
- Homoscedasticity

---

# Normality

.pull-left[

The data, or derivatives thereof, are assumed to be adequately described by a normal distribution, `\(N(\mu,\sigma)\)`, defined as:

$$f(x) = \frac{1}{\sigma\sqrt{2\pi}}e^{-\frac{1}{2}\left(\frac{x - \mu}{\sigma}\right)^2}$$

where `\(\mu\)` and `\(\sigma\)` are the true values of the underlying mean and standard deviation.

]

.pull-right[

<img src="slides_files/figure-html/unnamed-chunk-6-1.png" width="504" style="display: block; margin: auto;" />

]
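---

# The Normal Distribution in R

We rarely need to code the density above by hand; base R provides it directly. A minimal sketch (the means, standard deviations, and cutoff values below are chosen purely for illustration):

```r
# density of a standard normal, N(0,1), evaluated at a few points
dnorm( c( -2, 0, 2 ), mean = 0, sd = 1 )

# cumulative probability P( X <= 1.96 ) under the standard normal
pnorm( 1.96 )

# plot the shape of the density across +/- 4 standard deviations
curve( dnorm( x, mean = 0, sd = 1 ), from = -4, to = 4 )
```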
---

# Testing for Normality - Visualization

```r
qqnorm( beer$ABV_Min )
qqline( beer$ABV_Min, col = "red" )
```

<img src="slides_files/figure-html/unnamed-chunk-7-1.png" width="504" style="display: block; margin: auto;" />

---

# Statistical Tests for Normality - The Shapiro-Wilk Test

The null hypothesis `\(H_O: Data\;is\;normal\)` is evaluated by estimating the statistic `\(W\)`, defined as:

`\(W = \frac{\left(\sum_{i=1}^Na_ix_{(i)}\right)^2}{\sum_{i=1}^N(x_i - \bar{x})^2}\)`

where `\(N\)` is the number of samples, `\(a_i\)` is a standardizing coefficient derived from the expected order statistics of the normal distribution, `\(x_{(i)}\)` is the `\(i^{th}\)` smallest value in the sample (the `\(i^{th}\)` order statistic), and `\(\bar{x}\)` is the mean of the observed values.

--

```r
shapiro.test( beer$ABV_Min )
```

```
## 
##  Shapiro-Wilk normality test
## 
## data:  beer$ABV_Min
## W = 0.94595, p-value = 0.0004532
```

---

# Data Transformations - ArcSin Square Root

Fractions and percentages are known to behave poorly, particularly near the boundaries (e.g., close to 0 or 1). It is not uncommon to apply an arcsine square root transformation, `\(\tilde{x} = \sin^{-1}\left(\sqrt{x}\right)\)`, to help normalize fractional data.

```r
abv <- beer$ABV_Min / 100.0
asin( sqrt( abv ) ) -> abv.1
shapiro.test( abv.1 )
```

```
## 
##  Shapiro-Wilk normality test
## 
## data:  abv.1
## W = 0.96746, p-value = 0.01418
```

--

`Conclusion? Are these data normal?`

---

# Data Transformations - Box Cox

\[
\tilde{x} = \frac{x^\lambda - 1}{\lambda}
\]

where `\(\lambda\)` is a tuning parameter chosen to make the transformed data as close to normal as possible. As `\(\lambda \rightarrow 0\)`, the transformation converges to `\(\tilde{x} = \ln(x)\)`.

---

# Data Transformations - Box Cox

```r
test_boxcox <- function( x, lambdas = seq(-1.1, 1.1, by = 0.015) ) { 
  ret <- data.frame( Lambda = lambdas,
                     W = NA,
                     P = NA )
  
  # transform the data for each candidate lambda and record the
  # Shapiro-Wilk statistic and p-value for the transformed values
  for( lambda in lambdas ) {
    x.tilde <- ( x^lambda - 1 ) / lambda
    w <- shapiro.test( x.tilde )
    ret$W[ ret$Lambda == lambda ] <- w$statistic
    ret$P[ ret$Lambda == lambda ] <- w$p.value
  }
  
  return( ret )
}
```

--

```r
vals <- test_boxcox( beer$ABV_Min )
summary( vals )
```

```
##      Lambda              W                P            
##  Min.   :-1.1000   Min.   :0.9157   Min.   :8.410e-06  
##  1st Qu.:-0.5525   1st Qu.:0.9481   1st Qu.:6.228e-04  
##  Median :-0.0050   Median :0.9621   Median :5.666e-03  
##  Mean   :-0.0050   Mean   :0.9576   Mean   :1.343e-02  
##  3rd Qu.: 0.5425   3rd Qu.:0.9708   3rd Qu.:2.548e-02  
##  Max.   : 1.0900   Max.   :0.9738   Max.   :4.352e-02
```

---

# Data Transformations - Box Cox

.pull-left[

<img src="slides_files/figure-html/unnamed-chunk-12-1.png" width="504" style="display: block; margin: auto;" />

]

.pull-right[

```r
vals[ which( vals$P == max( vals$P ) ), ]
```

```
##    Lambda        W          P
## 82  0.115 0.973805 0.04351988
```

]

---

# Equality of Variance

It is assumed that the variance of the data does not change across its range (or among groups), a condition referred to as *homoscedasticity*.

<img src="slides_files/figure-html/unnamed-chunk-14-1.png" width="504" style="display: block; margin: auto;" />

---

# Independence of Data

.pull-left[How you collect your samples and design your experiments is critical for ensuring that your observations are independent of one another. Think about this very carefully as you design your own experiments.]

.pull-right[![](https://live.staticflickr.com/65535/50582673933_4db6638924_w_d.jpg)]
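---

# Testing Equality of Variance

Like normality, homoscedasticity (from two slides back) can be evaluated with a formal test rather than by eye. As a minimal sketch, base R's `bartlett.test()` evaluates the null hypothesis that group variances are equal, here grouping the minimum ABV values by the `Yeast` factor already in the beer data:

```r
# H_O: the variance of ABV_Min is the same across yeast types
bartlett.test( ABV_Min ~ Yeast, data = beer )
```

Note that Bartlett's test is itself sensitive to non-normality, so interpret it alongside the normality diagnostics above.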
---
class: inverse, sectionTitle

# .yellow[Correlation Models]

---

# Parametric Correlation

### The Pearson Product-Moment Correlation

`\(\rho = \frac{\sum_{i=1}^N(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^N(x_i - \bar{x})^2}\sqrt{\sum_{i=1}^N(y_i - \bar{y})^2}}\)`

whose values are confined to the range `\(-1.0 \le \rho \le +1.0\)`.

--

Its significance is tested using a variant of the `t`-test, with `\(N-2\)` degrees of freedom:

`\(t = r\sqrt{\frac{N-2}{1-r^2}}\)`

---

# Visual Examples

.center[ ![*Figure 1: Data and associated correlation statistics.*](https://live.staticflickr.com/65535/50569436828_a110515a21_c_d.jpg) ]

---

# The Correlation Test

```r
cor.test( beer$OG_Max, beer$FG_Max ) -> OG.FG.pearson
OG.FG.pearson
```

```
## 
##  Pearson's product-moment correlation
## 
## data:  beer$OG_Max and beer$FG_Max
## t = 15.168, df = 98, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.7671910 0.8878064
## sample estimates:
##       cor 
## 0.8374184
```

--

Note that the returned object contains all of the components of the analysis.

```r
names( OG.FG.pearson )
```

```
## [1] "statistic"   "parameter"   "p.value"     "estimate"    "null.value"  "alternative" "method"      "data.name"   "conf.int"
```

---

# Non-Parametric Correlation - Spearman's Rho

To relax some of the underlying parametric assumptions, we can use the ranks of the data instead of the raw data directly.

`\(\rho_{Spearman} = \frac{ \sum_{i=1}^N(R_{x_i} - \bar{R_{x}})(R_{y_i} - \bar{R_{y}})}{\sqrt{\sum_{i=1}^N(R_{x_i} - \bar{R_{x}})^2}\sqrt{\sum_{i=1}^N(R_{y_i} - \bar{R_{y}})^2}}\)`

--

```r
OG.FG.spearman <- cor.test( beer$OG_Max, beer$FG_Max, method = "spearman" )
OG.FG.spearman
```

```
## 
##  Spearman's rank correlation rho
## 
## data:  beer$OG_Max and beer$FG_Max
## S = 39257, p-value < 2.2e-16
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
##       rho 
## 0.7644328
```

---

# Permutation for Significance

Another way to circumvent constraints imposed by the form of the data is to use permutation to test for significance under the null hypothesis:

`\(H_O: \rho = 0\)`

---

# Setting up Permutations

```r
x <- beer$OG_Max
y <- beer$FG_Max

# note: this overwrites the Nicolas Cage example data frame from earlier
df <- data.frame( Estimate = factor( c( "Original", rep( "Permuted", 999 ) ) ), 
                  rho = c( cor.test( x, y )$estimate, rep( NA, 999 ) ) )
summary( df )
```

```
##      Estimate        rho        
##  Original:  1   Min.   :0.8374  
##  Permuted:999   1st Qu.:0.8374  
##                 Median :0.8374  
##                 Mean   :0.8374  
##                 3rd Qu.:0.8374  
##                 Max.   :0.8374  
##                 NA's   :999
```

---

# Permuting "Under the Null"

Now, we can go through the 999 `NA` values we put into that data frame and:

1. Permute one of the variables,
2. Run the analysis, and
3. Store the resulting statistic.

```r
for( i in 2:1000 ) {
  yhat <- sample( y,                  # this shuffles the data in y
                  size = length( y ),
                  replace = FALSE )
  model <- cor.test( x, yhat )
  df$rho[i] <- model$estimate 
}
summary( df )
```

```
##      Estimate        rho            
##  Original:  1   Min.   :-0.2892758  
##  Permuted:999   1st Qu.:-0.0730548  
##                 Median :-0.0073230  
##                 Mean   : 0.0002596  
##                 3rd Qu.: 0.0670440  
##                 Max.   : 0.8374184
```

---

# Visualizing the NULL

```r
ggplot( df ) + 
  geom_histogram( aes( rho, fill = Estimate ) )
```

<img src="slides_files/figure-html/unnamed-chunk-20-1.png" width="504" style="display: block; margin: auto;" />
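---

# The Permutation P-Value

The observed estimate clearly sits far outside the permuted distribution. As a minimal sketch, the permutation p-value is simply the fraction of all estimates (the original included) that are at least as large as the observed one:

```r
rho.observed <- df$rho[ df$Estimate == "Original" ]

# one-tailed permutation p-value; take absolute values on
# both sides for a two-tailed version
mean( df$rho >= rho.observed )
```

Including the original estimate in the numerator and denominator keeps the p-value from ever being exactly zero, which is appropriate since we only examined 1,000 of the possible orderings.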
---
class: middle
background-image: url("images/contour.png")
background-position: right
background-size: auto

.center[

# Questions?

![Peter Sellers](https://live.staticflickr.com/65535/50382906427_2845eb1861_o_d.gif)

]

<p> </p>

.bottom[ If you have any questions about the content presented herein, please feel free to [submit them to me](mailto:rjdyer@vcu.edu) and I'll get back to you as soon as possible. ]