Regression

class: left, middle, inverse
background-image: url("https://live.staticflickr.com/65535/50559539697_1c35d0a56a_o_d.png")
background-size: cover

# .black[Regression Models <svg aria-hidden="true" role="img" viewBox="0 0 512 512" style="height:1em;width:1em;vertical-align:-0.125em;margin-left:auto;margin-right:auto;font-size:inherit;fill:#FDD545;overflow:visible;position:relative;"><path d="M496 384H64V80c0-8.84-7.16-16-16-16H16C7.16 64 0 71.16 0 80v336c0 17.67 14.33 32 32 32h464c8.84 0 16-7.16 16-16v-32c0-8.84-7.16-16-16-16zM464 96H345.94c-21.38 0-32.09 25.85-16.97 40.97l32.4 32.4L288 242.75l-73.37-73.37c-12.5-12.5-32.76-12.5-45.25 0l-68.69 68.69c-6.25 6.25-6.25 16.38 0 22.63l22.62 22.62c6.25 6.25 16.38 6.25 22.63 0L192 237.25l73.37 73.37c12.5 12.5 32.76 12.5 45.25 0l96-96 32.4 32.4c15.12 15.12 40.97 4.41 40.97-16.97V112c.01-8.84-7.15-16-15.99-16z"/></svg> ]

&nbsp;

### .yellow[.fancy[Linear Models & <br>Model Comparison]]

&nbsp;

---
# Linear Models

$$
y = \beta_0 + \beta_1 x_1 + \epsilon 
$$

.pull-left[
The basic linear model has:  
- An intercept ( `$\beta_0$` ),

- A slope coefficient ( `$\beta_1$` ), and

- An error term ( `$\epsilon$` ).
]

.pull-right[
<img src="slides_files/figure-html/unnamed-chunk-1-1.png" width="504" style="display: block; margin: auto;" />
]
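
To make the three terms concrete, here is a minimal sketch that simulates data from this model. The parameter values, the sample size, and the name `sim` are illustrative assumptions, not the data plotted in these slides.

```r
# Simulate data from y = beta0 + beta1 * x + epsilon
# (all numeric values below are assumed for illustration)
set.seed(42)
beta0 <- 17                                      # intercept
beta1 <- 2.5                                     # slope
x <- 1:10
epsilon <- rnorm(length(x), mean = 0, sd = 5)    # error term

sim <- data.frame(x = x, y = beta0 + beta1 * x + epsilon)

# The deterministic part of the model is the line itself;
# each observation sits epsilon away from it.
sim$line <- beta0 + beta1 * sim$x
head(sim)
```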

---

# Building A Model - Random Search

.pull-left[

```r
models <- data.frame( beta0 = runif(250,-20,40),
                      beta1 = runif(250, -5, 5))
summary( models )
```

```
##      beta0             beta1         
##  Min.   :-19.840   Min.   :-4.97599  
##  1st Qu.: -4.481   1st Qu.:-2.60468  
##  Median : 12.735   Median :-0.01890  
##  Mean   : 10.938   Mean   :-0.03569  
##  3rd Qu.: 26.349   3rd Qu.: 2.28999  
##  Max.   : 39.846   Max.   : 4.97482
```
]

.pull-right[
<img src="slides_files/figure-html/unnamed-chunk-3-1.png" width="504" style="display: block; margin: auto;" />

]

---

# Building A Model - Search Criterion

.center[
<img src="slides_files/figure-html/unnamed-chunk-4-1.png" width="504" style="display: block; margin: auto;" />
]

---

# Searching Random Model Space

```r
model_distance <- function( intercept, slope, X, Y ) {
  yhat <- intercept + slope * X
  diff <- Y - yhat
  return( sqrt( mean( diff ^ 2 ) ) )
}
```

```r
models$dist <- NA
for( i in 1:nrow(models) ) {
  models$dist[i] <- model_distance( models$beta0[i],
                                    models$beta1[i],
                                    df$x,
                                    df$y )
}
head( models )
```

```
##       beta0     beta1      dist
## 1 23.557176  4.315708 17.133718
## 2 25.605363 -3.244048 29.759245
## 3 11.118244 -4.251429 48.500099
## 4  4.939457  1.705851 18.361322
## 5  9.418283  3.570657  6.474642
## 6 36.613684 -2.834294 19.691697
```
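
The same search can be written without an explicit loop. The sketch below uses `mapply()` to walk the two coefficient columns in parallel; the data frame `toy` and the table `candidates` are hypothetical stand-ins for `df` and `models`, which are defined elsewhere in the lecture.

```r
# Root mean squared distance between observed Y and the line (intercept, slope)
model_distance <- function(intercept, slope, X, Y) {
  sqrt(mean((Y - (intercept + slope * X))^2))
}

# Hypothetical data and candidate models (assumed for illustration)
set.seed(1)
toy <- data.frame(x = 1:10)
toy$y <- 17 + 2.5 * toy$x + rnorm(10, sd = 5)
candidates <- data.frame(beta0 = runif(250, -20, 40),
                         beta1 = runif(250, -5, 5))

# mapply() pairs up beta0[i] and beta1[i]; MoreArgs holds the fixed data
candidates$dist <- mapply(model_distance,
                          candidates$beta0,
                          candidates$beta1,
                          MoreArgs = list(X = toy$x, Y = toy$y))

# Closest models first
head(candidates[order(candidates$dist), ])
```

The result is identical to the loop version; the choice between them is mostly a matter of taste at this size.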

---

# Top 10 Random Models

.pull-left[
<img src="slides_files/figure-html/unnamed-chunk-7-1.png" width="504" style="display: block; margin: auto;" />
]

.pull-right[

The 10 best models, selected by filtering `models` in the `data =` argument of `geom_abline()`, overlaid on the original points.

```r
ggplot()  + 
  geom_abline( aes(intercept = beta0,
                   slope = beta1, 
                   color = -dist),
               data = filter( models, rank(dist) <= 10 ),
               alpha = 0.5) + 
  geom_point( aes(x,y),
              data=df) 
```
]

---

# The Best Coefficients

```r
ggplot( models, aes(x = beta0, y = beta1, color = -dist)) + 
  geom_point( data = filter( models, rank(dist) <= 10),
              color = "red",
              size = 4) +
  geom_point()
```

---

# Systematic Grid Search

.pull-left[
<img src="slides_files/figure-html/unnamed-chunk-10-1.png" width="504" style="display: block; margin: auto;" />
]

.pull-right[

```r
grid <- expand.grid( beta0 = seq(15,20, length = 25),
                     beta1 = seq(2, 3, length = 25))
grid$dist <- NA
for( i in 1:nrow(grid) ) {
  grid$dist[i] <- model_distance( grid$beta0[i],
                                  grid$beta1[i],
                                  df$x,
                                  df$y )
}

ggplot( grid, aes(x = beta0, 
                  y = beta1,
                  color = -dist)) + 
  geom_point( data = filter( grid, rank(dist) <= 10), 
              color = "red",
              size = 4) +
  geom_point()
```
]
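
A grid can only be as precise as its spacing. As a rough check, the sketch below compares the best grid point with the least-squares coefficients from `lm()`, which minimize the same distance exactly. The data frame `toy`, the helper `rmse()`, and the grid ranges are assumptions made for illustration.

```r
# Hypothetical data (assumed for illustration)
set.seed(1)
toy <- data.frame(x = 1:10)
toy$y <- 17 + 2.5 * toy$x + rnorm(10, sd = 5)

# Distance of a candidate line from the data
rmse <- function(b0, b1) sqrt(mean((toy$y - (b0 + b1 * toy$x))^2))

# Evaluate every combination of intercept and slope on a regular grid
grid <- expand.grid(beta0 = seq(0, 40, length.out = 81),
                    beta1 = seq(0, 5, length.out = 101))
grid$dist <- mapply(rmse, grid$beta0, grid$beta1)

# Best grid point versus the exact least-squares solution
best_grid <- grid[which.min(grid$dist), ]
ols <- coef(lm(y ~ x, data = toy))

best_grid
ols
rmse(ols[1], ols[2])   # never larger than the best grid distance
```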

---
class: inverse, sectionTitle

# .yellow[Our Friend `lm()`]

## .fancy[Linear Models]

---
class: center, middle

![](https://live.staticflickr.com/65535/50588297022_62f043a616_c_d.jpg)

---

# Specifying a Formula

*Single Predictor Model*

```
y ~ x
```

*Multiple Additive Predictors*

```
y ~ x1 + x2 
```

*Interaction Terms*

```
y ~ x1 + x2 + x1:x2
```
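
In R's formula syntax, `:` denotes the interaction alone, while `*` is shorthand for the main effects plus their interaction. A quick way to see which terms a formula expands to is to inspect `terms()`; the variable names here are placeholders and do not need to exist.

```r
# term.labels lists the terms R will actually fit for a formula
additive  <- attr(terms(y ~ x1 + x2), "term.labels")
explicit  <- attr(terms(y ~ x1 + x2 + x1:x2), "term.labels")
shorthand <- attr(terms(y ~ x1 * x2), "term.labels")

additive    # main effects only
explicit    # main effects plus the x1:x2 interaction
shorthand   # same terms as `explicit`
```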

---

# Fitting A Model

```r
fit <- lm( y ~ x, data = df )
fit 
```

```
## 
## Call:
## lm(formula = y ~ x, data = df)
## 
## Coefficients:
## (Intercept)            x  
##      17.280        2.625
```

---

# Model Summaries

```r
summary( fit )
```

```
## 
## Call:
## lm(formula = y ~ x, data = df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -7.9836 -4.0182 -0.8709  5.3064  6.9909 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)   
## (Intercept)   17.280      4.002   4.318  0.00255 **
## x              2.626      0.645   4.070  0.00358 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.859 on 8 degrees of freedom
## Multiple R-squared:  0.6744,	Adjusted R-squared:  0.6337 
## F-statistic: 16.57 on 1 and 8 DF,  p-value: 0.003581
```

---

# Components within the Summary Object

```r
names( summary( fit ) )
```

```
##  [1] "call"          "terms"         "residuals"     "coefficients" 
##  [5] "aliased"       "sigma"         "df"            "r.squared"    
##  [9] "adj.r.squared" "fstatistic"    "cov.unscaled"
```

The overall `$P$`-value is not stored in the summary object, but it can be recovered from `fstatistic`: ask the F-distribution for the upper-tail probability of the test statistic, given the degrees of freedom for both the model and the residuals.

```r
summary( fit )$fstatistic 
```

```
##    value    numdf    dendf 
## 16.56838  1.00000  8.00000
```

```r
get_pval <- function( model ) {
  # fstatistic holds the F value, the model df, and the residual df, in that order
  fstat <- summary( model )$fstatistic
  # upper-tail probability of the F-distribution
  p <- pf( fstat[1], fstat[2], fstat[3], lower.tail = FALSE )
  return( as.numeric( p ) )
}

get_pval( fit )
```

```
## [1] 0.0035813
```

---

# Model Diagnostics - Residuals

```r
plot( fit, which = 1 )
```

---

# Normality of the Residuals

```r
plot( fit, which = 2 )
```

---

# Leverage

```r
plot( fit, which=5 )
```

---

# Decomposition of Variance

The terms in an analysis of variance table are:

- Degrees of Freedom (*df*): `1` degree of freedom for the model (one predictor), and `N-2` for the residuals (`N` observations minus the two estimated coefficients).

- Sums of Squared Deviations: 
    - `$SS_{Total} = \sum_{i=1}^N (y_i - \bar{y})^2$`
    - `$SS_{Model} = \sum_{i=1}^N (\hat{y}_i - \bar{y})^2$`, and 
    - `$SS_{Residual} = SS_{Total} - SS_{Model}$`
    
- Mean Squares (Standardization of the Sums of Squares for the degrees of freedom)  
    - `$MS_{Model} = \frac{SS_{Model}}{df_{Model}}$`
    - `$MS_{Residual} = \frac{SS_{Residual}}{df_{Residual}}$`
    
- The `$F$`-statistic is the ratio of the mean squares, `$F = \frac{MS_{Model}}{MS_{Residual}}$`, and follows a known distribution (the `$F$`-distribution) under the null hypothesis.

- `Pr(>F)` is the probability associated with the value of the `$F$`-statistic and is dependent upon the degrees of freedom for the model and residuals.
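
The pieces listed above can be computed by hand and checked against `anova()`. This sketch uses the built-in `cars` data set as an assumed stand-in, since the lecture's `df` is defined elsewhere; the object names are illustrative.

```r
# Simple regression on a built-in data set (assumed example data)
fit_cars <- lm(dist ~ speed, data = cars)
y    <- cars$dist
yhat <- fitted(fit_cars)
N    <- length(y)

# Sums of squared deviations
ss_total <- sum((y - mean(y))^2)
ss_model <- sum((yhat - mean(y))^2)
ss_resid <- ss_total - ss_model

# Degrees of freedom: one predictor; intercept and slope use up two observations
df_model <- 1
df_resid <- N - 2

# Mean squares, F-statistic, and its upper-tail probability
ms_model <- ss_model / df_model
ms_resid <- ss_resid / df_resid
f_stat   <- ms_model / ms_resid
p_val    <- pf(f_stat, df_model, df_resid, lower.tail = FALSE)

c(F = f_stat, p = p_val)
anova(fit_cars)   # should report the same F value and Pr(>F)
```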

---

# Decomposition of Variance

```r
anova( fit )
```

```
## Analysis of Variance Table
## 
## Response: y
##           Df Sum Sq Mean Sq F value   Pr(>F)   
## x          1 568.67  568.67  16.568 0.003581 **
## Residuals  8 274.58   34.32                    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
```

---

# Variance Explained

```r
summary( fit )
```

---

# Relationship Between `$R^2$` & `$r$`

How much of the variation is explained?

`$$R^2 = \frac{SS_{Model}}{SS_{Total}}$$`

```r
c( `Regression R^2` = summary( fit )$r.squared,
   `Squared Correlation` = as.numeric( cor.test( df$x, df$y )$estimate^2 ) )
```

```
##      Regression R^2 Squared Correlation 
##           0.6743782           0.6743782
```

> For a model with a single predictor, the square of the Pearson correlation is equal to `$R^2$`.

---

## Helper Functions

.pull-left[

Grabbing the predicted values `$\hat{y}$` from the model.

```r
predict( fit ) -> yhat 
yhat
```

```
##        1        2        3        4        5        6        7        8 
## 19.90545 22.53091 25.15636 27.78182 30.40727 33.03273 35.65818 38.28364 
##        9       10 
## 40.90909 43.53455
```

```r
plot( yhat ~ df$x, type='l', bty="n", col="red" )
```

]

.pull-right[
<img src="slides_files/figure-html/unnamed-chunk-25-1.png" width="504" style="display: block; margin: auto;" />
]

---

# Helper Functions - Residuals

.pull-left[

The residuals are the distances between the observed value and its corresponding value on the fitted line.

```r
residuals( fit ) 
```

```
##          1          2          3          4          5          6          7 
## -4.4054545  5.5690909 -2.8563636  4.5181818  0.6927273 -6.2327273  6.1418182 
##          8          9         10 
## -7.9836364  6.9909091 -2.4345455
```
]

.pull-right[
<img src="slides_files/figure-html/unnamed-chunk-27-1.png" width="504" style="display: block; margin: auto;" />

]

---
class: inverse, sectionTitle

# .yellow[Comparing Models]

---

# What Makes One Model Better

There are two parameters that we have already looked at that may help.  These are:

- The `$P$`-value: Models with smaller probabilities could be considered more informative.

- The `$R^2$`: Models that explain more of the variation may be considered more informative.

Let's start by looking at some airquality data we have played with previously when working on [data.frame objects](https://dyerlab.github.io/ENVS-Lectures/r_language/data_frames/homework.nb.html).

```r
airquality %>%
  select( -Month, -Day ) -> df.air
summary( df.air )
```

```
##      Ozone           Solar.R           Wind             Temp      
##  Min.   :  1.00   Min.   :  7.0   Min.   : 1.700   Min.   :56.00  
##  1st Qu.: 18.00   1st Qu.:115.8   1st Qu.: 7.400   1st Qu.:72.00  
##  Median : 31.50   Median :205.0   Median : 9.700   Median :79.00  
##  Mean   : 42.13   Mean   :185.9   Mean   : 9.958   Mean   :77.88  
##  3rd Qu.: 63.25   3rd Qu.:258.8   3rd Qu.:11.500   3rd Qu.:85.00  
##  Max.   :168.00   Max.   :334.0   Max.   :20.700   Max.   :97.00  
##  NA's   :37       NA's   :7
```

---

# Base Models - What Influences Ozone

```r
fit.solar <- lm( Ozone ~ Solar.R, data = df.air )
fit.temp <- lm( Ozone ~ Temp, data = df.air )
fit.wind <- lm( Ozone ~ Wind, data = df.air )
```

---

# More Complicated Models

Multiple Regression Model - Including more than one predictor.

`$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \epsilon$`

```r
fit.temp.wind <- lm( Ozone ~ Temp + Wind, data = df.air )
fit.temp.solar <- lm( Ozone ~ Temp + Solar.R, data = df.air )
fit.wind.solar <- lm( Ozone ~ Wind + Solar.R, data = df.air )
```

<table class=" lightable-classic-2" style='font-family: "Arial Narrow", "Source Sans Pro", sans-serif; margin-left: auto; margin-right: auto;'>
<caption>Model parameters predicting mean ozone in parts per billion measured in New York during the period of 1 May 1973 - 30 September 1973.</caption>
 <thead>
  <tr>
   <th style="text-align:left;"> Model </th>
   <th style="text-align:left;"> R2 </th>
   <th style="text-align:left;"> P </th>
  </tr>
 </thead>
<tbody>
  <tr>
   <td style="text-align:left;"> Ozone ~ Solar </td>
   <td style="text-align:left;"> 0.121 </td>
   <td style="text-align:left;"> 1.79e-04 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> Ozone ~ Temp </td>
   <td style="text-align:left;"> 0.488 </td>
   <td style="text-align:left;"> 0.00e+00 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> Ozone ~ Wind </td>
   <td style="text-align:left;"> 0.362 </td>
   <td style="text-align:left;"> 9.27e-13 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> Ozone ~ Temp + Wind </td>
   <td style="text-align:left;"> 0.569 </td>
   <td style="text-align:left;"> 0.00e+00 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> Ozone ~ Temp + Solar </td>
   <td style="text-align:left;"> 0.510 </td>
   <td style="text-align:left;"> 0.00e+00 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> Ozone ~ Wind + Solar </td>
   <td style="text-align:left;"> 0.449 </td>
   <td style="text-align:left;"> 9.99e-15 </td>
  </tr>
</tbody>
</table>

---

# For Completeness

How about all the predictors? `$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \epsilon$`

```r
fit.all <- lm( Ozone ~ Solar.R + Temp + Wind, data = df.air )
```

<table class=" lightable-paper lightable-striped" style='font-family: "Arial Narrow", arial, helvetica, sans-serif; width: auto !important; margin-left: auto; margin-right: auto;'>
 <thead>
  <tr>
   <th style="text-align:left;"> Model </th>
   <th style="text-align:left;"> R2 </th>
   <th style="text-align:left;"> P </th>
  </tr>
 </thead>
<tbody>
  <tr>
   <td style="text-align:left;"> Ozone ~ Solar </td>
   <td style="text-align:left;"> <span style="     color: black !important;">1.21e-01</span> </td>
   <td style="text-align:left;"> <span style="     color: black !important;">1.79e-04</span> </td>
  </tr>
  <tr>
   <td style="text-align:left;"> Ozone ~ Temp </td>
   <td style="text-align:left;"> <span style="     color: black !important;">4.88e-01</span> </td>
   <td style="text-align:left;"> <span style="     color: red !important;">0.00e+00</span> </td>
  </tr>
  <tr>
   <td style="text-align:left;"> Ozone ~ Wind </td>
   <td style="text-align:left;"> <span style="     color: black !important;">3.62e-01</span> </td>
   <td style="text-align:left;"> <span style="     color: black !important;">9.27e-13</span> </td>
  </tr>
  <tr>
   <td style="text-align:left;"> Ozone ~ Temp + Wind </td>
   <td style="text-align:left;"> <span style="     color: black !important;">5.69e-01</span> </td>
   <td style="text-align:left;"> <span style="     color: red !important;">0.00e+00</span> </td>
  </tr>
  <tr>
   <td style="text-align:left;"> Ozone ~ Temp + Solar </td>
   <td style="text-align:left;"> <span style="     color: black !important;">5.10e-01</span> </td>
   <td style="text-align:left;"> <span style="     color: red !important;">0.00e+00</span> </td>
  </tr>
  <tr>
   <td style="text-align:left;"> Ozone ~ Wind + Solar </td>
   <td style="text-align:left;"> <span style="     color: black !important;">4.49e-01</span> </td>
   <td style="text-align:left;"> <span style="     color: black !important;">9.99e-15</span> </td>
  </tr>
  <tr>
   <td style="text-align:left;"> Ozone ~ Temp + Wind + Solar </td>
   <td style="text-align:left;"> <span style="     color: green !important;">6.06e-01</span> </td>
   <td style="text-align:left;"> <span style="     color: red !important;">0.00e+00</span> </td>
  </tr>
</tbody>
</table>

---

## `$R^2$` Inflation

Any variable added to a model will be able to generate *Sums of Squares* (even if it is a small amount).  So, `adding variables may artificially inflate the Model Sums of Squares`.

Example:

> What happens if I add just random data to the regression models?  How does `$R^2$` change?
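
One way to explore that question is sketched below: pure-noise predictors are appended to the full ozone model one at a time and `$R^2$` is recorded after each addition. The seed, the number of noise columns, and the names `air` and `rand1`, `rand2`, ... are assumptions for illustration; rows with missing values are dropped first so that every model is fit to the same observations.

```r
# Complete cases only, so all models share the same rows
air <- na.omit(airquality[, c("Ozone", "Solar.R", "Wind", "Temp")])

set.seed(2020)
n_random <- 5
r2 <- numeric(n_random + 1)

# Start from the model with the three real predictors
r2[1] <- summary(lm(Ozone ~ ., data = air))$r.squared

for (i in seq_len(n_random)) {
  air[[paste0("rand", i)]] <- rnorm(nrow(air))               # pure noise predictor
  r2[i + 1] <- summary(lm(Ozone ~ ., data = air))$r.squared  # refit with all columns
}

names(r2) <- paste0("+", 0:n_random, " random")
round(r2, 3)
```

Because each model contains the previous one, `$R^2$` can only stay the same or increase, even though the added columns carry no information about ozone.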

---

# Random Data Effects

.pull-left[

]

.pull-right[

]

---

# Perfect - My Models RULE

#### I can just add **random** variables to my model and always get an .redinline[awesome] fit!

.center[
<iframe src="https://giphy.com/embed/7ymcoEE72hEf6" width="480" height="225" frameBorder="0" class="giphy-embed" allowFullScreen></iframe>
]

.orangeinline[Not so fast, Beavis.]

---

# Model Comparisons

The Akaike Information Criterion (AIC) is a measurement that allows us to compare models while penalizing for adding new parameters.

`$AIC = -2 \ln L + 2p$`

Here `$L$` is the maximized likelihood of the model and `$p$` is the number of estimated parameters. The criterion here is to find the models with the lowest AIC values.

## Comparisons

To compare, we evaluate the differences in AIC for alternative models.

`$\Delta AIC = AIC - \min( AIC )$`
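
As a minimal sketch, the AIC values and their differences from the smallest AIC can be computed with the base R functions `AIC()` and `logLik()`. The three models and the object names are chosen for illustration; rows with missing values are dropped first (an assumption) so that all models are fit to the same observations and their AIC values are comparable.

```r
# Complete cases only, so the likelihoods refer to the same data
air <- na.omit(airquality[, c("Ozone", "Solar.R", "Wind", "Temp")])

fits <- list(
  "Ozone ~ Temp"                  = lm(Ozone ~ Temp, data = air),
  "Ozone ~ Temp + Wind"           = lm(Ozone ~ Temp + Wind, data = air),
  "Ozone ~ Temp + Wind + Solar.R" = lm(Ozone ~ Temp + Wind + Solar.R, data = air)
)

# AIC for each model and its distance from the best (lowest) value
aic   <- sapply(fits, AIC)
delta <- aic - min(aic)

# Check AIC() against -2 ln L + 2p for the largest model;
# for lm objects p counts the coefficients plus the residual variance
ll     <- logLik(fits[[3]])
manual <- -2 * as.numeric(ll) + 2 * attr(ll, "df")

data.frame(AIC = aic, deltaAIC = delta)
```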

---

# AIC & ΔAIC

.pull-left[
<table class=" lightable-paper lightable-striped" style='font-family: "Arial Narrow", arial, helvetica, sans-serif; width: auto !important; margin-left: auto; margin-right: auto;'>
 <thead>
  <tr>
   <th style="text-align:left;"> Models </th>
   <th style="text-align:right;"> R2 </th>
   <th style="text-align:right;"> AIC </th>
   <th style="text-align:right;"> deltaAIC </th>
  </tr>
 </thead>
<tbody>
  <tr>
   <td style="text-align:left;"> Ozone ~ Temp </td>
   <td style="text-align:right;"> 0.488 </td>
   <td style="text-align:right;"> 1067.706 </td>
   <td style="text-align:right;"> 68.989 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> Ozone ~ Wind </td>
   <td style="text-align:right;"> 0.362 </td>
   <td style="text-align:right;"> 1093.187 </td>
   <td style="text-align:right;"> 94.470 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> Ozone ~ Solar </td>
   <td style="text-align:right;"> 0.121 </td>
   <td style="text-align:right;"> 1083.714 </td>
   <td style="text-align:right;"> 84.997 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> Ozone ~ Temp + Wind </td>
   <td style="text-align:right;"> 0.569 </td>
   <td style="text-align:right;"> 1049.741 </td>
   <td style="text-align:right;"> 51.024 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> Ozone ~ Temp + Solar </td>
   <td style="text-align:right;"> 0.510 </td>
   <td style="text-align:right;"> 1020.820 </td>
   <td style="text-align:right;"> 22.103 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> Ozone ~ Wind + Solar </td>
   <td style="text-align:right;"> 0.449 </td>
   <td style="text-align:right;"> 1033.816 </td>
   <td style="text-align:right;"> 35.098 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> Ozone ~ Temp + Wind + Solar </td>
   <td style="text-align:right;"> 0.606 </td>
   <td style="text-align:right;"> 998.717 </td>
   <td style="text-align:right;"> 0.000 </td>
  </tr>
</tbody>
</table>
]

.pull-right[
<table class=" lightable-paper lightable-striped" style='font-family: "Arial Narrow", arial, helvetica, sans-serif; width: auto !important; margin-left: auto; margin-right: auto;'>
 <thead>
  <tr>
   <th style="text-align:left;"> Models </th>
   <th style="text-align:right;"> R2 </th>
   <th style="text-align:right;"> AIC </th>
   <th style="text-align:right;"> deltaAIC </th>
  </tr>
 </thead>
<tbody>
  <tr>
   <td style="text-align:left;"> Ozone ~ Temp + Wind + Solar + 1 Random Variable </td>
   <td style="text-align:right;"> 0.606 </td>
   <td style="text-align:right;"> 1000.701 </td>
   <td style="text-align:right;"> 1.983 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> Ozone ~ Temp + Wind + Solar + 2 Random Variables </td>
   <td style="text-align:right;"> 0.618 </td>
   <td style="text-align:right;"> 999.382 </td>
   <td style="text-align:right;"> 0.665 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> Ozone ~ Temp + Wind + Solar + 3 Random Variables </td>
   <td style="text-align:right;"> 0.618 </td>
   <td style="text-align:right;"> 1001.151 </td>
   <td style="text-align:right;"> 2.434 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> Ozone ~ Temp + Wind + Solar + 4 Random Variables </td>
   <td style="text-align:right;"> 0.620 </td>
   <td style="text-align:right;"> 1002.593 </td>
   <td style="text-align:right;"> 3.876 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> Ozone ~ Temp + Wind + Solar + 5 Random Variables </td>
   <td style="text-align:right;"> 0.624 </td>
   <td style="text-align:right;"> 1003.503 </td>
   <td style="text-align:right;"> 4.785 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> Ozone ~ Temp + Wind + Solar + 6 Random Variables </td>
   <td style="text-align:right;"> 0.628 </td>
   <td style="text-align:right;"> 1004.413 </td>
   <td style="text-align:right;"> 5.696 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> Ozone ~ Temp + Wind + Solar + 7 Random Variables </td>
   <td style="text-align:right;"> 0.633 </td>
   <td style="text-align:right;"> 1004.822 </td>
   <td style="text-align:right;"> 6.105 </td>
  </tr>
</tbody>
</table>
]

---
class: inverse, sectionTitle

# .yellow[Stepwise Regression]

---
# Fitting Several Features

What if we have 10 predictor variables and are interested in fitting the `best` model des

.pull-left[
### Ap
]

.pull-right[

]
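
One way such a search might be automated is with `step()` from base R's stats package, which adds or drops terms according to AIC. The sketch below is an illustration under assumptions, not necessarily the approach this lecture goes on to use: it starts from the full ozone model plus one hypothetical noise column (`noise`) and works backward.

```r
# Complete cases plus one assumed pure-noise predictor
air <- na.omit(airquality[, c("Ozone", "Solar.R", "Wind", "Temp")])
set.seed(7)
air$noise <- rnorm(nrow(air))

# Start with every available predictor
full <- lm(Ozone ~ ., data = air)

# Backward elimination: drop a term only if doing so lowers AIC
chosen <- step(full, direction = "backward", trace = 0)

formula(chosen)
c(full = AIC(full), chosen = AIC(chosen))
```

Stepwise selection is convenient but can settle on different models depending on the starting point and direction, so the result is best treated as a candidate rather than a final answer.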

---

class: middle
background-position: right
background-size: auto

.center[

# Questions?

![Peter Sellers](https://live.staticflickr.com/65535/50382906427_2845eb1861_o_d.gif+)
]

.bottom[ If you have any questions about the content presented herein, please feel free to [submit them to me](mailto://rjdyer@vcu.edu) and I'll get back to you as soon as possible.]