class: left, middle, inverse background-image: url("https://live.staticflickr.com/65535/50559539697_1c35d0a56a_o_d.png") background-size: cover # .orange[Analysis of Variance ] ### .orange[.fancy[Testing Differences of Means]
] --- # Testing for Specific Values In general, many times we are interested in testing to see if some samples have a particular value. Examples may include: - `\(H_O: \mu = c\)` The mean of the data is equal to a specific value. - `\(H_O: \mu_1 = \mu_2\)` These two samples have the same mean. - `\(H_O: \mu_1 = \mu_2 = \mu_3 ... \mu_k\)`: The mean of all `\(k\)` values are all the same. --- # Testing a Single Sample Mean: 1-Sample Tests The 1-sample `t.test()` defined by the null hypothesis `\(H_O: \mu = c\)` is where we are testing to see if this sample `\(t =\frac{\bar{x}-\mu}{s_{\bar{x}}}\)` --- # Degrees of Freedom & Rejection Regions .pull-left[ <img src="slides_files/figure-html/unnamed-chunk-1-1.png" width="504" style="display: block; margin: auto;" /> ] -- .pull-right[ ### Asymptotic Nature of Samples As `\(df \to \infty\)` then `\(PDF(t) \to Normal\)` Which is defined (in such elegance) as: `\(P(t|x,v)= \frac{ \Gamma\left( \frac{v+1}{2}\right)}{\sqrt{v\pi}\Gamma\left( \frac{v}{2}\right)} \left( 1 + \frac{x^2}{v}\right)^{-\frac{v+1}{2}}\)` ] --- # Rejection Regions .pull-left[ ### The Rejection Of `\(H_O\)` Two Tailed Test: `\(H_A: \mu \ne c\)` One Tailed Test: `\(H_A: \mu \lt c\)` One Tailed Test: `\(H_A: \mu \gt c\)` ] .pull-right[ <img src="slides_files/figure-html/unnamed-chunk-2-1.png" width="504" style="display: block; margin: auto;" /> ] --- class: inverse, sectionTitle # .yellow[One-Sample Tests] --- # The Iris Data (Again) .pull-left[ ```r iris %>% group_by( Species ) %>% summarize( SL.m = mean( Sepal.Length), SL.sd = sd( Sepal.Length ) ) %>% mutate( `Sepal Length` = paste( SL.m, "+/-", format( SL.sd, digits=3) , sep=" " ) ) %>% mutate( Species = paste( "I.",Species) ) %>% select( Species, `Sepal Length` ) %>% kable( digits=3) ``` <table> <thead> <tr> <th style="text-align:left;"> Species </th> <th style="text-align:left;"> Sepal Length </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> I. setosa </td> <td style="text-align:left;"> 5.006 +/- 0.352 </td> </tr> <tr> <td style="text-align:left;"> I. versicolor </td> <td style="text-align:left;"> 5.936 +/- 0.516 </td> </tr> <tr> <td style="text-align:left;"> I. virginica </td> <td style="text-align:left;"> 6.588 +/- 0.636 </td> </tr> </tbody> </table> ] .pull-right[ .center[ ![Iris](https://live.staticflickr.com/65535/50609360207_92d21b056f_w_d.jpg) ] ] --- # Iris Example: 1-Sample Test Setosa .pull-left[ `\(H_O: Iris\;setosa\;Sepal\;Length = 6.0\)` ```r iris %>% filter( Species == "setosa" ) %>% select( Sepal.Length ) %>% t.test( mu=6 ) ``` ``` ## ## One Sample t-test ## ## data: . ## t = -19.94, df = 49, p-value < 2.2e-16 ## alternative hypothesis: true mean is not equal to 6 ## 95 percent confidence interval: ## 4.905824 5.106176 ## sample estimates: ## mean of x ## 5.006 ``` ] .pull-right[ <img src="slides_files/figure-html/unnamed-chunk-5-1.png" width="504" style="display: block; margin: auto;" /> ] --- # Iris Example: 1-Sample Test Versicolor .pull-left[ `\(H_O: Iris\;versicolor\;Sepal\;Length = 6.0\)` ```r iris %>% filter( Species == "versicolor" ) %>% select( Sepal.Length ) %>% t.test( mu=6 ) ``` ``` ## ## One Sample t-test ## ## data: . ## t = -0.87674, df = 49, p-value = 0.3849 ## alternative hypothesis: true mean is not equal to 6 ## 95 percent confidence interval: ## 5.789306 6.082694 ## sample estimates: ## mean of x ## 5.936 ``` ] .pull-right[ <img src="slides_files/figure-html/unnamed-chunk-7-1.png" width="504" style="display: block; margin: auto;" /> ] --- # Iris Example: 1-Sample Test Virginica .pull-left[ `\(H_O: Iris\;virginica\;Sepal\;Length = 6.0\)` ```r iris %>% filter( Species == "virginica" ) %>% select( Sepal.Length ) %>% t.test( mu=6 ) ``` ``` ## ## One Sample t-test ## ## data: . ## t = 6.5386, df = 49, p-value = 3.441e-08 ## alternative hypothesis: true mean is not equal to 6 ## 95 percent confidence interval: ## 6.407285 6.768715 ## sample estimates: ## mean of x ## 6.588 ``` ] .pull-right[ <img src="slides_files/figure-html/unnamed-chunk-9-1.png" width="504" style="display: block; margin: auto;" /> ] --- # Confidence Intervals. ```r t <- iris %>% filter( Species == "virginica" ) %>% select( Sepal.Length ) %>% t.test( mu=6 ) names(t) ``` ``` ## [1] "statistic" "parameter" "p.value" "conf.int" "estimate" ## [6] "null.value" "stderr" "alternative" "method" "data.name" ``` -- The confidence interval itself is defined by the underlying t-distribution as: `\(\bar{x} - t_{\alpha, df} s_{\bar{x}} < \mu < \bar{x} + t_{\alpha, df} s_{\bar{x}}\)` ```r t$conf.int ``` ``` ## [1] 6.407285 6.768715 ## attr(,"conf.level") ## [1] 0.95 ``` --- class: inverse, sectionTitle # .yellow[Two-Sample Tests] --- # Evaluating Equality of 2 Means .pull-left[ ### Null Hypothesis `\(H_O: \mu_{versicolor} = \mu_{virginica}\)` Which is estimated by the statistic. `\(t = \frac{\bar{x}_1 - \bar{x}_2}{s_{\bar{x}_1-\bar{x}_2}}\)` where `\(s_{\bar{x}_1-\bar{x}_2}\)` is referred to as the *pooled variance* and is defined as: `\(s_{\bar{x}_1-\bar{x}_2} = \sqrt{ \frac{s_1^2}{N_1}+\frac{s_2^2}{N}}\)` ] .pull-right[ <img src="slides_files/figure-html/unnamed-chunk-12-1.png" width="504" style="display: block; margin: auto;" /> ] --- # Testing Equality of 2 Means ```r x <- iris %>% filter( Species == "versicolor") %>% select( Sepal.Length ) y <- iris %>% filter( Species == "virginica") %>% select( Sepal.Length ) t.test( x, y ) ``` ``` ## ## Welch Two Sample t-test ## ## data: x and y ## t = -5.6292, df = 94.025, p-value = 1.866e-07 ## alternative hypothesis: true difference in means is not equal to 0 ## 95 percent confidence interval: ## -0.8819731 -0.4220269 ## sample estimates: ## mean of x mean of y ## 5.936 6.588 ``` --- # Paired t-tests If the data are collected in a way that there is supposed to be a *connection* between the values of `x` and `y` (e.g., looking at fluctuating asymeetry in wing shape in individual birds where left wing = right wing), you could instead treat the two data as: `\(H_O: \mu_{x} - \mu_{y} = 0.0\)` and can be evaluated as: ```r t.test(x, y, paired = TRUE) ``` --- class: inverse, sectionTitle # .yellow[Testing Equality of Many Means] --- # Why Not All-Pairs t-tests? -- `\(H_O: Not\;Pregnant\)` .center[ ![](https://live.staticflickr.com/65535/50609390796_c9445f28fe_c_d.jpg) ] .pull-left[Rejecting of a true null hypothesis `false positive`] .pull-right[Non-rejection of a false null hypothesis, `false negative`] --- # Multiple Testing Problem The rejection region as `\(\alpha\)` is defined as the fraction of the underlying distribution that we will consider as rare enough to reject `\(H_O\)`. The probability of rejecting `\(H_O\)` incorrectly is `\(\alpha\)` and the probability of not rejecting `\(H_O\)` incorrectly is `\(1-\alpha\)`. > What is the probability of getting at least one false positive result if we test all three pairs of individual mean values in the Iris data set? -- `\(P(false\;positive) = 1 - P(no\;significant\;results)\)` which is equal to `\(1 - (1-0.05)^3 \approx 0.1426\)` If we test all three pairs of tests for the *Iris* data, what mean that we are expecting, .redinline[even if there are absolutely **no differences** between species], to reject the `\(H_O\)` at least 14% of the time! #### Change either `\(\alpha\)` (known as a Bonferroni Correction) or adjust the Model itself (preferred). --- # A Linear Model Approach .pull-left[ <img src="slides_files/figure-html/unnamed-chunk-15-1.png" width="504" style="display: block; margin: auto;" /> ] -- .pull-right[ `\(y_{ij} = \mu + \tau_i + \epsilon_j\)` where: - `\(\mu\)` is the overall mean of all the data. - `\(\tau_i\)` is the deviation of the `\(i^{th}\)` treatment level from the overall mean. - `\(\epsilon_j\)` is the deviation of the observation from the treatment mean. #### The Null Hypothesis `\(H_O: \tau_i = 0\)` ] --- # Decomposing the Variance The Analysis of Variance Table for Testing the Null Hypothesis `\(H_O: \tau_i = 0\)`. Source | df | SS | MS --------|-------|--------------------------------------------------------------------|------------------------------------------------ Among | `\(K-1\)` | `\(\sum_{i=1}^K N_i \left( \bar{x}_i - \bar{x} \right)^2\)` | `\(\frac{SS_A}{K-1}\)` Within | `\(N-K\)` | `\(\sum_{i=1}^Kn_i\left( \sum_{j=1}^{N_i}(x_{ij}-\bar{x}_i)^2 \right)\)` | `\(\frac{SS_W}{N-K}\)` Total | `\(N-1\)` | `\(\sum_{i=1}^K \sum_{j=1}^{N_i} (x_{ij} - \bar{x})^2\)` | -- .pull-left[ Which is evaluated by the statistic: `\(F = \frac{MS_A}{MS_W}\)` ] .pull-right[ defined by the most excellent `\(F\)` distribution. `\(f(x | df_A, df_W) = \frac{\sqrt{\frac{(df_Ax)^{df_A}df_W^{df_W}}{(df_Ax + df_W)^{df_W+df_A}}}}{x\mathbf{B}\left( \frac{df_A}{2}, \frac{df_W}{2} \right)}\)` ] --- # Analysis of Variance (aov) ```r fit.aov <- aov( Sepal.Length ~ Species, data=iris) fit.aov ``` ``` ## Call: ## aov(formula = Sepal.Length ~ Species, data = iris) ## ## Terms: ## Species Residuals ## Sum of Squares 63.21213 38.95620 ## Deg. of Freedom 2 147 ## ## Residual standard error: 0.5147894 ## Estimated effects may be unbalanced ``` --- # Analysis of Variance Table ```r anova( fit.aov ) ``` ``` ## Analysis of Variance Table ## ## Response: Sepal.Length ## Df Sum Sq Mean Sq F value Pr(>F) ## Species 2 63.212 31.606 119.26 < 2.2e-16 *** ## Residuals 147 38.956 0.265 ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ``` --- class: inverse, sectionTitle # .yellow[Post-Hoc Tests] --- # Why Post-Hoc? Potential Outcomes where `\(H_O\)` is rejected: - `\(\mu_{setosa} \; \ne \mu_{virginica}\)` - `\(\mu_{setosa} \; \ne \mu_{versicolor}\)` - `\(\mu_{virginica} \; \ne \mu_{versicolor}\)` - `\(\mu_{setosa} \; \ne \; \mu_{virginica} \; \ne \mu_{versicolor}\)` -- By rejecting `\(H_O\)`, we do not know which of the previous situations caused the Among-Treatment Sums of Squares to be large enough to push the estimate of `\(F\)` into the rejection region of the distribution. --- # The Tukey Test This test examines all pairs of means and determines if their confidence intervals overlap or not. ```r tuk <- TukeyHSD(fit.aov) tuk ``` ``` ## Tukey multiple comparisons of means ## 95% family-wise confidence level ## ## Fit: aov(formula = Sepal.Length ~ Species, data = iris) ## ## $Species ## diff lwr upr p adj ## versicolor-setosa 0.930 0.6862273 1.1737727 0 ## virginica-setosa 1.582 1.3382273 1.8257727 0 ## virginica-versicolor 0.652 0.4082273 0.8957727 0 ``` The difference is: `\(q = \frac{\bar{y}_{max} - \bar{y}_{min}}{\sigma \sqrt{2/N}}\)` and the confidence intervals are based upon Student-ized confidence ranges. --- # Visualizing the Tukey ```r plot( tuk, xlim=c(-1,2) ) ``` <img src="slides_files/figure-html/unnamed-chunk-19-1.png" width="504" style="display: block; margin: auto;" /> --- class: inverse, sectionTitle # .yellow[Two-Factor Models] --- # Analysis of Response with 2 Predicrors. There are times when we have more than on treatment category for a response. For example, a growth experiment may have a treatment that is high light and low light as well as one that has fewer than normal, normal, and more than normal nutrients. These are called *2-factor Analysis of Variance*. - Each factor can be considered independently. - The interaction between factors can be exmained as well. --- # Specifying A Two Factor Model Here i'm going to use a new data set because there is only one categorical variable in the *iris* data used above. Here is the old *Motor Trend Cars* dataset ```r df <- mtcars summary( df ) ``` ``` ## mpg cyl disp hp ## Min. :10.40 Min. :4.000 Min. : 71.1 Min. : 52.0 ## 1st Qu.:15.43 1st Qu.:4.000 1st Qu.:120.8 1st Qu.: 96.5 ## Median :19.20 Median :6.000 Median :196.3 Median :123.0 ## Mean :20.09 Mean :6.188 Mean :230.7 Mean :146.7 ## 3rd Qu.:22.80 3rd Qu.:8.000 3rd Qu.:326.0 3rd Qu.:180.0 ## Max. :33.90 Max. :8.000 Max. :472.0 Max. :335.0 ## drat wt qsec vs ## Min. :2.760 Min. :1.513 Min. :14.50 Min. :0.0000 ## 1st Qu.:3.080 1st Qu.:2.581 1st Qu.:16.89 1st Qu.:0.0000 ## Median :3.695 Median :3.325 Median :17.71 Median :0.0000 ## Mean :3.597 Mean :3.217 Mean :17.85 Mean :0.4375 ## 3rd Qu.:3.920 3rd Qu.:3.610 3rd Qu.:18.90 3rd Qu.:1.0000 ## Max. :4.930 Max. :5.424 Max. :22.90 Max. :1.0000 ## am gear carb ## Min. :0.0000 Min. :3.000 Min. :1.000 ## 1st Qu.:0.0000 1st Qu.:3.000 1st Qu.:2.000 ## Median :0.0000 Median :4.000 Median :2.000 ## Mean :0.4062 Mean :3.688 Mean :2.812 ## 3rd Qu.:1.0000 3rd Qu.:4.000 3rd Qu.:4.000 ## Max. :1.0000 Max. :5.000 Max. :8.000 ``` --- # Quarter Mile Speed ~ Engine Type & Transmission ```r df %>% mutate( Engine = factor( ifelse( vs == 0, "V-Shaped", "Straight") ), Transmission = factor( ifelse( am == 0, "Automatic","Manual")) ) %>% select( Time = qsec, Engine, Transmission) -> df summary( df ) ``` ``` ## Time Engine Transmission ## Min. :14.50 Straight:14 Automatic:19 ## 1st Qu.:16.89 V-Shaped:18 Manual :13 ## Median :17.71 ## Mean :17.85 ## 3rd Qu.:18.90 ## Max. :22.90 ``` --- # One Factor Analyses .pull-left[ ```r fit.engine <- aov( Time ~ Engine, data=df ) anova(fit.engine) ``` ``` ## Analysis of Variance Table ## ## Response: Time ## Df Sum Sq Mean Sq F value Pr(>F) ## Engine 1 54.872 54.872 37.315 1.03e-06 *** ## Residuals 30 44.116 1.471 ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ``` ] -- .pull-right[ ```r fit.transmission <- aov( Time ~ Transmission, data=df ) anova( fit.transmission ) ``` ``` ## Analysis of Variance Table ## ## Response: Time ## Df Sum Sq Mean Sq F value Pr(>F) ## Transmission 1 5.230 5.2301 1.6735 0.2057 ## Residuals 30 93.758 3.1253 ``` ] --- # Interaction Terms ```r fit.both <- aov( Time ~ Engine*Transmission, data=df ) anova( fit.both ) ``` ``` ## Analysis of Variance Table ## ## Response: Time ## Df Sum Sq Mean Sq F value Pr(>F) ## Engine 1 54.872 54.872 49.1643 1.261e-07 *** ## Transmission 1 12.853 12.853 11.5162 0.002077 ** ## Engine:Transmission 1 0.012 0.012 0.0104 0.919668 ## Residuals 28 31.251 1.116 ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ``` --- ```r plot( TukeyHSD( fit.both, "Engine")) ``` <img src="slides_files/figure-html/unnamed-chunk-25-1.png" width="504" style="display: block; margin: auto;" /> --- ```r plot( TukeyHSD( fit.both, "Transmission")) ``` <img src="slides_files/figure-html/unnamed-chunk-26-1.png" width="504" style="display: block; margin: auto;" /> --- ```r plot( TukeyHSD( fit.both, "Engine:Transmission")) ``` <img src="slides_files/figure-html/unnamed-chunk-27-1.png" width="504" style="display: block; margin: auto;" /> --- class: middle background-position: right background-size: auto .center[ # Questions? ![Peter Sellers](https://live.staticflickr.com/65535/50382906427_2845eb1861_o_d.gif+) ] <p> </p> .bottom[ If you have any questions for about the content presented herein, please feel free to [submit them to me](mailto://rjdyer@vcu.edu) and I'll get back to you as soon as possible.]