####################
#
# Filename: 20110830.R
#
####################
#
# Purpose: Show how to perform an
#  analysis of variance test in R
#  AND how to check the assumptions
#  using appropriate tests.
#
# Note: In class, we only used one or
#  two tests. Here, I introduce several.
#

# As usual, let us start by reading in a data file
fb <- read.csv("http://courses.kvasaheim.com/pols6123/data/ncaa2009football.csv")
names(fb)
attach(fb)
summary(fb)

# Now, here is what we are going to do:
# Compare six NCAA conferences in terms of points scored
# in their football games in 2009.

# For analysis of variance, the hypotheses are ALWAYS:
#
#   H0: All means are equal.
#   HA: At least one mean is different.
#

# So, the ANOVA test
model1 <- aov(score~conference)

# Note it is 'aov', not 'ANOVA'.
# Also note we store these results
# in a variable. IMPORTANT.

# To see why it is important:
aov(score~conference)   # this gives not much of interest
summary(model1)         # this gives much more of interest
names(model1)           # this lists all the information stored
                        # in model1, which is the most interesting

# Assumption #1: The measurements are Normally distributed in each group

# Graphical:
boxplot(score~conference)
hist(score[conference=="ACC"])
hist(score[conference=="Big 12"])
hist(score[conference=="Big East"])
hist(score[conference=="Big Ten"])
hist(score[conference=="Pac-10"])
hist(score[conference=="SEC"])
# As these are utility graphs (not for publication), we can use the defaults.

# Conclusion: None looked /too/ non-Normal. But, let us try numerical tests, too.
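# Aside (not from the original class notes): the six per-group Normality checks
# can also be run in one pass with tapply(), and p.adjust() applies a Holm
# correction for the multiple-testing issue that arises from running several
# Shapiro-Wilk tests at once. This is a sketch on simulated data (hypothetical
# groups A/B/C), so it runs even without the course data file.
set.seed(1)
demo.score <- rnorm(90, mean = 24, sd = 7)       # made-up points per game
demo.conf  <- rep(c("A", "B", "C"), each = 30)   # three fake conferences
pvals <- tapply(demo.score, demo.conf,
                function(x) shapiro.test(x)$p.value)
pvals                              # one Shapiro-Wilk p-value per group
p.adjust(pvals, method = "holm")   # corrected for testing several groups at once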
# Numerical:
var.test(score~conference)      # Error: too many groups (this test is only for two groups)
shapiro.test(score~conference)  # Error: must give it a single vector
shapiro.test(score[conference=="ACC"])
shapiro.test(score[conference=="Big 12"])
shapiro.test(score[conference=="Big East"])
shapiro.test(score[conference=="Big Ten"])
shapiro.test(score[conference=="Pac-10"])
shapiro.test(score[conference=="SEC"])
# The Big 12 and SEC are not Normal according to Shapiro-Wilk, but
# remember the multiple-testing issue.

# Assumption #2: The measurements in each group have the same variance
# We can use graphical and numerical tests here, as well.

# Graphical:
boxplot(score~conference)
# As this is a utility graph (not for publication), we can use the default.
# Conclusion: The conferences look similar with respect to variance.

# Numerical:
bartlett.test(score~conference)
fligner.test(score~conference)
# Conclusion: It passes both tests; therefore, the groups do not differ significantly
# in terms of variance.

# Hypothesis conclusion:
# As the p-value is (much) greater than our pre-chosen alpha=0.05, we fail
# to reject the null hypothesis. As such, we conclude that there is no
# significant difference in average points scored per game across the
# six NCAA major conferences in 2009 (F=0.6159; df1=5; df2=774; p=0.6877).

# These data pass the assumptions. Therefore, we can use the results from model1 with
# confidence. However, we need to remain humble. ANOVA is based on Normality and
# equal variance. We only showed that the assumptions were reasonable. Thus, the
# results are only approximately accurate.

#####
# Non-parametric tests

# Let us pretend that the data and model failed one or both assumptions above.
# From today's notes, this means we have three options:
#   1. We can transform the dependent variable (covered in Ch 5)
#   2. We can perform Monte Carlo tests (covered in a future class)
#   3. We can use non-parametric tests (covered here, below)
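# Aside (not from the original class notes): a sketch of option 1 on simulated,
# right-skewed data (hypothetical groups, so it runs without the course file).
# A square-root (or log) transform can pull a skewed response closer to
# Normality before fitting the ANOVA with aov().
set.seed(2)
skewed <- data.frame(
  y = rexp(60, rate = 1/20),              # made-up, right-skewed "scores"
  g = rep(c("A", "B", "C"), each = 20)    # three fake groups
)
shapiro.test(skewed$y)$p.value            # Shapiro-Wilk on the raw scale
shapiro.test(sqrt(skewed$y))$p.value      # typically larger on the transformed scale
model.t <- aov(sqrt(y) ~ g, data = skewed)  # ANOVA on the transformed response
summary(model.t)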
# The notes tell us that there are three useful non-parametric tests when drawing
# conclusions on population means. The one chosen depends on the number of groups.
# Here, we have more than 2 groups, so we will use the Kruskal-Wallis test.
kruskal.test(score~conference)

# Conclusion:
# As the p-value is (much) greater than our pre-chosen alpha=0.05, we fail
# to reject the null hypothesis. As such, we conclude that there is no
# significant difference in average points scored per game across the
# six NCAA major conferences in 2009 (X2=1.8754; df=5; p=0.8661).
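# Aside (not from the original class notes): with only two groups, a common
# rank-based analogue is the Wilcoxon rank-sum test, wilcox.test();
# kruskal.test() extends the same idea to three or more groups. This is a
# sketch on simulated data (hypothetical groups), so it runs without the
# course data file.
set.seed(3)
g1 <- rnorm(25, mean = 24, sd = 7)
g2 <- rnorm(25, mean = 26, sd = 7)
g3 <- rnorm(25, mean = 25, sd = 7)
w <- wilcox.test(g1, g2)             # two groups: Wilcoxon rank-sum
k <- kruskal.test(list(g1, g2, g3))  # three groups: Kruskal-Wallis
w$p.value
k$parameter                          # df = number of groups - 1 = 2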