####################
#
# Filename: 20110830.R
#
####################
#
# Purpose: Show how to perform an
# analysis of variance test in R
# AND how to check the assumptions
# using appropriate tests.
#
# Note: In class, we only used one or
# two tests. Here, I introduce several.
#
# As usual, let us start by reading in a data file
fb <- read.csv("http://courses.kvasaheim.com/pols6123/data/ncaa2009football.csv")
names(fb)
attach(fb)
summary(fb)
# Now, here is what we are going to do:
# Compare six NCAA conferences in terms of points scored
# in their football games in 2009.
# For analysis of variance, the hypotheses are ALWAYS:
#
# H0: All means are equal.
# HA: At least one mean is different.
#
# So, the ANOVA test
model1 <- aov(score~conference) # Note it is 'aov' not 'ANOVA'
# also note we store these results
# in a variable. IMPORTANT
# To see why it is important,
aov(score~conference)  # printing the fit alone gives little of interest
summary(model1)        # this gives the much more interesting ANOVA table
names(model1)          # this lists all the information stored in model1,
                       # which is the most interesting part
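#
# (Aside, not covered in class:) Here is a quick sketch of pulling stored
# components out of a fitted aov object. It uses R's built-in 'chickwts'
# data set as a stand-in, so it runs without our course data file; the same
# pattern works with model1.
m.demo <- aov(weight ~ feed, data = chickwts)  # six feed groups
names(m.demo)                                  # components stored in the fit
m.demo$coefficients                            # estimated group effects
head(m.demo$residuals)                         # residuals, useful for checks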
# Assumption #1: The measurements are Normally distributed in each group
# Graphical:
boxplot(score~conference)
hist(score[conference=="ACC"])
hist(score[conference=="Big 12"])
hist(score[conference=="Big East"])
hist(score[conference=="Big Ten"])
hist(score[conference=="Pac-10"])
hist(score[conference=="SEC"])
# As these are utility graphs (not for publication), we can use the default
# Conclusion: None looked /too/ non-Normal. But, let's try numerical tests, too.
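#
# (Aside:) Rather than typing one hist() call per conference, you can loop
# over the group labels. A sketch using the built-in 'chickwts' data as a
# stand-in; the same pattern works with score and conference.
par(mfrow = c(2, 3))                     # a 2x3 grid, one panel per group
for (f in levels(chickwts$feed)) {
  hist(chickwts$weight[chickwts$feed == f], main = f, xlab = "weight")
}
par(mfrow = c(1, 1))                     # reset the plotting grid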
# Numerical:
var.test(score~conference)  # Error: too many groups (var.test compares the
                            # variances of exactly two groups)
shapiro.test(score~conference) # Error: must give it a single vector
shapiro.test(score[conference=="ACC"])
shapiro.test(score[conference=="Big 12"])
shapiro.test(score[conference=="Big East"])
shapiro.test(score[conference=="Big Ten"])
shapiro.test(score[conference=="Pac-10"])
shapiro.test(score[conference=="SEC"])
# The Big 12 and SEC are not Normal according to Shapiro-Wilk, but
# remember the multiple testing issue.
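#
# (Aside:) You can run Shapiro-Wilk on every group at once with tapply(), and
# then adjust the p-values for the multiple-testing issue, e.g. with the
# Bonferroni correction via p.adjust(). A sketch with 'chickwts' as stand-in
# data; the same pattern works with score and conference.
p.raw <- tapply(chickwts$weight, chickwts$feed,
                function(x) shapiro.test(x)$p.value)
p.raw                                    # one raw p-value per group
p.adjust(p.raw, method = "bonferroni")   # adjusted for the six tests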
# Assumption #2: The measurements in each group have the same variance
# We can use graphical and numerical tests here, as well
# Graphical:
boxplot(score~conference)
# As this is a utility graph (not for publication), we can use the default
# Conclusion: The conferences look similar with respect to variance
# Numerical:
bartlett.test(score~conference)
fligner.test(score~conference)
# Conclusion: It passes both tests; therefore, the groups do not differ
# significantly in terms of variance.
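#
# (Aside:) Both tests return an object of class "htest", so you can store the
# result and pull out the p-value directly instead of reading it off the
# printout. A sketch with 'chickwts' as stand-in data:
bt <- bartlett.test(weight ~ feed, data = chickwts)
ft <- fligner.test(weight ~ feed, data = chickwts)
bt$p.value   # compare against alpha
ft$p.value   # Fligner-Killeen is more robust to non-Normality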
# Hypothesis conclusion:
# As the p-value is (much) greater than our pre-chosen alpha=0.05, we fail
# to reject the null hypothesis. As such, we conclude that there is no
# significant difference in average points scored per game across the
# six NCAA major conferences in 2009 (F=0.6159; df1=5; df2=774; p=0.6877).
# These pass the assumptions. Therefore, we can use the results from model1 with
# confidence. However, we need to remain humble. ANOVA is based on Normality and
# equal variance. We only showed that the assumptions were reasonable. Thus, the
# results are only approximately accurate.
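#
# (Aside:) The numbers quoted above (F, df1, df2, p) can be extracted from the
# fitted model rather than retyped by hand. A sketch with 'chickwts' as
# stand-in data; the same extraction works on model1.
m.demo <- aov(weight ~ feed, data = chickwts)
tab <- summary(m.demo)[[1]]    # the ANOVA table
tab[1, "F value"]              # the F statistic
tab[1, "Pr(>F)"]               # the p-value
tab[, "Df"]                    # df1 (groups) and df2 (residuals)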
#####
# Non-parametric tests
# Let us pretend that the data and model failed one or both assumptions above.
# From today's notes, this means we have three options:
# 1. We can transform the dependent variable (covered in Ch 5)
# 2. We can perform Monte Carlo tests (covered in a future class)
# 3. We can use non-parametric tests (covered here, below)
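#
# (Aside on option 1:) A common transformation is the logarithm. A sketch with
# 'chickwts' as stand-in data (its weights are all positive, so a plain log is
# safe; for scores that can be zero, log(score + 1) is the usual guard).
m.log <- aov(log(weight) ~ feed, data = chickwts)
summary(m.log)   # the ANOVA table on the transformed scale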
# The notes tell us that there are three useful non-parametric tests when drawing
# conclusions about population means. The one chosen depends on the number of groups.
# Here, we have more than 2 groups, so we will use the Kruskal-Wallis test.
kruskal.test(score~conference)
# Conclusion:
# As the p-value is (much) greater than our pre-chosen alpha=0.05, we fail
# to reject the null hypothesis. As such, we conclude that there is no
# significant difference in average points scored per game across the
# six NCAA major conferences in 2009 (X2=1.8754; df=5; p=0.8661).
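#
# (Aside:) kruskal.test() also returns an "htest" object, so the statistic,
# degrees of freedom, and p-value can be extracted instead of retyped. A sketch
# with 'chickwts' as stand-in data:
kt <- kruskal.test(weight ~ feed, data = chickwts)
kt$statistic   # the Kruskal-Wallis chi-squared statistic
kt$parameter   # degrees of freedom (number of groups minus one)
kt$p.value     # compare against alpha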