####################
#
# Filename: 20110830.R
#
####################
#
# Purpose: Show how to perform an
#  analysis of variance test in R
#  AND how to check the assumptions
#  using appropriate tests.
#
# Note: In class, we only used one or
#  two tests. Here, I introduce several.
#

# As usual, let us start by reading in a data file
fb <- read.csv("http://courses.kvasaheim.com/pols6123/data/ncaa2009football.csv")
names(fb)
attach(fb)
summary(fb)

# Now, here is what we are going to do:
# Compare six NCAA conferences in terms of points scored
# in their football games in 2009.

# For analysis of variance, the hypotheses are ALWAYS:
#
#   H0: All means are equal.
#   HA: At least one mean is different.
#

# So, the ANOVA test
model1 <- aov(score~conference)

# Note it is 'aov', not 'ANOVA'.
# Also note we store these results
# in a variable. IMPORTANT.

# To see why it is important:
aov(score~conference)   # this gives not much of interest
summary(model1)         # this gives much more of interest
names(model1)           # this lists all the information stored
                        # in model1, which is the most interesting

# Assumption #1: The measurements are Normally distributed in each group

# Graphical:
boxplot(score~conference)
hist(score[conference=="ACC"])
hist(score[conference=="Big 12"])
hist(score[conference=="Big East"])
hist(score[conference=="Big Ten"])
hist(score[conference=="Pac-10"])
hist(score[conference=="SEC"])
# As these are utility graphs (not for publication), we can use the defaults.

# Conclusion: None looked /too/ non-Normal. But, let us try numerical tests, too.
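# Aside (not from the original class notes): the six per-group Normality checks
# can also be run in one pass with tapply(), and p.adjust() applies a Holm
# correction for the multiple-testing issue that arises from running several
# Shapiro-Wilk tests at once. This is a sketch on simulated data (hypothetical
# groups A/B/C), so it runs even without the course data file.
set.seed(1)
demo.score <- rnorm(90, mean = 24, sd = 7)       # made-up points per game
demo.conf  <- rep(c("A", "B", "C"), each = 30)   # three fake conferences
pvals <- tapply(demo.score, demo.conf,
                function(x) shapiro.test(x)$p.value)
pvals                              # one Shapiro-Wilk p-value per group
p.adjust(pvals, method = "holm")   # corrected for testing several groups at once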
# Numerical:
var.test(score~conference)      # Error: too many groups (this test is only for two groups)
shapiro.test(score~conference)  # Error: must give it a single vector
shapiro.test(score[conference=="ACC"])
shapiro.test(score[conference=="Big 12"])
shapiro.test(score[conference=="Big East"])
shapiro.test(score[conference=="Big Ten"])
shapiro.test(score[conference=="Pac-10"])
shapiro.test(score[conference=="SEC"])
# The Big 12 and SEC are not Normal according to Shapiro-Wilk, but
# remember the multiple-testing issue.

# Assumption #2: The measurements in each group have the same variance
# We can use graphical and numerical tests here, as well.

# Graphical:
boxplot(score~conference)
# As this is a utility graph (not for publication), we can use the default.
# Conclusion: The conferences look similar with respect to variance.

# Numerical:
bartlett.test(score~conference)
fligner.test(score~conference)
# Conclusion: It passes both tests; therefore, the groups do not differ significantly
# in terms of variance.

# Hypothesis conclusion:
# As the p-value is (much) greater than our pre-chosen alpha=0.05, we fail
# to reject the null hypothesis. As such, we conclude that there is no
# significant difference in average points scored per game across the
# six NCAA major conferences in 2009 (F=0.6159; df1=5; df2=774; p=0.6877).

# These data pass the assumptions. Therefore, we can use the results from model1 with
# confidence. However, we need to remain humble. ANOVA is based on Normality and
# equal variance. We only showed that the assumptions were reasonable. Thus, the
# results are only approximately accurate.

#####
# Non-parametric tests

# Let us pretend that the data and model failed one or both assumptions above.
# From today's notes, this means we have three options:
#   1. We can transform the dependent variable (covered in Ch 5)
#   2. We can perform Monte Carlo tests (covered in a future class)
#   3. We can use non-parametric tests (covered here, below)
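# Aside (not from the original class notes): a sketch of option 1 on simulated,
# right-skewed data (hypothetical groups, so it runs without the course file).
# A square-root (or log) transform can pull a skewed response closer to
# Normality before fitting the ANOVA with aov().
set.seed(2)
skewed <- data.frame(
  y = rexp(60, rate = 1/20),              # made-up, right-skewed "scores"
  g = rep(c("A", "B", "C"), each = 20)    # three fake groups
)
shapiro.test(skewed$y)$p.value            # Shapiro-Wilk on the raw scale
shapiro.test(sqrt(skewed$y))$p.value      # typically larger on the transformed scale
model.t <- aov(sqrt(y) ~ g, data = skewed)  # ANOVA on the transformed response
summary(model.t)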
# The notes tell us that there are three useful non-parametric tests when drawing
# conclusions on population means. The one chosen depends on the number of groups.
# Here, we have more than 2 groups, so we will use the Kruskal-Wallis test.
kruskal.test(score~conference)

# Conclusion:
# As the p-value is (much) greater than our pre-chosen alpha=0.05, we fail
# to reject the null hypothesis. As such, we conclude that there is no
# significant difference in average points scored per game across the
# six NCAA major conferences in 2009 (X2=1.8754; df=5; p=0.8661).
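# Aside (not from the original class notes): with only two groups, a common
# rank-based analogue is the Wilcoxon rank-sum test, wilcox.test();
# kruskal.test() extends the same idea to three or more groups. This is a
# sketch on simulated data (hypothetical groups), so it runs without the
# course data file.
set.seed(3)
g1 <- rnorm(25, mean = 24, sd = 7)
g2 <- rnorm(25, mean = 26, sd = 7)
g3 <- rnorm(25, mean = 25, sd = 7)
w <- wilcox.test(g1, g2)             # two groups: Wilcoxon rank-sum
k <- kruskal.test(list(g1, g2, g3))  # three groups: Kruskal-Wallis
w$p.value
k$parameter                          # df = number of groups - 1 = 2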