########################################
#
# Script: 25 January 2011 (20110125.R)
#
########################################

# Today:
#
# ** Statistical analysis of some data
#    This includes:
#     - selecting the appropriate test
#     - testing the assumptions
#     - coming to the correct conclusion
#
#     - the same as last Thursday, but with more pizzazz!

# But first, let's do proportion tests. Recall from your first course that the
# usual test of equality of a proportion is
#
#                p - p0
#   z = -----------------------
#        sqrt[ p0*(1-p0)/n ]
#
# where p0 is the hypothesized proportion and p is the estimated proportion. This,
# however, is an approximation that only works well for large n and p near 0.500.
# The distribution of z is approximately Normal.
#
# But we can use elementary rules of probability and recognize that the count n*p
# is exactly distributed as a Binomial. Then we just go to the CDF of the Binomial
# and look up the appropriate tail probability. This is exact, not approximate.
#
# H0:   The proportion of my hairs that are grey is 0.25.
# Data: I count a sample of 20 hairs and 12 are grey.

# In R:
prop.test(12,20, p=0.25)   # Approximate version (Normal approximation)
binom.test(12,20, p=0.25)  # Exact version

# To compare two sample proportions, you must use the approximation, as
# the difference of two Binomials is not Binomial (but the difference
# of two Normals IS Normal).

# Now, on to data analysis

#####
# The ncaa2009football dataset:
#
# This will give practice in comparing two means appropriately when there are
# more than two groups in the dataset. This last part is important.

# Research question:
#   Does the average number of points per game (ppg) differ between conferences
#   to a statistically significant degree?
#
# My friend's hypothesis:
#   The SEC is awesome!!! Go Vols!
#   The SEC scores more than any other conference (statistically speaking).
#   Booyah!!!
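# To see where the 'exact' p-value comes from, we can compute the Binomial tail
# probability directly from the CDF. A sketch (one-sided, for simplicity; the
# two-sided version also accumulates the improbable outcomes in the lower tail):

# P(X >= 12) when X ~ Binomial(20, 0.25), computed by hand from the CDF
1 - pbinom(11, size=20, prob=0.25)

# This should agree with the one-sided exact test:
binom.test(12, 20, p=0.25, alternative="greater")$p.value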
# Translated NULL hypothesis:
#   The mean ppg for the SEC is less than *or equal to* the mean ppg for each of
#   the other conferences.
#
# Translated ALTERNATIVE hypothesis:
#   The mean ppg for the SEC is greater than the mean ppg for each of the other
#   conferences.
#
# Thus, we are comparing the mean ppg for the SEC to each of the other
# conferences, so my first thought is to use a series of t-tests.

# Standard preamble
fball <- read.csv("http://courses.kvasaheim.com/stat40x3/data/ncaa2009football.csv", header=TRUE)
names(fball)
summary(fball)

# Graph the data to see what to expect
boxplot(score~conference, data=fball, las=1,
        ylab="Points Scored", xlab="NCAA Conference")
mSEC <- median(fball$score[fball$conference=="SEC"])
abline(h=mSEC, col=2)

# Thus, we should expect that there are no statistically significant differences

# A t-test? Let us check normality (why?)
shapiro.test(fball$score[fball$conference=="ACC"])      # No fail
shapiro.test(fball$score[fball$conference=="Big 12"])   # Fail
shapiro.test(fball$score[fball$conference=="Big East"]) # No fail
shapiro.test(fball$score[fball$conference=="Big Ten"])  # No fail
shapiro.test(fball$score[fball$conference=="Pac-10"])   # No fail
shapiro.test(fball$score[fball$conference=="SEC"])      # Fail

# A discussion on what to do
# ... when in doubt, do both to see if there is much difference

# A parametric option:
t.test(fball$score[fball$conference=="ACC"],      fball$score[fball$conference=="SEC"])
t.test(fball$score[fball$conference=="Big 12"],   fball$score[fball$conference=="SEC"])
t.test(fball$score[fball$conference=="Big East"], fball$score[fball$conference=="SEC"])
t.test(fball$score[fball$conference=="Big Ten"],  fball$score[fball$conference=="SEC"])
t.test(fball$score[fball$conference=="Pac-10"],   fball$score[fball$conference=="SEC"])

# Nope: none of these differences is statistically significant.
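# An aside: rather than calling shapiro.test() once per conference, the same
# checks can be run in a single pass. A sketch using tapply() (assuming fball
# has been read in as above):

# Shapiro-Wilk p-value for every conference at once
tapply(fball$score, fball$conference, function(x) shapiro.test(x)$p.value)
# p-values below 0.05 flag the conferences that 'fail' the normality check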
# A non-parametric option:
wilcox.test(fball$score[fball$conference=="ACC"],      fball$score[fball$conference=="SEC"])
wilcox.test(fball$score[fball$conference=="Big 12"],   fball$score[fball$conference=="SEC"])
wilcox.test(fball$score[fball$conference=="Big East"], fball$score[fball$conference=="SEC"])
wilcox.test(fball$score[fball$conference=="Big Ten"],  fball$score[fball$conference=="SEC"])
wilcox.test(fball$score[fball$conference=="Pac-10"],   fball$score[fball$conference=="SEC"])

# Again, nope: none of these differences is statistically significant.

# But we have a big problem:
#
#   What is the Type I error rate of these tests, taken together?
#
# In general, our Type I error rate is alpha = 0.05. This means (among other
# things) that we have a 95% probability of not rejecting the null hypothesis
# if the null is correct. When we perform multiple tests, we are testing EACH
# at the 0.05 level, but (assuming the tests are independent) this means we
# have only a 0.95^n chance of correctly not rejecting ALL n null hypotheses,
# so our actual Type I error rate is 1 - 0.95^n. Thus, for the 5 tests above,
# our true Type I error rate is alpha = 0.226.

1-0.95^5

# Had we compared each conference against each other conference, the true
# Type I error rate would be 1 - 0.95^15 = 0.537.

1-0.95^15

# The problem comes from multiple tests. Can we test the equality of means
# using just one test?

# Yes! Analysis of Variance!

# To the computer without any background! (The background comes on Thursday.)
#
# For Analysis of Variance, the function is aov()
# The non-parametric test is kruskal.test()

model1 <- aov(score~conference, data=fball)
summary(model1)

model2 <- kruskal.test(score~conference, data=fball)
print(model2)

# Note the differences in how these are 'called' from what we have done in the
# past. I saved the results in a variable, then called the summary() function
# to get a lot of information from the aov model, and print() to get the
# information from the Kruskal-Wallis test model.
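# If the goal really is all pairwise comparisons while holding the familywise
# Type I error rate at 0.05, base R has two standard follow-ups to the ANOVA.
# A sketch (assuming model1 and fball exist as above):

# Tukey's Honest Significant Differences on the fitted aov model:
# all 15 pairwise conference comparisons, familywise error controlled at 0.05
TukeyHSD(model1)

# Or: pairwise t-tests with a Bonferroni correction applied to the p-values
pairwise.t.test(fball$score, fball$conference, p.adjust.method="bonferroni")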
# We will pick it up here on Thursday

# Now, let's do some work on the board motivating Analysis of Variance (Ch 8)