########################################
#
# Script: 25 January 2011 (20110125.R)
#
########################################

# Today:
#
# ** Statistical analysis of some data
#    This includes:
#     - selecting the appropriate test
#     - testing the assumptions
#     - coming to the correct conclusion
#
#     - the same as last Thursday, but with more pizzazz!

# But first, let's do proportion tests. Recall from your first course that the
# usual test of equality of a proportion is
#
#                p - p0
#   z = -----------------------
#        sqrt[ p0*(1-p0)/n ]
#
# where p0 is the hypothesized proportion and p is the estimated proportion. This,
# however, is an approximation that only works well for large n and p near 0.500.
# The distribution of z is approximately Normal.
#
# But we can use elementary rules of probability and recognize that the count n*p
# is exactly distributed as a Binomial. Then we just go to the CDF of the Binomial
# and look up the appropriate tail probability. This is exact, not approximate.
#
# H0:   The proportion of my hairs that are grey is 0.25.
# Data: I count a sample of 20 hairs and 12 are grey.

# In R:
prop.test(12,20, p=0.25)   # Approximate version (Normal approximation)
binom.test(12,20, p=0.25)  # Exact version

# To compare two sample proportions, you must use the approximation, as
# the difference of two Binomials is not Binomial (but the difference
# of two Normals IS Normal).

# Now, on to data analysis

#####
# The ncaa2009football dataset:
#
# This will give practice in comparing two means appropriately when there are
# more than two groups in the dataset. This last part is important.

# Research question:
#   Does the average number of points per game (ppg) differ between conferences
#   to a statistically significant degree?
#
# My friend's hypothesis:
#   The SEC is awesome!!! Go Vols!
#   The SEC scores more than any other conference (statistically speaking).
#   Booyah!!!
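# To see where the 'exact' p-value comes from, we can compute the Binomial tail
# probability directly from the CDF. A sketch (one-sided, for simplicity; the
# two-sided version also accumulates the improbable outcomes in the lower tail):

# P(X >= 12) when X ~ Binomial(20, 0.25), computed by hand from the CDF
1 - pbinom(11, size=20, prob=0.25)

# This should agree with the one-sided exact test:
binom.test(12, 20, p=0.25, alternative="greater")$p.value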
# Translated NULL hypothesis:
#   The mean ppg for the SEC is less than *or equal to* the mean ppg for each of
#   the other conferences.
#
# Translated ALTERNATIVE hypothesis:
#   The mean ppg for the SEC is greater than the mean ppg for each of the other
#   conferences.
#
# Thus, we are comparing the mean ppg for the SEC to each of the other
# conferences, so my first thought is to use a series of t-tests.

# Standard preamble
fball <- read.csv("http://courses.kvasaheim.com/stat40x3/data/ncaa2009football.csv", header=TRUE)
names(fball)
summary(fball)

# Graph the data to see what to expect
boxplot(score~conference, data=fball, las=1,
        ylab="Points Scored", xlab="NCAA Conference")
mSEC <- median(fball$score[fball$conference=="SEC"])
abline(h=mSEC, col=2)

# Thus, we should expect that there are no statistically significant differences

# A t-test? Let us check normality (why?)
shapiro.test(fball$score[fball$conference=="ACC"])      # No fail
shapiro.test(fball$score[fball$conference=="Big 12"])   # Fail
shapiro.test(fball$score[fball$conference=="Big East"]) # No fail
shapiro.test(fball$score[fball$conference=="Big Ten"])  # No fail
shapiro.test(fball$score[fball$conference=="Pac-10"])   # No fail
shapiro.test(fball$score[fball$conference=="SEC"])      # Fail

# A discussion on what to do
# ... when in doubt, do both to see if there is much difference

# A parametric option:
t.test(fball$score[fball$conference=="ACC"],      fball$score[fball$conference=="SEC"])
t.test(fball$score[fball$conference=="Big 12"],   fball$score[fball$conference=="SEC"])
t.test(fball$score[fball$conference=="Big East"], fball$score[fball$conference=="SEC"])
t.test(fball$score[fball$conference=="Big Ten"],  fball$score[fball$conference=="SEC"])
t.test(fball$score[fball$conference=="Pac-10"],   fball$score[fball$conference=="SEC"])

# Nope: none of these differences is statistically significant.
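# An aside: rather than calling shapiro.test() once per conference, the same
# checks can be run in a single pass. A sketch using tapply() (assuming fball
# has been read in as above):

# Shapiro-Wilk p-value for every conference at once
tapply(fball$score, fball$conference, function(x) shapiro.test(x)$p.value)
# p-values below 0.05 flag the conferences that 'fail' the normality check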
# A non-parametric option:
wilcox.test(fball$score[fball$conference=="ACC"],      fball$score[fball$conference=="SEC"])
wilcox.test(fball$score[fball$conference=="Big 12"],   fball$score[fball$conference=="SEC"])
wilcox.test(fball$score[fball$conference=="Big East"], fball$score[fball$conference=="SEC"])
wilcox.test(fball$score[fball$conference=="Big Ten"],  fball$score[fball$conference=="SEC"])
wilcox.test(fball$score[fball$conference=="Pac-10"],   fball$score[fball$conference=="SEC"])

# Again, nope: none of these differences is statistically significant.

# But we have a big problem:
#
#   What is the Type I error rate of these tests, taken together?
#
# In general, our Type I error rate is alpha = 0.05. This means (among other
# things) that we have a 95% probability of not rejecting the null hypothesis
# if the null is correct. When we perform multiple tests, we are testing EACH
# at the 0.05 level, but (assuming the tests are independent) this means we
# have only a 0.95^n chance of correctly not rejecting ALL n null hypotheses,
# so our actual Type I error rate is 1 - 0.95^n. Thus, for the 5 tests above,
# our true Type I error rate is alpha = 0.226.

1-0.95^5

# Had we compared each conference against each other conference, the true
# Type I error rate would be 1 - 0.95^15 = 0.537.

1-0.95^15

# The problem comes from multiple tests. Can we test the equality of means
# using just one test?

# Yes! Analysis of Variance!

# To the computer without any background! (The background comes on Thursday.)
#
# For Analysis of Variance, the function is aov()
# The non-parametric test is kruskal.test()

model1 <- aov(score~conference, data=fball)
summary(model1)

model2 <- kruskal.test(score~conference, data=fball)
print(model2)

# Note the differences in how these are 'called' from what we have done in the
# past. I saved the results in a variable, then called the summary() function
# to get a lot of information from the aov model, and print() to get the
# information from the Kruskal-Wallis test model.
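# If the goal really is all pairwise comparisons while holding the familywise
# Type I error rate at 0.05, base R has two standard follow-ups to the ANOVA.
# A sketch (assuming model1 and fball exist as above):

# Tukey's Honest Significant Differences on the fitted aov model:
# all 15 pairwise conference comparisons, familywise error controlled at 0.05
TukeyHSD(model1)

# Or: pairwise t-tests with a Bonferroni correction applied to the p-values
pairwise.t.test(fball$score, fball$conference, p.adjust.method="bonferroni")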
# We will pick it up here on Thursday

# Now, let's do some work on the board motivating Analysis of Variance (Ch 8)