##### SCA-11 #####
##### One-Sample Means Tests #####

### This gives a few examples of the analysis process for testing
### for a population mean


### Preamble

# Import extra functionality
source("http://rfs.kvasaheim.com/stat200.R")

# Read in data
dt = read.csv("http://rfs.kvasaheim.com/data/ocanada.csv")
summary(dt)
attach(dt)


### Part I: Average Class Size
#
# Because I would like to estimate the average class size,
# I am estimating a single population mean. As such, I would
# prefer to use a one-sample t-procedure (it is the most powerful
# of our options). That procedure has an assumption: The data
# are generated from a Normal distribution (process). To check
# this assumption, we look at the distribution and perform a
# Shapiro-Wilk test.

normoverlay(classSize)
shapiroTest(classSize)

# Because the distribution does not appear too different from Normal
# and the Shapiro-Wilk test does not detect a violation of this
# assumption (p-value = 0.5892), we can use the one-sample t-procedure.

t.test(classSize)

# According to the one-sample t-procedure, we are 95% confident that
# the average class size is between 20.2 and 28.4 students.

# Basic graphic
boxplot(classSize, ylab="Class Size")


### Part II: Average Age
#
# I hypothesize that the average age in the classrooms is 21. To
# test this, I would prefer to use a one-sample t-test. It requires
# that the data are generated from a Normal process. To check this
# requirement, we look at a histogram of the data and we perform
# the Shapiro-Wilk test.

normoverlay(averageAge)
shapiroTest(averageAge)

# Because the distribution does not appear Normal and the
# Shapiro-Wilk test does detect a violation of this assumption
# (p-value = 0.0238), we cannot use the one-sample t-test.
#
# The next most powerful test to try is the Wilcoxon test. It assumes
# the data are from a symmetric distribution. To check this, we use
# the Hildebrand Rule.
hildebrand.rule(averageAge)

# Fortunately, the data do seem to come from a symmetric process.
# Thus, we can/should use the Wilcoxon test.

wilcox.test(averageAge, mu=21, conf.int=TRUE)

# According to the Wilcoxon test, the average age in the classes is
# not 21 years (p-value = 0.000082). A 95% confidence interval for the
# average age is from 19.95 to 19.94 years.

# Basic graphic
boxplot(averageAge, ylab="Average Age")
abline(h=21, col="red")


### Part III: Average proportion correct
#
# I hypothesize that the average proportion of students who can
# locate Canada on the map is greater than 85%. To test this, we
# may want to use the one-sample t-test. It requires that the data
# are generated from a Normal process. To check this requirement,
# we look at a histogram of the data and we perform the Shapiro-Wilk
# test.

propCorrect = correct/classSize
normoverlay(propCorrect)
shapiroTest(propCorrect)

# Because the distribution appears quite different from Normal
# and the Shapiro-Wilk test detects a violation of this
# assumption (p-value = 0.0011), we cannot use the one-sample t-test.
#
# The next most powerful test to try is the Wilcoxon test. It assumes
# the data are from a symmetric distribution. To check this, we use
# the Hildebrand Rule.

hildebrand.rule(propCorrect)

# Unfortunately, the data do not come from a symmetric process. Thus, we
# must use the non-parametric bootstrap.

# Resample the data with replacement and record each resample's mean
mn = numeric(1e4)
for(i in 1:1e4) {
  x = sample(propCorrect, replace=TRUE)
  mn[i] = mean(x)
}

mean(mn<=0.85)                  # estimated p-value
quantile(mn, c(0.025,0.975))    # 95% confidence interval

# Because the p-value of 0.9971 is not less than alpha, we cannot
# reject the null hypothesis. The claim (alternative hypothesis)
# that more than 85% of the students can find Canada on the map is
# not supported by these data. In fact, we are 95% confident that
# the proportion of students who can locate Canada on the map is
# between 50.9% and 80.4%.


### NOTE: Some may argue that Part III could also use a Binomial test.
# Here are the results from that analysis:

totCorrect = sum(correct)
totStudents = sum(classSize)

# One-sided test of the hypothesis
binom.test(totCorrect,totStudents, p=0.85, alternative="greater")

# Two-sided call, used here for its 95% confidence interval
binom.test(totCorrect,totStudents)

# Because the p-value of 1.0000 is not less than alpha, we cannot
# reject the null hypothesis. The claim (alternative hypothesis)
# that more than 85% of the students can find Canada on the map is
# not supported by these data. In fact, we are 95% confident that
# the proportion of students who can locate Canada on the map is
# between 61.9% and 70.5%.


### NOTE: The results are substantively the same. This is not surprising.
# When multiple procedures are possible, the results should be quite
# similar. Here, both procedures were appropriate. The Binomial Test,
# however, was the better of the two because our hypothesis was about
# a population proportion.
#
# A discussion of the Binomial test comes in SCA-12.
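

### Sketch: the same checks using base R only
#
# normoverlay(), shapiroTest(), and hildebrand.rule() come from stat200.R.
# The lines below are a rough sketch of how similar checks might be done
# without that file. They rest on assumptions and are NOT the definitions
# used in stat200.R:
#   - the Hildebrand ratio is assumed to be (mean - median)/sd, with the
#     data treated as symmetric enough when its absolute value is below 0.20;
#   - simData and mu0 are hypothetical stand-ins, not the ocanada data.

set.seed(200)                            # only so this illustration repeats
simData = rnorm(30, mean=20, sd=2)       # ASSUMPTION: simulated sample
mu0 = 21                                 # ASSUMPTION: hypothesized mean

# Histogram with a Normal curve overlaid (a stand-in for normoverlay)
hist(simData, freq=FALSE, main="", xlab="simData")
curve(dnorm(x, mean=mean(simData), sd=sd(simData)), add=TRUE)

# Shapiro-Wilk test from the base stats package
shapiro.test(simData)

# Hildebrand ratio under the assumed definition above
H = (mean(simData) - median(simData)) / sd(simData)
abs(H) < 0.20                            # TRUE suggests rough symmetry

# Non-parametric bootstrap of the mean, written with replicate()
# instead of an explicit loop
mnSketch = replicate(1e4, mean(sample(simData, replace=TRUE)))
mean(mnSketch <= mu0)                    # same form as mean(mn<=0.85) above
quantile(mnSketch, c(0.025,0.975))       # 95% percentile interval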