##### SCA-11
##### 
##### One-Sample Means Tests
##### 

### This script gives a few examples of the analysis process for
### testing hypotheses about a population mean



### Preamble

# Import extra functionality
source("http://rfs.kvasaheim.com/stat200.R")

# Read in data
dt = read.csv("http://rfs.kvasaheim.com/data/ocanada.csv")
summary(dt)

attach(dt)



### Part I: Average Class Size
# 
#   Because I would like to estimate the average class size,
#   I am estimating a single population mean. As such, I would
#   prefer to use a one-sample t-procedure (it is the most powerful 
#   of our options). That procedure has an assumption: The data
#   are generated from a Normal distribution (process). To check
#   this assumption, we look at the distribution and perform a 
#   Shapiro-Wilk test.

normoverlay(classSize)
shapiroTest(classSize)
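
#   A base-R sketch of the same two checks, for readers without the
#   course functions. This assumes normoverlay() draws a histogram with
#   a Normal curve on top and that shapiroTest() wraps shapiro.test();
#   neither assumption is confirmed by this script. It also assumes
#   classSize has no missing values.

hist(classSize, freq=FALSE, xlab="Class Size", main="")
curve(dnorm(x, mean=mean(classSize), sd=sd(classSize)), add=TRUE, col="red")
shapiro.test(classSize)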

#   Because the distribution does not appear too different from Normal
#   and the Shapiro-Wilk test does not detect a violation of this
#   assumption (p-value = 0.5892), we can use the one-sample t-procedure.

t.test(classSize)

#   According to the one-sample t-procedure, we are 95% confident that 
#   the average class size is between 20.2 and 28.4 students.
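
#   The interval reported by t.test() can be reproduced by hand, which
#   shows where it comes from: the sample mean, plus or minus a t
#   critical value times the standard error. A minimal sketch, assuming
#   classSize has no missing values; the names n, xbar, se, and tstar
#   are introduced here for illustration only.

n     = length(classSize)
xbar  = mean(classSize)
se    = sd(classSize)/sqrt(n)     # standard error of the mean
tstar = qt(0.975, df=n-1)         # critical value for 95% confidence
xbar + c(-1,1)*tstar*se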

# Basic graphic
boxplot(classSize, ylab="Class Size")




### Part II: Average Age
#
#   I hypothesize that the average age in the classrooms is 21. To
#   test this, I would prefer to use a one-sample t-test. It requires
#   that the data are generated from a Normal process. To check this
#   requirement, we look at a histogram of the data and we perform
#   the Shapiro-Wilk test.

normoverlay(averageAge)
shapiroTest(averageAge)

#   Because the distribution does not appear Normal and the 
#   Shapiro-Wilk test does detect a violation of this assumption 
#   (p-value = 0.0238), we should not use the one-sample t-test.
#
#   The next most powerful test to try is the Wilcoxon test. It assumes
#   the data are from a symmetric distribution. To check this, we use
#   the Hildebrand Rule.

hildebrand.rule(averageAge)
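
#   For reference, the Hildebrand ratio can be computed directly. This
#   sketch assumes the usual statement of the rule: the ratio is
#   (mean - median)/sd, and an absolute value below 0.2 indicates that
#   the skew is small enough to ignore. hildebrand.rule() may differ
#   in its details; H is a name introduced here for illustration.

H = (mean(averageAge) - median(averageAge))/sd(averageAge)
H
abs(H) < 0.2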

#   Fortunately, the data do seem to come from a symmetric process. 
#   Thus, we can/should use the Wilcoxon test.

wilcox.test(averageAge, mu=21, conf.int=TRUE)

#   According to the Wilcoxon test, the average age in the classes is
#   not 21 years (p-value = 0.000082). A 95% confidence interval for the 
#   average age is from 19.95 to 19.94 years.


# Basic graphic
boxplot(averageAge, ylab="Average Age")
abline(h=21, col="red")






### Part III: Average Proportion Correct
#
#   I hypothesize that the average proportion of students who can 
#   locate Canada on the map is greater than 85%. To test this, we
#   may want to use the one-sample t-test. It requires that the data 
#   are generated from a Normal process. To check this requirement, 
#   we look at a histogram of the data and we perform the Shapiro-Wilk 
#   test.

propCorrect = correct/classSize
normoverlay(propCorrect)
shapiroTest(propCorrect)

#   Because the distribution appears quite different from Normal
#   and the Shapiro-Wilk test detects a violation of this
#   assumption (p-value=0.0011), we cannot use the one-sample t-test.
#
#   The next most powerful test to try is the Wilcoxon test. It assumes
#   the data are from a symmetric distribution. To check this, we use
#   the Hildebrand Rule.

hildebrand.rule(propCorrect)

#   Unfortunately, the data do not come from a symmetric process. Thus, we
#   must use the non-parametric bootstrap.

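# Bootstrap the sample mean: resample with replacement 10,000 times
B  = 1e4
mn = numeric(B)                 # preallocate rather than grow the vector
for(i in 1:B) {
  x = sample(propCorrect, replace=TRUE)
  mn[i] = mean(x)
}
mean(mn<=0.85)                  # approximate p-value for H_a: mean > 0.85
quantile(mn, c(0.025,0.975))    # 95% percentile confidence interval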

#   Because the p-value of 0.9971 is not less than alpha, we cannot 
#   reject the null hypothesis. The claim (alternative hypothesis) 
#   that the average proportion of students who can find Canada on 
#   the map is greater than 85% is not supported by these data. In 
#   fact, we are 95% confident that the average proportion of students 
#   who can locate Canada on the map is between 50.9% and 80.4%.
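
#   Bootstrap results change from run to run, so the p-value and
#   interval above will not reproduce exactly. A minimal sketch of a
#   repeatable version, assuming propCorrect as defined above; the
#   seed value and the name bootMeans are arbitrary choices made here.

set.seed(200)
bootMeans = replicate(1e4, mean(sample(propCorrect, replace=TRUE)))
mean(bootMeans <= 0.85)
quantile(bootMeans, c(0.025,0.975))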


### NOTE: Some may argue that Part III could also use a Binomial test. Here
#   are the results from that analysis:

totCorrect  = sum(correct)
totStudents = sum(classSize)

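# One-sided test of the claim that the proportion exceeds 0.85
binom.test(totCorrect,totStudents, p=0.85, alternative="greater")
# Two-sided call, used here only for its 95% confidence interval
binom.test(totCorrect,totStudents)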

#   Because the p-value of 1.0000 is not less than alpha, we cannot 
#   reject the null hypothesis. The claim (alternative hypothesis) 
#   that more than 85% of the students can find Canada on the map is 
#   not supported by these data. In fact, we are 95% confident that 
#   the proportion of students who can locate Canada on the map is 
#   between 61.9% and 70.5%.
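
#   The one-sided p-value and the interval from binom.test() can be
#   checked by hand. A minimal sketch, assuming totCorrect and
#   totStudents as defined above: the p-value is the upper-tail
#   Binomial probability P(X >= totCorrect) under p = 0.85, and the
#   interval is the Clopper-Pearson (exact) interval that binom.test()
#   reports, computed here from Beta quantiles.

pbinom(totCorrect-1, totStudents, 0.85, lower.tail=FALSE)
c( qbeta(0.025, totCorrect,   totStudents-totCorrect+1),
   qbeta(0.975, totCorrect+1, totStudents-totCorrect) )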

### NOTE: The results are substantively the same. This is not surprising.
#   When multiple procedures are possible, the results should be quite
#   similar. Here, both procedures were appropriate. The Binomial Test,
#   however, was the better of the two because our hypothesis was about
#   a population proportion.
#
#   A discussion of the Binomial test comes in SCA-12.