####################
#
# Filename: 20110913.R
#
####################
#
# Today: Two instances of the use of Monte Carlo
# techniques. The first instance (Chapter 2) tests
# the validity of a statistical test. The second
# instance determines the distribution of a test
# statistic that you have already defined.

##################################################
# Testing a test

# Note: For a test to be appropriate, the p-values
# must be Uniformly distributed between 0 and 1.

#####
# I: Test the t.test to see if it is appropriate when
# the sample size is 30 and the measurements are
# Normally distributed. (It should be.)

# 1. Initialize
set.seed(577)
p <- numeric()
trials <- 10000
n <- 30

# 2. Loop
for(i in 1:trials) {
  x <- rnorm(n, mean=0, sd=1)         # Core (Normal)
  p[i] <- t.test(x, mu=0)$p.value     # Core (the test)
}

# 3. Display results
hist(p)

# Note that the histogram has no apparent increasing
# or decreasing trend. It may not be perfectly flat, but
# the variation looks entirely random. Increasing the
# number of trials will smooth this out further.
# However, increasing the number of trials will also
# increase the time it takes to run this experiment.
#
# NOW, what if the Normality assumption is violated?
# Can we still use the t-test when the sample size
# is 30? Let us find out. We will only change the
# core of the above script.

# 1. Initialize
set.seed(577)
p <- numeric()
trials <- 10000
n <- 30

# 2. Loop
for(i in 1:trials) {
  x <- rexp(n, rate=1)                # Core (Exponential)
  p[i] <- t.test(x, mu=1)$p.value     # Core (the test)
}

# 3. Display results
hist(p)

# Note that the first bar in the histogram is MUCH
# larger than we would expect if the test were
# appropriate. As such, the t-test is inappropriate
# in this circumstance.
#
# BUT, what if the sample size were n=100? Could we
# use the t-test then? Well, let us find out:

# 1. Initialize
set.seed(577)
p <- numeric()
trials <- 10000
n <- 100

# 2. Loop
for(i in 1:trials) {
  x <- rexp(n, rate=1)                # Core (Exponential)
  p[i] <- t.test(x, mu=1)$p.value     # Core (the test)
}

# 3. Display results
hist(p)

# What changed in the script?

# What were the results? Is that first bar about the
# same height as the other bars? If it is difficult to
# tell, then we can use the Kolmogorov-Smirnov test,
# which tests whether a sample follows a given
# distribution -- here, the Uniform(0,1):

ks.test(p, punif)  # Null hypothesis: p follows the Uniform(0,1) distribution

# So, is a sample size of n=100 large enough to allow
# us to use the t-test appropriately? The K-S test says yes.

# BUT, what if your data are Uniformly distributed between
# 0 and 100? Can you use the t-test when your sample size
# is n=30? Let us find out.

# 1. Initialize
set.seed(577)
p <- numeric()
trials <- 10000
n <- 30

# 2. Loop
for(i in 1:trials) {
  x <- runif(n, min=0, max=100)       # Core (Uniform)
  p[i] <- t.test(x, mu=50)$p.value    # Core (the test)
}

# 3. Display results
hist(p)
ks.test(p, punif)

# What changed in the script? Why did those lines have to change?
# What are the results? Can we use the t-test under these
# circumstances?
# Ans: Apparently. (How do I know?)

##################################################
# Determining the distribution of your test statistic (TS)

# So, you actually know how your data should be distributed
# under the null hypothesis (rare, but it happens). It is
# probably distributed according to the Binomial distribution,
# which takes two parameters: size (n) and probability (pi).
# There is no ready-made test statistic for testing the equality
# of two pi values. But, you can create your own test statistic
# (see guidelines in the notes). Once you have your test
# statistic, in order to draw conclusions from it, you will
# have to figure out its distribution.

# Monte Carlo can help you estimate its distribution. To wit:

# Let us suppose we have 10 years of hurricane data for both
# Alabama and Mississippi. During those 10 years, a hurricane
# made landfall on the Alabama coast in four of the years;
# on the Mississippi coast, in two of the years.

# Thus, if our null hypothesis is correct (equal probability),
# then that probability is most likely pi=0.3. (Why?)

# Let our test statistic be the difference in the number of years
# with a hurricane landfall between Alabama and Mississippi:
#   TS = Xa - Xm
# For our data, TS=2. Is this large enough to reject our
# null hypothesis? The only way to know is to know the
# distribution of TS.

# To do that, we use Monte Carlo:

# 1. Initialize
set.seed(577)
TS <- numeric()
trials <- 10000
years <- 10
pi <- 0.30  # = 6/20

# 2. Loop
for(i in 1:trials) {
  Xa <- rbinom(1, size=years, prob=pi)  # Core (Alabama dist)
  Xm <- rbinom(1, size=years, prob=pi)  # Core (Miss dist)
  TS[i] <- Xa - Xm                      # Core (the test)
}

# 3. Display results
hist(TS)                               # Histogram
quantile(TS, c(0.025, 0.975))          # Central 95% interval of the null distribution
length(which(TS > 2)) / trials * 2     # p-value (two-sided)

# Note the differences between this and the previous Monte
# Carlo experiments. Also (more importantly) note the
# similarities.

# So, do we reject the null hypothesis and conclude that
# hurricanes hit Alabama at a significantly different rate
# than they hit Mississippi?
# Is reality (TS=2) within the central 95% interval?
# Is our p-value greater than our alpha?
# What does this mean?

# Conclusion:
# The observed difference in hurricane landfalls, 2, is not
# large enough to be significant. Thus, at the alpha=0.05 level,
# we fail to reject the null hypothesis and conclude that
# hurricanes do not make landfall in Alabama at a significantly
# different rate than they do in Mississippi (p = 0.2154).

##################################################
# Extension

# What if you were comparing Alabama landfalls and Florida
# landfalls? What would change? What would you have to do to
# account for the fact that the Florida coastline is much
# larger than Alabama's?
# Florida's coastline is 1350 miles.
# Alabama's coastline is 100 miles.

# Let us suppose that you ignored this (which may be appropriate,
# depending on your research question/hypothesis).

# In the last decade, a hurricane made direct landfall in Florida
# in six of the 10 years; in Alabama, in one.
# H0: The probability of a hurricane making landfall in Florida
#     is the same as that of one making landfall in Alabama.
# (I got the numbers from NOAA. They cover 2000-2010.)

# 1. Define our test statistic:
#   TS = Xa - Xf
# Here, our real TS = 1 - 6 = -5. Under the null, the
# distribution of TS is symmetric about zero, so we may
# work with |TS| = 5.

# 2. Determine the distribution of TS (under the null):
set.seed(577)
TS <- numeric()
trials <- 10000
years <- 10
pi <- 0.35  # = 7/20

for(i in 1:trials) {
  Xa <- rbinom(1, size=years, prob=pi)  # Core (Alabama dist)
  Xf <- rbinom(1, size=years, prob=pi)  # Core (Florida dist)
  TS[i] <- Xa - Xf                      # Core (the test)
}

hist(TS)                               # Histogram
quantile(TS, c(0.025, 0.975))          # Central 95% interval of the null distribution
length(which(TS > 5)) / trials * 2     # p-value (two-sided, using |TS| = 5)

# What is our conclusion?
# The difference in hurricane landfalls between Florida and
# Alabama, five, is large enough that we can reject the null
# hypothesis at the alpha=0.05 level and conclude that hurricanes
# make landfall in Florida at a significantly different (here,
# higher) rate than in Alabama (p = 0.0076).
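Because Xa and Xf are independent Binomial(years, pi) variables, the null distribution of TS = Xa - Xf can also be computed exactly by convolution, which makes a useful cross-check on the Monte Carlo estimate. This is a sketch, not part of the original notes; the helper name `exact.TS` is ours.

```r
# Exact null distribution of TS = Xa - Xf, with Xa, Xf independent
# Binomial(years, pi). A cross-check on the Monte Carlo estimate;
# the function name exact.TS is an invented helper, not from the notes.
exact.TS <- function(years, pi) {
  px <- dbinom(0:years, size=years, prob=pi)  # P(X = k), k = 0..years
  d  <- outer(0:years, 0:years, "-")          # every possible difference
  pj <- outer(px, px)                         # joint probability of each pair
  tapply(pj, d, sum)                          # P(TS = t), named by t
}

dist <- exact.TS(10, 0.35)
sum(dist[as.numeric(names(dist)) > 5]) * 2    # exact analogue of the MC p-value
```

The exact two-sided value should land close to the simulated p = 0.0076 above; a large gap between the two would suggest that `trials` is too small.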
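As for accounting for the coastline difference, one possible modeling choice (an assumption of this sketch, not something the notes prescribe) is to take H0 to be "equal landfall rate per mile of coastline": make each state's annual landfall probability proportional to its coastline length, estimating the common per-mile rate by pooling the observed landfalls.

```r
# Sketch of a coastline-adjusted null: probability of a landfall year
# is assumed proportional to coastline length (our assumption).
len.a <- 100                                # miles (from the notes)
len.f <- 1350                               # miles (from the notes)
rate  <- (1 + 6) / (10 * (len.a + len.f))   # pooled per-mile, per-year rate
pi.a  <- rate * len.a                       # approx annual landfall probabilities
pi.f  <- rate * len.f

set.seed(577)
TS <- numeric()
trials <- 10000
years <- 10
for(i in 1:trials) {
  Xa <- rbinom(1, size=years, prob=pi.a)    # Core (Alabama dist)
  Xf <- rbinom(1, size=years, prob=pi.f)    # Core (Florida dist)
  TS[i] <- Xa - Xf                          # Core (the test)
}

hist(TS)
quantile(TS, c(0.025, 0.975))  # is the observed TS = 1 - 6 = -5 inside?
```

Under this adjusted null, TS is centered well below zero, so the observed TS = -5 is checked against the central 95% interval rather than against symmetry about zero; if it falls inside, the coastline-adjusted null is not rejected.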