####################
#
# Filename: 20110913.R
#
####################
#
# Today: Two instances of the use of Monte Carlo
# techniques. The first instance (Chapter 2) tests
# the validity of a statistical test. The second
# instance determines the distribution of a test
# statistic that you have already defined.
##################################################
# Testing a test
# Note: For a test to be appropriate, the p-values
# must be Uniformly distributed between 0 and 1.
#####
# I: Test the t.test to see if it is appropriate when
# the sample size is 30 and the measurements are
# Normally distributed. (It should be.)
# 1. Initialize
set.seed(577)
p <- numeric()
trials <- 10000
n <- 30
# 2. Loop
for(i in 1:trials) {
  x <- rnorm(n, mean=0, sd=1) # Core (Normal)
  p[i] <- t.test(x, mu=0)$p.value # Core (the test)
}
# 3. Display results
hist(p)
# Note that the histogram has no apparent increasing
# or decreasing trend. It may not be perfectly flat,
# but the variation looks random. Increasing the
# number of trials will smooth it out further.
# However, increasing the number of trials will also
# increase the time it takes to run this test.
#
# NOW, what if the Normality assumption is violated?
# Can we still use the t-test when the sample size
# is only 30? Let us find out. We will change only
# the core of the above script.
# 1. Initialize
set.seed(577)
p <- numeric()
trials <- 10000
n <- 30
# 2. Loop
for(i in 1:trials) {
  x <- rexp(n, rate=1) # Core (Exponential)
  p[i] <- t.test(x, mu=1)$p.value # Core (the test)
}
# 3. Display results
hist(p)
# Note that the first bar in the histogram is MUCH
# larger than we would expect if the test were
# appropriate. As such, the t-test is inappropriate
# in this circumstance.
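# Aside (not part of the original script): that oversized first bar can
# be summarized by the empirical Type I error rate, i.e. the proportion
# of p-values falling below alpha = 0.05. For an appropriate test this
# proportion should be close to 0.05; here it is inflated.
set.seed(577)
p.chk <- replicate(10000, t.test(rexp(30, rate=1), mu=1)$p.value)
mean(p.chk < 0.05) # Empirical size; compare with the nominal 0.05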
#
# BUT, what if the sample size were n=100? Could we
# use the t-test then? Well, let us find out:
# 1. Initialize
set.seed(577)
p <- numeric()
trials <- 10000
n <- 100
# 2. Loop
for(i in 1:trials) {
  x <- rexp(n, rate=1) # Core (Exponential)
  p[i] <- t.test(x, mu=1)$p.value # Core (the test)
}
# 3. Display results
hist(p)
# What changed in the script?
# What were the results? Is that first bar about the
# same height as the other bars? If it is difficult to
# tell, then we can use the Kolmogorov-Smirnov test,
# which tests equality of two distributions:
ks.test(p,punif) # Null hypothesis: distributions are equal
# So, is a sample size of n=100 large enough to allow
# us to use the t-test appropriately? The K-S test says yes.
# BUT, what if your data is Uniformly distributed between
# 0 and 100? Can you use the t-test when your sample size
# is n=30? Let us find out.
# 1. Initialize
set.seed(577)
p <- numeric()
trials <- 10000
n <- 30
# 2. Loop
for(i in 1:trials) {
  x <- runif(n, min=0, max=100) # Core (Uniform)
  p[i] <- t.test(x, mu=50)$p.value # Core (the test)
}
# 3. Display results
hist(p)
ks.test(p,punif)
# What changed in the script? Why did they have to change?
# What are the results? Can we use the t-test under these
# circumstances?
# Ans: Apparently. (How do I know?)
##################################################
# Determining the distribution of your test statistic (TS)
# So, you actually know how your data should be distributed
# under the null hypothesis (rare, but it happens). It is
# probably distributed according to the Binomial distribution,
# which takes two parameters: size (n) and probability (pi).
# We have no ready-made test statistic here for testing the equality
# of two pi values. But, you can create your own test statistic (see
# guidelines in the notes). Once you have your test statistic,
# in order to draw conclusions from it, you will have to figure
# out its distribution.
# Monte Carlo can help you estimate its distribution. To wit:
# Let us suppose we have 10 years of hurricane data for both
# Alabama and Mississippi. During those 10 years, a hurricane
# made landfall on the Alabama coast in four of the years;
# on the Mississippi coast, two of the years.
# Thus, if our null hypothesis is correct (equal probability),
# then that probability is most likely pi=0.3. (Why?)
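# Aside (not part of the original script): the pi=0.3 above is the
# pooled estimate. Under the null hypothesis both states share one
# landfall probability, so we pool the 4 + 2 landfall years over the
# 10 + 10 state-years:
(4 + 2) / (10 + 10) # Pooled estimate of pi = 0.3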
# Let our test statistic be the difference in the number of years
# with a hurricane landfall between Alabama and Mississippi:
# TS = Xa - Xm
# For our data, TS=2. Is this large enough to reject our
# null hypothesis? The only way to know is to know the
# distribution of TS.
# To do that, we use Monte Carlo:
# 1. Initialize
set.seed(577)
TS <- numeric()
trials <- 10000
years <- 10
pi <- 0.30 # Pooled estimate (note: this masks R's built-in constant pi)
# 2. Loop
for(i in 1:trials) {
  Xa <- rbinom(1, size=years, prob=pi) # Core (Alabama dist)
  Xm <- rbinom(1, size=years, prob=pi) # Core (Miss dist)
  TS[i] <- Xa - Xm # Core (the test statistic)
}
# 3. Display results
hist(TS) # Histogram of the null distribution
quantile(TS, c(0.025,0.975)) # Central 95% interval of the null distribution
length( which(TS>2) )/trials * 2 # Two-sided p-value (doubled upper tail)
# Note the differences between this and the previous Monte
# Carlo experiments. Also (more importantly) note the
# similarities.
# So, do we reject the null hypothesis and conclude that
# hurricanes hit Alabama at a significantly different rate
# than they hit Mississippi?
# Is reality (TS=2) within that central 95% interval?
# Is our p-value greater than our alpha?
# What does this mean?
# Conclusion:
# The difference in hurricane landfalls, 2, is not large enough to be
# statistically significant. Thus, at the alpha=0.05 level, we fail to
# reject the null hypothesis and conclude that hurricanes do not make
# landfall in Alabama at a significantly different rate than
# they do in Mississippi (p = 0.2154).
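# Aside (not part of the original script): because Xa and Xm are
# independent Binomial(size=10, prob=0.3) under the null hypothesis,
# the null distribution of TS = Xa - Xm can also be computed exactly,
# which gives a check on the Monte Carlo p-value above.
probs <- dbinom(0:10, size=10, prob=0.3) # P(X = k) for k = 0..10
joint <- outer(probs, probs) # P(Xa = i, Xm = j), independence
diffs <- outer(0:10, 0:10, "-") # All possible values of Xa - Xm
sum( joint[diffs > 2] ) * 2 # Exact two-sided p-value, approx 0.22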
##################################################
# Extension
# What if you were comparing Alabama landfalls and Florida landfalls?
# What would change? What would you have to do to account for the
# fact that the Florida coastline is much larger than Alabama's?
# Florida's coastline is 1350 miles.
# Alabama's coastline is 100 miles.
# Let us suppose that you ignored this (which may be appropriate,
# depending on your research question/hypothesis).
# In the last decade, a hurricane made direct landfall in Florida
# in six of the 10 years; in Alabama, one.
# H0: The probability of a hurricane making landfall in Florida
# is the same as that of one making landfall in Alabama.
# (I got the numbers from NOAA. They cover 2000-2010.)
# 1. Define our test statistic: TS = Xa - Xf
# Here, our observed TS = 1 - 6 = -5. Because the null distribution
# of TS is symmetric about zero, we can work with |TS| = 5.
# 2. Determine the distribution of TS (under the null):
set.seed(577)
TS <- numeric()
trials <- 10000
years <- 10
pi <- 0.35 # = 7/20
for(i in 1:trials) {
  Xa <- rbinom(1, size=years, prob=pi) # Core (Alabama dist)
  Xf <- rbinom(1, size=years, prob=pi) # Core (Florida dist)
  TS[i] <- Xa - Xf # Core (the test statistic)
}
hist(TS) # Histogram of the null distribution
quantile(TS, c(0.025,0.975)) # Central 95% interval of the null distribution
length( which(TS>5) )/trials * 2 # Two-sided p-value (doubled tail, by symmetry)
# What is our conclusion?
# The difference in hurricane landfalls between Florida and Alabama,
# five, is large enough to reject the null hypothesis at the
# alpha=0.05 level. We conclude that hurricanes make landfall in
# Florida at a significantly different (higher) rate than they do in
# Alabama (p = 0.0076).