########################################
#
# Script: 20 January 2011 (20110120.R)
#
########################################

# Today:
#
# ** Statistical analysis of some data
#    This includes:
#     - selecting the appropriate test
#     - testing the assumptions
#     - coming to the correct conclusion
#

#####
# The Biome2 dataset:
#
# This will give practice in comparing two means appropriately when there are
# more than two groups in the dataset. This last part is important.

# Standard preamble to working with the data
fire <- read.csv("http://courses.kvasaheim.com/stat40x3/data/biome2.csv", header=TRUE)

names(fire)     # Variable names
summary(fire)   # Statistics on the variables

str <- fire$mfri[fire$biome=="STR"]   # Create the variable containing mfri for the STR biome
xer <- fire$mfri[fire$biome=="XER"]   # Create the variable containing mfri for the XER biome
# (Caution: the name 'str' masks R's built-in str() function for this session.)

# Note: Brackets are a 'selection' operator. They select only those values of the variable
# that match the condition. There are other such operators; I will introduce them in
# the future. ALSO note: We use two equal signs to indicate logical equality.

boxplot(str,xer)

# Let us first look at the boxplot comparing the two groups so we have some idea of the
# result. That way, if the calculated results are far from our expectations, we know we
# did something wrong. This is a CHECK on our scripting. Checks are ALWAYS good.

# Here is a better boxplot than the previous one. Play with it to change certain aspects
# of the boxplot. GRAPHING is one of the most important things you can do as a researcher;
# it tells a story about the data (and relationships) better than numbers alone.

boxplot(str,xer,
  las=1,
  xlab="Biome Type",
  ylab="Mean Fire Return Interval",
  names=c("STR","XER")
)

# Why is this last boxplot much better than the first?
# What do each of these parameters in the boxplot function do?

# When we need to compare the means of two groups, we tend to think "t-test".
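# A supplementary sketch (my addition, not part of the original lesson): before testing,
# it can help to see the actual group means alongside the boxplots. Base R's tapply()
# applies a function to one variable within each level of a grouping variable:

tapply(fire$mfri, fire$biome, mean)   # Mean mfri for every biome in the dataset
mean(str); mean(xer)                  # The same two means, from our extracted vectors

# If the test result later disagrees wildly with these numbers, something is wrong
# with the script --- another cheap CHECK.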
# So, let us first check the assumption of the t-test --- Normality

shapiro.test(str)   # Not un-Normal
shapiro.test(xer)   # Not un-Normal

# So, the assumption is not violated. Yay!

# So, which t-test should we use? The one that assumes equal variances or the one that
# does not? To determine this, we check the variances.

var.test(str,xer)   # The variances are statistically different

t.test(str,xer, var.equal=FALSE)   # So, we use this (unequal-variance) t-test

# Thus, we can conclude (at the alpha=0.05 level) that the two biomes have statistically
# different mean fire return times (t = 3.4736; df = 5.003; p = 0.01776).
# Note the wording of the conclusion.

########################################
# Note the steps we followed above:
#   1. Get the data in a usable form
#   2. Graph the data to get an idea of what the conclusions will be
#   3. Decide on a test and check the assumptions
#   4. Perform the test
#   5. Make the appropriate conclusion.
########################################

# Let us do the same, but with a dataset with a different structure

# Standard preamble
sri <- read.csv("http://courses.kvasaheim.com/stat40x3/data/sri2010pres.csv", header=TRUE)
names(sri)
summary(sri)

# Do some fun comparisons
boxplot(sri$RAJAPAKSE, sri$FONSEKA)
boxplot(sri$RAJAPAKSE, sri$REJECTED)
boxplot(sri$REJECTED)

# Note that comparing vote totals is meaningless. We need to compare the proportion of
# the vote cast (not the absolute number of votes cast). This is calculated as

pRpk <- sri$RAJAPAKSE/sri$TOTAL   # Proportion of votes cast for Rajapakse
pRej <- sri$REJECTED/sri$TOTAL    # Proportion of votes rejected as invalid

# Here is our research question: Is the proportion of rejected votes different in the
# provinces won by Rajapakse than in the provinces lost by him?
#
# Note that we need to compare means in two groups. Also note that the dataset contains
# a variable specifying which provinces were won by Rajapakse.
# Thus, we do not have to create new variables to delineate the two groups --
# sri$WONPROVINCE already does that for us.

# So, to test Normality in the two groups:
shapiro.test(pRej[sri$WONPROVINCE==0])   # Pass
shapiro.test(pRej[sri$WONPROVINCE==1])   # Fail

# As the second group fails the Normality test, we cannot use the t-test.
# ...interesting that only in those provinces he won did it fail...

# Since we cannot perform the t-test, we will use a non-parametric test.
# To compare the two groups without the Normality assumption, we use the
# Mann-Whitney test:

wilcox.test(pRej[sri$WONPROVINCE==0], pRej[sri$WONPROVINCE==1])

# This is how you have seen such tests before. However, there is a shortcut we can use
# (and should get used to): use a formula. A formula specifies the response variable
# (measure) and the explanatory variable (grouping) separated by a tilde ~ as such:

wilcox.test(pRej~sri$WONPROVINCE)

# For us, right now, formula constructs can be used whenever we have a grouping variable
# with only two groups (and when we are using certain functions). When we start testing
# more than two group means, we will relax this requirement.

# Sometimes, if the function has a 'data' parameter, we can clarify the statement even more:
wilcox.test(pRej~WONPROVINCE, data=sri)

# However, not all functions have a 'data' parameter. Fortunately, boxplot() does:
boxplot(pRej~WONPROVINCE, data=sri)

# which can be prettied-up to give:
boxplot(pRej~WONPROVINCE, data=sri,
  names=c("Lost","Won"),
  las=1,
  ylim=c(0,0.05),
  xlab="Province performance by Rajapakse",
  ylab="Proportion of the vote rejected"
)

# Thus, we can reject the null hypothesis at the alpha=0.05 level and conclude that
# the proportion of rejected ballots in provinces won by Rajapakse is statistically
# smaller than in those lost by Rajapakse (W = 4479; p < 0.0001).

# Again, notice the steps we went through (and had options with).
# Use formulas when you have a grouping variable with two groups and when the
# function allows for it.
# Otherwise, you will need to create variables to use.

# Start learning how to make your graphics look better. By "look better" I mean that they
# need to convey your point much more easily and coherently.

# Practice makes perfect.
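# One more supplementary sketch (my addition, not part of the original lesson): when a
# function lacks a 'data' parameter, base R's with() evaluates a call inside the data
# frame, which serves much the same purpose as data= does for formulas:

with(sri, wilcox.test((REJECTED/TOTAL) ~ WONPROVINCE))   # Same Mann-Whitney test as above

# Here the columns REJECTED, TOTAL, and WONPROVINCE are looked up inside sri, so no
# standalone variables need to be created first.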