########################################
#
# Script: 20 January 2011 (20110120.R)
#
########################################

# Today:
#
# ** Statistical analysis of some data
#    This includes:
#     - selecting the appropriate test
#     - testing the assumptions
#     - coming to the correct conclusion
#

#####
# The Biome2 dataset:
#
# This will give practice in comparing two means appropriately when there are
# more than two groups in the dataset. This last part is important.

# Standard preamble to working with the data
fire <- read.csv("http://courses.kvasaheim.com/stat40x3/data/biome2.csv", header=TRUE)

names(fire)     # Variable names
summary(fire)   # Statistics on the variables

str <- fire$mfri[fire$biome=="STR"]   # Create the variable containing mfri for the STR biome
xer <- fire$mfri[fire$biome=="XER"]   # Create the variable containing mfri for the XER biome
# (Caution: the name 'str' masks R's built-in str() function for this session.)

# Note: Brackets are a 'selection' operator. They select only those values of the variable
# that match the condition. There are other such operators; I will introduce them in
# the future. ALSO note: We use two equal signs to indicate logical equality.

boxplot(str,xer)

# Let us first look at the boxplot comparing the two groups so we have some idea of the
# result. That way, if the calculated results are far from our expectations, we know we
# did something wrong. This is a CHECK on our scripting. Checks are ALWAYS good.

# Here is a better boxplot than the previous one. Play with it to change certain aspects
# of the boxplot. GRAPHING is one of the most important things you can do as a researcher;
# it tells a story about the data (and relationships) better than numbers alone.

boxplot(str,xer,
  las=1,
  xlab="Biome Type",
  ylab="Mean Fire Return Interval",
  names=c("STR","XER")
)

# Why is this last boxplot much better than the first?
# What do each of these parameters in the boxplot function do?

# When we need to compare the means of two groups, we tend to think "t-test".
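# A supplementary sketch (my addition, not part of the original lesson): before testing,
# it can help to see the actual group means alongside the boxplots. Base R's tapply()
# applies a function to one variable within each level of a grouping variable:

tapply(fire$mfri, fire$biome, mean)   # Mean mfri for every biome in the dataset
mean(str); mean(xer)                  # The same two means, from our extracted vectors

# If the test result later disagrees wildly with these numbers, something is wrong
# with the script --- another cheap CHECK.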
# So, let us first check the assumption of the t-test --- Normality

shapiro.test(str)   # Not un-Normal
shapiro.test(xer)   # Not un-Normal

# So, the assumption is not violated. Yay!

# So, which t-test should we use? The one that assumes equal variances or the one that
# does not? To determine this, we check the variances.

var.test(str,xer)   # The variances are statistically different

t.test(str,xer, var.equal=FALSE)   # So, we use this (unequal-variance) t-test

# Thus, we can conclude (at the alpha=0.05 level) that the two biomes have statistically
# different mean fire return times (t = 3.4736; df = 5.003; p = 0.01776).
# Note the wording of the conclusion.

########################################
# Note the steps we followed above:
#   1. Get the data in a usable form
#   2. Graph the data to get an idea of what the conclusions will be
#   3. Decide on a test and check the assumptions
#   4. Perform the test
#   5. Make the appropriate conclusion.
########################################

# Let us do the same, but with a dataset with a different structure

# Standard preamble
sri <- read.csv("http://courses.kvasaheim.com/stat40x3/data/sri2010pres.csv", header=TRUE)
names(sri)
summary(sri)

# Do some fun comparisons
boxplot(sri$RAJAPAKSE, sri$FONSEKA)
boxplot(sri$RAJAPAKSE, sri$REJECTED)
boxplot(sri$REJECTED)

# Note that comparing vote totals is meaningless. We need to compare the proportion of
# the vote cast (not the absolute number of votes cast). This is calculated as

pRpk <- sri$RAJAPAKSE/sri$TOTAL   # Proportion of votes cast for Rajapakse
pRej <- sri$REJECTED/sri$TOTAL    # Proportion of votes rejected as invalid

# Here is our research question: Is the proportion of rejected votes different in the
# provinces won by Rajapakse than in the provinces lost by him?
#
# Note that we need to compare means in two groups. Also note that the dataset contains
# a variable specifying which provinces were won by Rajapakse.
# Thus, we do not have to create new variables to delineate the two groups --
# sri$WONPROVINCE already does that for us.

# So, to test Normality in the two groups:
shapiro.test(pRej[sri$WONPROVINCE==0])   # Pass
shapiro.test(pRej[sri$WONPROVINCE==1])   # Fail

# As the second group fails the Normality test, we cannot use the t-test.
# ...interesting that only in those provinces he won did it fail...

# Since we cannot perform the t-test, we will use a non-parametric test.
# To compare the two groups without the Normality assumption, we use the
# Mann-Whitney test:

wilcox.test(pRej[sri$WONPROVINCE==0], pRej[sri$WONPROVINCE==1])

# This is how you have seen such tests before. However, there is a shortcut we can use
# (and should get used to): use a formula. A formula specifies the response variable
# (measure) and the explanatory variable (grouping) separated by a tilde ~ as such:

wilcox.test(pRej~sri$WONPROVINCE)

# For us, right now, formula constructs can be used whenever we have a grouping variable
# with only two groups (and when we are using certain functions). When we start testing
# more than two group means, we will relax this requirement.

# Sometimes, if the function has a 'data' parameter, we can clarify the statement even more:
wilcox.test(pRej~WONPROVINCE, data=sri)

# However, not all functions have a 'data' parameter. Fortunately, boxplot() does:
boxplot(pRej~WONPROVINCE, data=sri)

# which can be prettied-up to give:
boxplot(pRej~WONPROVINCE, data=sri,
  names=c("Lost","Won"),
  las=1,
  ylim=c(0,0.05),
  xlab="Province performance by Rajapakse",
  ylab="Proportion of the vote rejected"
)

# Thus, we can reject the null hypothesis at the alpha=0.05 level and conclude that
# the proportion of rejected ballots in provinces won by Rajapakse is statistically
# smaller than in those lost by Rajapakse (W = 4479; p < 0.0001).

# Again, notice the steps we went through (and had options with).
# Use formulas when you have a grouping variable with two groups and when the
# function allows for it.
# Otherwise, you will need to create variables to use.

# Start learning how to make your graphics look better. By "look better" I mean that they
# need to convey your point much more easily and coherently.

# Practice makes perfect.
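# One more supplementary sketch (my addition, not part of the original lesson): when a
# function lacks a 'data' parameter, base R's with() evaluates a call inside the data
# frame, which serves much the same purpose as data= does for formulas:

with(sri, wilcox.test((REJECTED/TOTAL) ~ WONPROVINCE))   # Same Mann-Whitney test as above

# Here the columns REJECTED, TOTAL, and WONPROVINCE are looked up inside sri, so no
# standalone variables need to be created first.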