##### Demonstration Script #0b
##### MATH322
#####
##### Testing tests and exploring statistical tests
#####

### Refresher: The Chi-Square distribution with 5 degrees of freedom
x = seq(0,20, length=1000)     # The x-values
y = dchisq(x, df=5)            # The density
plot(x,y, type="l")            # That is a lowercase L, not a 1.


### See how bad the approximation is
# From the notes, we discovered that the data-generating
# process is Binomial, but the Chi-Square Goodness-of-Fit
# test requires it be Poisson. Thus, it is mathematically
# incorrect.
# BUT, how bad is this simplification?
#
# This is another example of using Monte Carlo
# simulation to learn about statistics.

## Roll the fair 3-sided die 300 times (one experiment)
# This is the number of 1s that come up
Obs = rbinom(1, size=300, prob=1/3)

# This is the Chi-Square test statistic for that one experiment
X2 = (Obs-100)^2/100

# You cannot know the distribution of that test statistic with
# just one value, so we need to do this many, many times...
# perhaps a million?
Obs = rbinom(1e6, size=300, prob=1/3)
X2  = (Obs-100)^2/100

# Now, Obs holds the number of 1s in 1 million experiments,
# and X2 holds the test statistic for each of those million
# experiments. From that, we can see the distribution.

## Here it is in terms of the pdf
hist(X2, freq=FALSE)

# Here is a better version
hist(X2, freq=FALSE, breaks=51)

# Now, let us superimpose the Chi-Square distribution
x = seq(0,20, length=1000)
y = dchisq(x, df=1)            # Note the prefix is d for "density," the pdf f(x)
lines(x,y, col="blue")

# Wow! That looks pretty close!!! So, even if the Chi-Square
# Goodness-of-Fit test is mathematically wrong, it is close
# enough to justify the simplification.

## Now, let us look at this in terms of the CDF
plot(ecdf(X2))

x = seq(0,20, length=1000)
y = pchisq(x, df=1)            # Note the prefix is now p, for "probability," P[X <= x]
lines(x,y, col="red")

# Again, the observed (dots) and hypothesized (red line) seem "close,"
# thus justifying the use of the simplification.
# (Two extra sketches at the end of this script put numbers on "close.")

## Maybe another distribution works better?
# Try the standard Normal distribution
y = pnorm(x)
lines(x,y, col="blue")

# Try the Poisson distribution
y = ppois(x, lambda=1)
lines(x,y, col="magenta")


##### ##### #####

### Yesterday, I left you with the question: What do we mean
# when we say the distribution of the sample mean "converges"
# to the N(mu, sigma^2/n) distribution?

### We can mean: The largest vertical distance between the two
# CDFs goes to zero.
# (An extra sketch at the end of this script computes that distance.)

## To see this: as df -> Inf, the t distribution -> the standard Normal

# In terms of the pdf
x  = seq(-5,5, length=1e4)
yT = dnorm(x)
plot(x,yT, type="l", col="blue")
for(i in 1:50) {
  y = dt(x, df=i)
  lines(x,y, col=grey(i/50))
  Sys.sleep(0.25)              # Forces the system to wait 0.25 seconds; allows easy animation
}
lines(x,yT, col="blue")

# In terms of the CDF
x  = seq(-5,5, length=1e4)
yT = pnorm(x)
plot(x,yT, type="l", col="blue")
for(i in 1:50) {
  y = pt(x, df=i)              ## Note the p prefix
  lines(x,y, col=grey(i/50))
  Sys.sleep(0.25)
}
lines(x,yT, col="blue")

### We can mean other things. When we get to Chapters 6 and 7, we will
# see that our choice has consequences if we want to test if two
# distributions are "close enough" to be indistinguishable.
#
# But, I leave that for another day.
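
##### ##### #####

### Extra sketch: putting a number on "largest distance"
# The animations above let us watch the t CDF approach the Normal
# CDF. Here is a sketch of one way to measure that: for each df,
# compute the largest vertical gap between the two CDFs over a grid
# of x-values. (The grid is an arbitrary choice I am assuming here;
# a wider and finer grid approximates the true largest distance
# better.)
x = seq(-10, 10, length=1e5)
D = sapply(1:50, function(i) max( abs( pt(x, df=i) - pnorm(x) ) ))
plot(1:50, D, type="b",
     xlab="Degrees of freedom", ylab="Largest distance between CDFs")
# The distances shrink toward zero as df grows, which is exactly what
# "converges" means under this choice of distance.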
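
### Extra sketch: a numeric check of the Chi-Square simplification
# The histogram and ecdf comparisons above rely on eyeballing. A
# sketch of one numeric check: if X2 really followed the
# Chi-Square(1) distribution, it would exceed the 0.95 quantile of
# that distribution in about 5% of experiments. The gap between the
# simulated proportion and the nominal 0.05 puts a number on "how
# bad" the simplification is.
Obs = rbinom(1e6, size=300, prob=1/3)    # Rerun the million experiments
X2  = (Obs-100)^2/100
mean( X2 > qchisq(0.95, df=1) )          # Compare this to the nominal 0.05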
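
### Extra sketch: superimposing the exact distribution
# Since the data-generating process is Binomial, the exact CDF of X2
# can be written down: X2 <= t exactly when Obs is within 10*sqrt(t)
# of 100. This sketch overlays that exact CDF (green) on the ecdf;
# it assumes X2 from the rerun just above is still in memory.
plot(ecdf(X2))
tvals  = seq(0, 20, length=1000)
Fexact = pbinom( floor(100 + 10*sqrt(tvals)),       size=300, prob=1/3 ) -
         pbinom( ceiling(100 - 10*sqrt(tvals)) - 1, size=300, prob=1/3 )
lines(tvals, Fexact, col="darkgreen")
lines(tvals, pchisq(tvals, df=1), col="red")   # The hypothesized Chi-Square(1) CDF
# The ecdf and the exact CDF should sit on top of each other; the gap
# between them and the red curve is the error the simplification
# introduces.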