##### Demonstration Script #0b
##### MATH322
#####
##### Testing tests and exploring statistical tests
#####

### Refresher: The Chi-Square distribution with 5 degrees of freedom
x = seq(0,20, length=1000)     # The x-values
y = dchisq(x, df=5)            # The density
plot(x,y, type="l")            # That is a lowercase L, not a 1.


### See how bad the approximation is
# From the notes, we discovered that the data-generating
# process is Binomial, but the Chi-Square Goodness-of-Fit
# test requires it be Poisson. Thus, it is mathematically
# incorrect.
# BUT, how bad is this simplification?
#
# This is another example of using Monte Carlo
# simulation to learn about statistics.

## Roll the fair 3-sided die 300 times (one experiment)
# This is the number of 1s that come up
Obs = rbinom(1, size=300, prob=1/3)

# This is the Chi-Square test statistic for that one experiment
X2 = (Obs-100)^2/100

# You cannot know the distribution of that test statistic with
# just one value, so we need to do this many, many times...
# perhaps a million?
Obs = rbinom(1e6, size=300, prob=1/3)
X2  = (Obs-100)^2/100

# Now, Obs holds the number of 1s in 1 million experiments,
# and X2 holds the test statistic for each of those million
# experiments. From that, we can see the distribution.

## Here it is in terms of the pdf
hist(X2, freq=FALSE)

# Here is a better version
hist(X2, freq=FALSE, breaks=51)

# Now, let us superimpose the Chi-Square distribution
x = seq(0,20, length=1000)
y = dchisq(x, df=1)            # Note the prefix is d for "density," the pdf f(x)
lines(x,y, col="blue")

# Wow! That looks pretty close!!! So, even if the Chi-Square
# Goodness-of-Fit test is mathematically wrong, it is close
# enough to justify the simplification.

## Now, let us look at this in terms of the CDF
plot(ecdf(X2))

x = seq(0,20, length=1000)
y = pchisq(x, df=1)            # Note the prefix is now p, for "probability," P[X <= x]
lines(x,y, col="red")

# Again, the observed (dots) and hypothesized (red line) seem "close,"
# thus justifying the use of the simplification.
# (Two extra sketches at the end of this script put numbers on "close.")

## Maybe another distribution works better?
# Try the standard Normal distribution
y = pnorm(x)
lines(x,y, col="blue")

# Try the Poisson distribution
y = ppois(x, lambda=1)
lines(x,y, col="magenta")


##### ##### #####

### Yesterday, I left you with the question: What do we mean
# when we say the distribution of the sample mean "converges"
# to the N(mu, sigma^2/n) distribution?

### We can mean: The largest vertical distance between the two
# CDFs goes to zero.
# (An extra sketch at the end of this script computes that distance.)

## To see this: as df -> Inf, the t distribution -> the standard Normal

# In terms of the pdf
x  = seq(-5,5, length=1e4)
yT = dnorm(x)
plot(x,yT, type="l", col="blue")
for(i in 1:50) {
  y = dt(x, df=i)
  lines(x,y, col=grey(i/50))
  Sys.sleep(0.25)              # Forces the system to wait 0.25 seconds; allows easy animation
}
lines(x,yT, col="blue")

# In terms of the CDF
x  = seq(-5,5, length=1e4)
yT = pnorm(x)
plot(x,yT, type="l", col="blue")
for(i in 1:50) {
  y = pt(x, df=i)              ## Note the p prefix
  lines(x,y, col=grey(i/50))
  Sys.sleep(0.25)
}
lines(x,yT, col="blue")

### We can mean other things. When we get to Chapters 6 and 7, we will
# see that our choice has consequences if we want to test if two
# distributions are "close enough" to be indistinguishable.
#
# But, I leave that for another day.
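
##### ##### #####

### Extra sketch: putting a number on "largest distance"
# The animations above let us watch the t CDF approach the Normal
# CDF. Here is a sketch of one way to measure that: for each df,
# compute the largest vertical gap between the two CDFs over a grid
# of x-values. (The grid is an arbitrary choice I am assuming here;
# a wider and finer grid approximates the true largest distance
# better.)
x = seq(-10, 10, length=1e5)
D = sapply(1:50, function(i) max( abs( pt(x, df=i) - pnorm(x) ) ))
plot(1:50, D, type="b",
     xlab="Degrees of freedom", ylab="Largest distance between CDFs")
# The distances shrink toward zero as df grows, which is exactly what
# "converges" means under this choice of distance.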
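
### Extra sketch: a numeric check of the Chi-Square simplification
# The histogram and ecdf comparisons above rely on eyeballing. A
# sketch of one numeric check: if X2 really followed the
# Chi-Square(1) distribution, it would exceed the 0.95 quantile of
# that distribution in about 5% of experiments. The gap between the
# simulated proportion and the nominal 0.05 puts a number on "how
# bad" the simplification is.
Obs = rbinom(1e6, size=300, prob=1/3)    # Rerun the million experiments
X2  = (Obs-100)^2/100
mean( X2 > qchisq(0.95, df=1) )          # Compare this to the nominal 0.05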
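
### Extra sketch: superimposing the exact distribution
# Since the data-generating process is Binomial, the exact CDF of X2
# can be written down: X2 <= t exactly when Obs is within 10*sqrt(t)
# of 100. This sketch overlays that exact CDF (green) on the ecdf;
# it assumes X2 from the rerun just above is still in memory.
plot(ecdf(X2))
tvals  = seq(0, 20, length=1000)
Fexact = pbinom( floor(100 + 10*sqrt(tvals)),       size=300, prob=1/3 ) -
         pbinom( ceiling(100 - 10*sqrt(tvals)) - 1, size=300, prob=1/3 )
lines(tvals, Fexact, col="darkgreen")
lines(tvals, pchisq(tvals, df=1), col="red")   # The hypothesized Chi-Square(1) CDF
# The ecdf and the exact CDF should sit on top of each other; the gap
# between them and the red curve is the error the simplification
# introduces.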