Introductory Statistics

 


Lab E: Confidence Intervals and Coverage

The main purpose of this laboratory activity is to better understand confidence intervals. A secondary purpose throughout these lab activities is to understand randomness and its effects on statistics. A third is to see how the Central Limit Theorem affects statistics in many surprising ways.

Finally, remember that these are designed to help you master the material. However, this also requires you to think through the steps and seek connections. No assignment will force you to learn. Learning requires a conscious decision and resulting effort on your part.

The Pre-Lab

All of these laboratory activities have pre-labs that must be completed prior to class on lab day (Moodle Quizzes). These pre-labs consist of a brief quiz covering the lab. You need to take the quiz before class on lab days. While it is possible to do well on the quiz without going through and thinking about the lab material, don’t do it. You are cheating yourself.

These pre-labs do not have to be done the night before the lab. I strongly recommend that you do them in the days leading up to the lab. There will frequently be a week between when we cover a topic in class and when there is a lab on it. Do yourself a favor and start early on the pre-labs.

Also, treat the pre-labs as material to help you prepare. These are not busy work. You should be able to enter class on lab day and teach all of the material in the lab to the class. Learning takes time. Remember that you should be willing to spend at least 15 hours a week on the course material. Your primary job is to be a student. Do not cheat yourself.

✦•······················•✦•······················•✦

Moodle

And so, with no further ado, please complete the Moodle quiz and submit it before class begins.

The quiz is a partial check on your reading of this lab. While you may be able to pass the quiz without reading thoughtfully, you would be cheating yourself. Given how much you are spending on your education, be thoughtful.

The Lab

This lab starts in the same way as all do: Start R, start a new script, set the working directory, and source the usual file from that script:

source("http://rfs.kvasaheim.com/stat200.R")

We will be using random numbers in this activity. So, to get you in the habit of doing this (for science to occur, the results must be replicable), you will need to set the pseudo-random number seed. Do this according to your student ID number. That is, if your ID number is 123456, you would next run in your script window

set.seed(123456)

The purpose of this is to allow me to check that you are reporting the numbers correctly. As such, before reporting your final numbers, run the entire script from the beginning. This will ensure that your numbers are the ones I see. This is very important, muy importante, très important, sehr wichtig, etc. If your numbers do not agree with mine, yours are wrong.

Overview

A confidence interval is defined as a set of values which contain the population parameter a specified proportion of the time — when the experiment is performed an infinite number of times. Coverage is defined as the proportion of the time that the confidence interval actually contains the population parameter. Thus, if we are calculating a 95% confidence interval, we would hope the coverage is very close to 95%. If it is not, then our claim of 95% confidence really means little.

Using these definitions, we will test to see how close the real coverage is to the claimed rate for a few of our procedures when the assumptions are met — and when the assumptions are not met.

The Z-procedure

Remember the Z-procedure for the population mean from Lecture d1 (“The Theory of the Z”). The assumptions of (requirements for) this procedure are twofold:

  1. the data are generated from a Normal distribution
  2. the population variance is known

When these requirements are met, we know that the endpoints of a 95% confidence interval are

\( \bar{x} \pm 1.96\ \frac{\sigma}{\sqrt{n}} \)
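
As a quick numerical illustration (the values here are made up for demonstration; they are not part of the lab), suppose we observe a sample mean of 10.3 from n = 25 observations, with known σ = 4:

xbar  = 10.3                     ## observed sample mean (made-up value)
sigma = 4                        ## known population stdev (made-up value)
n     = 25                       ## sample size (made-up value)

xbar - 1.96 * sigma/sqrt(n)      ## lower bound:  8.732
xbar + 1.96 * sigma/sqrt(n)      ## upper bound: 11.868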

In the first part of this lab, we will check that coverage is close to 95% when these two assumptions are met. In other words, we will check that the Z-test does what it claims.

Then, in the second part, we will violate the Normality assumption to see the effect on coverage. Hopefully, the effects will not be too severe. If they are not, then that assumption is not that important. If they are severe, then we need to be very sure that the Normality assumption is met.

The third part of the lab has us violating the second assumption. Again, we will check coverage to determine the importance of this assumption. If the coverage is close to 95%, then we can use the Z-test even when we don’t know the population variance, σ². Otherwise, we will need to use some other method.

Finally, the last part of the lab has us violate both requirements. Will the Z-test still be appropriate when both assumptions are violated? Will we need a really large sample size? What can we learn about the Z-test from violating its assumptions?

Ultimately, we wonder: What can we learn about the Central Limit Theorem and the Law of Large Numbers from this lab?

Part I: Violation of Neither Requirement

If we meet the assumptions of the Z-test, then coverage had better be really close to 95%. If it is not, then the test is not appropriate. To check that the Z-test is basically appropriate, we generate our data “under the null hypothesis”; that is, we generate the data from the very distribution the null hypothesis specifies.

The following code will generate a set of data from a known Normal distribution, calculate the endpoints of the central confidence interval, determine if the interval contains the population mean, then repeat this many, many, many times (10,000 times, 1e4). When it is all done, we should have approximately 95% of the intervals containing the population mean.

covered = numeric()    ## Set aside memory
n       = 100          ## Sample size
mu      = 3            ## Known population mean
sigma   = 4            ## Known population stdev

for(i in 1:1e4) {
  x   = rnorm(n, m=mu, s=sigma)
  ucb = mean(x) + 1.96 * sigma/sqrt(n)
  lcb = mean(x) - 1.96 * sigma/sqrt(n)
  covered[i] = isBetween(mu, lcb, ucb)
}

mean(covered)                    ## Coverage
(1-mean(covered)-0.05)/0.05      ## Relative error

When I ran this code, I got 0.9494. Thus, I estimate the coverage rate as 94.94%. This is very close to our claimed rate of 95%. The relative error is only (0.0506 − 0.05)/0.05 = 1.2%.


Is your coverage rate close enough to what you would expect? In this box, type your coverage:

Do you think your coverage rate is sufficiently close to 95%? Explain, using your relative error.

Remember why the question of “sufficiently close” is important. That closeness indicates how trustworthy the claimed confidence level is under these circumstances. Close to 95% is excellent!

As always, your values will differ slightly from mine. Such is the nature of randomness.

A quick discussion on “close enough”:

If I claim the coverage of the confidence interval is 95%, but it really is 90%, is that “close enough”? To answer that, it helps to think in terms of α, which is 1 minus the coverage; it is the proportion of the time our claim about the population parameter is wrong. If I claim α = 0.05, but in reality α = 0.10, is that close enough? No. Why? Think about hypothesis testing. That should give a better appreciation for α and coverage.

And so, my observed Type I error rate was actually a = 1 − 0.9494 = 0.0506. How bad is this? The relative error is (0.0506 − 0.05)/0.05 = 1.2%. What is your relative error when the assumptions are met? It should be very close to zero.

Also, do not forget the rule of thumb for rounding. Round to 2 decimal places when using 10,000 experiments (as here) and to 3 when using 1,000,000.
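
One way to see where this rounding rule comes from (my explanation, not part of the original lab): the estimated coverage is a sample proportion, so its Monte Carlo standard error is roughly the square root of p(1 − p)/B, where B is the number of iterations.

sqrt(0.95 * 0.05 / 1e4)    ## approx 0.0022: trust about 2 decimal places
sqrt(0.95 * 0.05 / 1e6)    ## approx 0.0002: trust about 3 decimal places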

By the way, the claimed Type I error rate is α (“alpha”). The actual Type I error rate is a (“ay”), which is the Latin version of α.

Part II: Violations of Assumption 1: Normality

In the above code, we found that the Z-procedure worked quite well when both of its assumptions were met. Let’s violate the first assumption here, that of Normality. Instead, let’s draw our sample from an Exponential distribution with mean μ = 10 (and rate λ = ?).

Before you see the code, note that there are four things that need to be changed in the code above. First, the mean and the standard deviation both need to be changed to 10. (Why? Check slidedeck c8.) Also, the data-generating distribution needs to be changed to rexp(n, rate=1/10). (Why is the rate λ = 1/10?) Finally, let us use a sample size of n = 10.
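
If you would like to convince yourself that the mean and the standard deviation of this Exponential distribution really are both 10, a quick sanity check (my own, not required by the lab) is to draw a very large sample:

z = rexp(1e6, rate=1/10)    ## a very large sample from Exp(rate = 1/10)
mean(z)                     ## close to 10, since E[X]  = 1/lambda
sd(z)                       ## close to 10, since SD[X] = 1/lambda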

And so, with that, here is the resulting code.

covered = numeric()    ## Set aside memory
n       = 10           ## Sample size
mu      = 10           ## Population mean
sigma   = 10           ## Population stdev

for(i in 1:1e4) {
  x   = rexp(n, rate=1/mu)
  ucb = mean(x) + 1.96 * sigma/sqrt(n)
  lcb = mean(x) - 1.96 * sigma/sqrt(n)
  covered[i] = isBetween(mu, lcb, ucb)
}

mean(covered)                    ## Coverage
(1-mean(covered)-0.05)/0.05      ## Relative error

Pay close attention to the parameter lines and the data-generating line in the code above. Experience tells me that errors tend to happen in these places.

When I ran this, I got a coverage of 0.957. Thus, while we claim α = 0.05, our real Type I error rate is closer to a = 0.043, which is a relative error of -14.0%. Our claim is not close to reality.

When is the coverage close enough?

The previous example was for the Exponential distribution with a large variance (σ² = 100) and a small sample size. Let us increase the sample size to see what happens to coverage. Use sample sizes of n = 20, 30, 50, and 100. Calculate the relative error.
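
Rather than editing and re-running the code by hand for each sample size, you may find it convenient to wrap the simulation in a function. Here is one possible sketch (the function coverageExp() is my own creation, not part of the lab; it still relies on isBetween() from the sourced stat200.R file):

coverageExp = function(n, mu=10, B=1e4) {
  covered = numeric(B)                  ## Set aside memory
  for(i in 1:B) {
    x   = rexp(n, rate=1/mu)
    ucb = mean(x) + 1.96 * mu/sqrt(n)   ## sigma = mu for the Exponential
    lcb = mean(x) - 1.96 * mu/sqrt(n)
    covered[i] = isBetween(mu, lcb, ucb)
  }
  mean(covered)                         ## the estimated coverage
}

coverageExp(n=20)    ## then repeat for n = 30, 50, and 100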

This is a table of my results:

Sample Size   Coverage   Type I Error Rate   Relative Error
     10        0.9570         0.0430            -14.0%
     20        0.9520         0.0480             -4.0%
     30        0.9538         0.0462             -7.6%
     50        0.9531         0.0469             -6.2%
    100        0.9507         0.0493             -1.4%

Depending on where I draw the line between “good enough” and not, I may need to have a sample size of over 50 before I can use the Z-test when the data are not from a Normal distribution. This is quite a bit higher than what textbooks usually suggest, n = 30, as a minimum sample size before you can ignore non-Normality.

I think a relative error of 5% is “close enough” for most of my work. However, if my conclusions could result in death or dismemberment, I would want to use a much smaller boundary… perhaps as low as a 0.1% relative error.


This is a table for your results:

Sample Size   Coverage   Type I Error Rate   Relative Error
     10                                              %
     20                                              %
     30                                              %
     50                                              %
    100                                              %

So, given your results, what lessons can you take away from this regarding the “rule of thumb” provided in textbooks regarding n = 30?


Note that your results may not follow a consistent pattern. See mine. A sample size of 20 gave me a result closer to reality than a sample size of 50. This happens because the samples are randomly generated. Remember the rule of thumb on rounding. From the results in my table, it seems as though the sample size needs to be between 50 and 100 to be sufficient (if we use a 5% relative error as the cut-off). Additional experimentation (and a larger number of iterations) would allow me to be more precise about my estimate.

That is, if I want to use the Z-test when the data are from an Exp(λ=0.10), then I need to ensure my sample size is at least 75 (perhaps).

What About a Different Exponential?

Note that the above used an Exponential distribution with a relatively wide spread, σ² = 100. Let’s see how things differ if we choose an Exponential distribution with a smaller variance. Draw your samples from an Exponential with mean 0.10 (and standard deviation σ = ?).
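
The code is the same as before; only the parameter lines change. Here is the full loop with the new values (my transcription of the changes, so double-check it against your own):

covered = numeric()    ## Set aside memory
n       = 10           ## Sample size
mu      = 0.10         ## Population mean
sigma   = 0.10         ## Population stdev (equals the mean for an Exponential)

for(i in 1:1e4) {
  x   = rexp(n, rate=1/mu)    ## rate = 1/0.10 = 10
  ucb = mean(x) + 1.96 * sigma/sqrt(n)
  lcb = mean(x) - 1.96 * sigma/sqrt(n)
  covered[i] = isBetween(mu, lcb, ucb)
}

mean(covered)                    ## Coverage
(1-mean(covered)-0.05)/0.05      ## Relative error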

This is a table of my results:

Sample Size   Coverage   Type I Error Rate   Relative Error
     10        0.9531         0.0469             -6.2%
     20        0.9525         0.0475             -5.0%
     30        0.9509         0.0491             -1.8%
     50        0.9553         0.0447            -10.6%
    100        0.9529         0.0471             -5.8%

This is a table for your results:

Sample Size   Coverage   Type I Error Rate   Relative Error
     10                                              %
     20                                              %
     30                                              %
     50                                              %
    100                                              %

Again, note that there is no definite pattern to my numbers. However, because of the Central Limit Theorem (CLT), we would definitely expect the coverage to get closer to 95% as the sample size increases. (Why??)
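
To see the CLT at work here (an optional check of my own, not required by the lab), compare the sampling distribution of the mean at two sample sizes; the histogram for the larger n should look much closer to Normal:

means10  = replicate(1e4, mean(rexp(10,  rate=10)))    ## n = 10
means100 = replicate(1e4, mean(rexp(100, rate=10)))    ## n = 100

par(mfrow=c(1,2))
hist(means10,  main="n = 10",  xlab="Sample mean")    ## visibly right-skewed
hist(means100, main="n = 100", xlab="Sample mean")    ## much more symmetric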

Part III: Violations of Assumption 2: Knowing σ²

The second assumption in using the Z-procedure is that the population variance σ² is known. This is usually rather untenable, as we do not know the population mean and the variance is a function of the mean.

However, we discovered above that the distribution is largely irrelevant when the sample size is sufficiently large (CLT). I wonder if the same conclusion can be made with regards to using the sample variance in lieu of the population variance. This section investigates this. Before continuing, please review the Law of Large Numbers, which explains why the sample variance converges to the population variance as the sample size increases.
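
As a quick illustration of the Law of Large Numbers (my own aside, not part of the lab): for a Normal population with σ = 4, the sample standard deviation settles toward 4 as the sample size grows.

for(n in c(10, 100, 1000, 100000)) {
  print( c(n=n, s=sd(rnorm(n, m=3, s=4))) )    ## s approaches sigma = 4
}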

Again, we will explore coverage and how the sample size affects it. Since we only want to investigate this violation, we need to draw our sample from a Normal distribution. As such, the code is largely the same as that above. There is only one substantive change. Before you move down the page to see the new code, see if you can create it yourself.

Once you have written your code, compare it to the code I wrote below:

covered = numeric()    ## Set aside memory
n       = 10           ## Sample size
mu      = 3            ## Known population mean
sigma   = 4            ## Known population stdev

for(i in 1:1e4) {
  x   = rnorm(n, m=mu, s=sigma)
  ucb = mean(x) + 1.96 * sd(x)/sqrt(n)
  lcb = mean(x) - 1.96 * sd(x)/sqrt(n)
  covered[i] = isBetween(mu, lcb, ucb)
}

mean(covered)                    ## Coverage
(1-mean(covered)-0.05)/0.05      ## Relative error

What changes were made? Why did we need to make those changes?

We are drawing our sample from a Normal population (requirement 1 is met). But, we are now estimating the population standard deviation σ using the sample standard deviation, s (requirement 2 is violated).

As I change the sample size, I get the following results:

Sample Size   Coverage   Type I Error Rate   Relative Error
     10        0.9207         0.0793             58.6%
     30        0.9442         0.0558             11.6%
     50        0.9422         0.0578             15.6%
     75        0.9478         0.0522              4.4%
    100        0.9442         0.0558             11.6%

This is a table for your results:

Sample Size   Coverage   Type I Error Rate   Relative Error
     10                                              %
     20                                              %
     30                                              %
     50                                              %
    100                                              %

What do your results tell you regarding the “rule of thumb” provided in standard introductory statistics textbooks regarding n = 30?

Note that the violated requirement was not about the distribution, so the Central Limit Theorem (CLT) played no role in this part of the lab. Since the requirement was about a parameter value, the Law of Large Numbers is what is at work. Be able to explain the difference between the two.

Part IV: Violations of Both Assumptions

In this section, we violate both assumptions: the data will come from an Exponential distribution (not a Normal one), and we will use the sample standard deviation, s, in place of the unknown population standard deviation, σ. One possible script is sketched below.
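
This sketch simply combines the changes from Parts II and III (it is my combination of the earlier code blocks, so verify it against your own reasoning):

covered = numeric()    ## Set aside memory
n       = 10           ## Sample size (change this for each row of the table)
mu      = 10           ## Population mean of the Exponential

for(i in 1:1e4) {
  x   = rexp(n, rate=1/mu)               ## violates Normality (Part II)
  ucb = mean(x) + 1.96 * sd(x)/sqrt(n)   ## violates known sigma (Part III)
  lcb = mean(x) - 1.96 * sd(x)/sqrt(n)
  covered[i] = isBetween(mu, lcb, ucb)
}

mean(covered)                    ## Coverage
(1-mean(covered)-0.05)/0.05      ## Relative error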

Now, change the sample size to determine the effect of increasing the sample size on your coverage. Remember that coverage should be close to 1 − α = 0.95 and that the observed Type I error rate (a) should be close to α = 0.05.

Here are my results:

Sample Size   Coverage   Type I Error Rate   Relative Error
     10        0.8682         0.1318            163.6%
     30        0.9205         0.0795             59.0%
     50        0.9246         0.0754             50.8%
    100        0.9380         0.0620             24.0%
    200        0.9482         0.0518              3.6%

It really looks like I would need at least 150 to 200 in my sample to use the Z-procedure when the data are not Normal and the population variance is unknown.

To get a better estimate, I would need to increase the number of iterations (precision) and repeat the experiment for several different values for the sample size.


Here is a table for your results:

Sample Size   Coverage   Type I Error Rate   Relative Error
     10                                              %
     30                                              %
     50                                              %
    100                                              %
    200                                              %

What are your results? What lessons can you take away from this regarding the “rule of thumb” provided in your textbook regarding n = 30? Is even n = 50 enough under these circumstances/conditions? What about n = 100?

The Post-Lab

Please answer the following post-lab questions as usual.

  1. What role did the Central Limit Theorem play in this activity? What role did the Law of Large Numbers play in this activity? Was the usual rule of thumb[1] typically given in textbooks sufficient, or would you recommend a higher cutoff between “good enough” and “not good enough” when the data are non-Normal and you do not know σ? If so, what would it be?
    As always, explain fully.
  2. The Exponential distribution used in this activity is highly right-skewed (H = 0.307). Would the rule of thumb you created in #1 be different if the data came from a symmetric distribution like the standard Uniform distribution? If so, why? Would it increase or decrease?
    Explain fully using statistics and what you know of the two distributions. I want to see how well you understand the two distributions and the effect of symmetry on the Central Limit Theorem. I also want to check that you are able to use R to determine the answer (as you need to).
  3. The t-procedure did not always exist; it was created in 1908 by William Sealy Gosset. Before then, all testing of means was done using the Z-procedure. It was not until the turn of the 20th century that someone (a managing brewer for the Guinness Brewery in Ireland) decided the Z-procedure was not acceptable for his needs. Gosset then devised the t-procedure.
    From the results of this laboratory activity, was it more likely that Gosset dealt with small sample sizes or large? Explain your reasoning.

Remember that the post-lab is graded on correctness as well as on your ability to express yourself well. Spend time making sure that there are no errors. Include actual values from your analysis to support each of your answers. This last point is important for all three questions. Without including the statistics you calculated, you are not grounding your answers in reality, and your grade will reflect that.

Make sure you include your script from this lab, properly commented and easily readable. Your script should include what is in the lab as well as what you do to answer question 2. No script = No points.

As always: your first page is the title page. Your title page should include your name, the lab title, the date of the lab, and your Knox ID. Start a new page and answer the post-lab questions. After you answer those questions, start a new page and start your code appendix.

Footnote

[1] The “usual rule of thumb” states that the sample means are essentially Normally distributed as long as the sample size is at least 30.