Introductory Statistics

 

IS: Lab C, The Central Limit Theorem

Lab C: The Central Limit Theorem

The purpose of this laboratory activity is to better understand the relationships between continuous distributions and the phenomena they describe. A secondary, yet persistent, purpose throughout these lab activities is to understand randomness and the distribution of certain sample statistics.

Remember that these are designed to help you master the material. However, this also requires you to think through the steps and seek connections. No assignment will force you to learn. Learning requires a conscious decision and resulting effort on your part.

The Pre-Lab

All of these laboratory activities have pre-labs that must be completed prior to class on lab day (Moodle Quizzes). These pre-labs consist of a brief quiz covering the lab. You need to take the quiz before class on lab days. While it is possible to do well on the quiz without going through and thinking about the lab material, don’t do it. You are cheating yourself.

These pre-labs do not have to be done the night before the lab. I strongly recommend that you do them in the days leading up to the lab. There will frequently be a week between when we cover a topic in class and when there is a lab on it. Do yourself a favor and start early on the pre-labs.

Also, treat the pre-labs as being material to help you prepare. These are not busy work. You should be able to enter class on lab day and teach all of the material in the lab to us in the class. Learning takes time. Remember that you should be willing to spend at least 15 hours a week on the course material. Your primary job is to be a student. Do not cheat yourself.

✦•······················•✦•······················•✦

 Moodle

And so, with no further ado, please complete the Moodle quiz and submit it before class begins.

The quiz is a partial check on your reading of this lab. While you may be able to pass the quiz without thoughtfully doing the lab, you are cheating yourself. Given how much you are spending on your education, be thoughtful.

The Lab

This lab starts in the same way as all do: Start R, start a new script, set the working directory, and source the usual file from that script:

source("http://rfs.kvasaheim.com/stat200.R")

We will be using random numbers in this activity. For science to occur, results must be replicable, so get in the habit of setting the pseudo-random number seed. Set it according to your student ID number. That is, if your ID number is 123456, you would next run in your script window

set.seed(123456)

The purpose of this is to allow me to check that you are reporting the numbers correctly. As such, before reporting your final numbers, run the entire script from the beginning. This will ensure that your numbers are the ones I see. This is very important, muy importante, très important, sehr wichtig, etc. If your numbers do not agree with mine, yours are wrong.
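
A minimal sketch of why the seed matters: resetting the seed to the same value reproduces the same pseudo-random draws. The seed 123456 is the placeholder ID from the example above, and the names first_draw and second_draw are made up for illustration.

```r
set.seed(123456)            # placeholder ID from the example above
first_draw = runif(5)       # five pseudo-random values

set.seed(123456)            # reset to the same seed
second_draw = runif(5)      # the same five values are generated again

identical(first_draw, second_draw)   # compares the two draws element by element
```

This is also why rerunning the entire script from the top, seed included, gives back the same numbers each time.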

* * *

In the previous lab, we calculated observation intervals. Recall that these are intervals on data already experienced (measured). Thus, one would use an observation interval to describe collected data. A prediction interval is an interval on future observations. So, one may use prediction intervals to plan for a future observation.

In this lab, we will focus on three types of intervals: observation, prediction, and confidence.

For this lab, I have created a function that helps with calculating these three types of intervals. It is the interval function. You will see its use in the lab.

Part I: The Uniform Distribution

This part looks at the effect of the sample size on the precision of our estimates. Let us start by drawing a sample of size 100 from a Uniform distribution. In fact, let us draw our sample from a standard Uniform: Unif(0,1). Once we have that sample, let us look at the distribution by creating a histogram (empirical pdf) of the variable.

x = runif(100, min=0, max=1)
hist(x, freq=FALSE)

Note that the histogram is very jagged. It does not look very close to the probability density function (pdf) of a standard Uniform distribution. The following lines draw this pdf over the above histogram to help with comparison.

xx = seq(0, 1, length=1000)
yy = dunif(xx, min=0, max=1)
lines(xx, yy, col="red")

Why does the first Uniform distribution (histogram) not follow the true Uniform distribution (red line)? What can we do to make it follow more closely?

Increase Precision

The answer to the second question is to increase the sample size. Larger samples tend to give more precise estimates. Try

x = runif(10000, min=0, max=1)
hist(x, freq=FALSE)

The observed distribution is still not the true distribution (population); however, it is much closer. Finally, try the same, but with a sample size of a million (1e6). Does the observed distribution approach the theoretical distribution as the sample size increases? That is an important lesson in statistics.

By the way, this is a great place to check the quality of your computer. Weaker computers will have trouble handling the million values… especially when creating a graphic of them. My own laptop will pause for a few moments before creating that histogram. For mine, it takes about 30 seconds for 1 × 10^8 values.

Increasing the sample size tends to increase the precision of the estimates.
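
One way to put a number on "closer" is sketched below. It measures the largest gap between the histogram bar heights and the true density (which is 1 everywhere on the standard Uniform) for the three sample sizes used above. The ten-bin layout, the seed, and the name max_gap are choices made for this illustration only.

```r
set.seed(123456)                      # placeholder seed; use your own ID
breaks = seq(0, 1, by = 0.1)          # ten equal-width bins on [0, 1]
sizes  = c(100, 10000, 1000000)       # the three sample sizes from this part

# Largest gap between the histogram bar heights and the true density of 1
max_gap = sapply(sizes, function(n) {
  x = runif(n, min = 0, max = 1)
  h = hist(x, breaks = breaks, plot = FALSE)   # compute the bars without drawing
  max(abs(h$density - 1))
})

data.frame(n = sizes, max_gap = max_gap)
```

Compare the three gaps in the printed table with what you saw in the three histograms.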

Part II: The Exponential Distribution

The previous part looked at the effect of sample size on precision. This part compares the three interval types. To illustrate the differences between observation, prediction, and confidence intervals, we will make use of the Exponential distribution. This distribution is a great choice because it is highly skewed (H ≈ 0.307).

And so, let us run the following code. You should be able to tell what the first line does and hypothesize about the other three lines. Doing this will help slow you down and focus on similarities and differences in the analyses. Do not just run through this lab to “get it over with!” Use this time wisely.

x = rexp(1e6, rate=1)               ## lambda = 1
interval(x, type="observation")     ## observation interval
interval(x, type="prediction")      ## prediction interval
interval(x, type="confidence")      ## confidence interval

The first line generates (simulates) a million values from an Exponential distribution with rate λ = 1 and stores them in the variable x. The second line calculates a 95% observation interval; the third, the 95% prediction interval; and the final, the 95% confidence interval.

Feel free to fill in your results in a table like this.

Interval Type    My Results        Your Results
Observation      0.025 to 3.69
Prediction       -0.96 to 2.96
Confidence       0.99 to 1.01

Because the sample size is so large, your values should be quite close to mine.
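
The code behind the interval function is not shown in this lab. The sketch below is one plausible hand calculation, under the assumption (inferred only from the tabulated values) that the observation interval is the middle 95% of the sample, the prediction interval is the mean plus or minus about 1.96 standard deviations, and the confidence interval is the mean plus or minus about 1.96 standard errors. The actual function may differ. The names obs_int, pred_int, and conf_int are made up for illustration.

```r
set.seed(123456)                 # placeholder seed; use your own ID
x = rexp(1e5, rate = 1)          # smaller sample than in the lab, for speed
n = length(x)
z = qnorm(0.975)                 # about 1.96 for a 95% interval

# ASSUMPTION: these formulas are inferred from the table, not taken from interval()
obs_int  = unname(quantile(x, c(0.025, 0.975)))       # middle 95% of the observed data
pred_int = mean(x) + c(-1, 1) * z * sd(x)             # Normal-based range for one new value
conf_int = mean(x) + c(-1, 1) * z * sd(x) / sqrt(n)   # range for the population mean

round(rbind(observation = obs_int,
            prediction  = pred_int,
            confidence  = conf_int), 2)
```

Notice that only the confidence interval formula divides by the square root of the sample size.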

A Second Exponential

Let us do this with a second Exponential distribution to see if the ordering of interval widths changes. Let x follow an Exponential distribution with mean μ = 10. (If μ = 10, what is the rate λ?) Generate a sample of size n = 1,000,000 from this distribution.

Type the code in the following box:

Now, with that code (assuming it is correct), calculate the three intervals and tabulate your results. Rank them from narrowest to widest.

Interval Type    My Results        Your Results
Observation      0.25 to 36.84
Prediction       -9.57 to 29.56
Confidence       9.97 to 10.01

Again, because the sample size is so large, your values should be quite close to mine.

Based on the results of this brief experiment, what conclusions can you draw about interval widths? Which is widest? Which is narrowest?

What does the confidence interval estimate?

What does the observation interval estimate?

What does the prediction interval estimate?

Part III: The Sample Means

This last part of the lab deals with the distribution of the sample means. Note that the data are random (they have a distribution). This means that any function of the data is also a random variable… like the sample mean. Because of this, the sample mean has a distribution.

This part of the lab looks at that distribution for some different data distributions.

The Second Exponential Distribution

Generate 1,000,000 values from an Exponential random variable with mean μ = 10. (As above, what is λ?) Store these million values into the variable T. Now that we have a very large sample from this distribution, let us look at the histogram of those values. This will be a very close approximation of this distribution.

hist(T, freq=FALSE, xlim=c(0,50))

This is the histogram I got. Check that yours looks similar.

[Dist of the data]

Note that the histogram (empirical pdf) looks very much like an Exponential distribution. That is because it is.

The Sampling Distribution of the Mean, n=2

Now, let us look at the distribution of the sample mean when the sample size is 2. In other words, one experiment is to measure the mean of two Exponentially distributed random values, and we repeat that experiment 500,000 times. Here is the code:

sm2 = getMeans(T,2)
hist(sm2)

The getMeans(T,2) function calculates the means, where the data are from the T data and the sample size for the means is 2. Thus, to get a histogram of the sample means when the sample size is 3, you would run sm3 = getMeans(T,3) followed by hist(sm3).
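
The getMeans function comes from the sourced file, and its code is not shown here. A hypothetical stand-in is sketched below, assuming it splits the data into consecutive, non-overlapping groups of the given size (which would match the 500,000 means obtained from a million values when the sample size is 2). The name get_means_sketch is made up for illustration and is not part of the course file.

```r
# ASSUMPTION: a stand-in for getMeans, not the course's actual function
get_means_sketch = function(x, n) {
  k = length(x) %/% n                       # number of complete groups of size n
  m = matrix(x[seq_len(k * n)], nrow = n)   # each column holds one sample of size n
  colMeans(m)                               # one mean per sample
}

# Small check on the values 1 to 10, in pairs: (1,2), (3,4), ..., (9,10)
get_means_sketch(1:10, 2)
```

With a group size of 3, the leftover tenth value would simply be dropped in this sketch, leaving three means.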

Here is my resulting histogram of the distribution of sample means with n = 2.

[Dist of Sample Means 2]

Note that this distribution is not the same as that of the data. The data followed an Exponential distribution. The sample means do not.

The Sampling Distribution of the Mean, n=3

Here is my resulting histogram of the distribution of sample means with n = 3. Compare the shape of this histogram with that of the original data and that of sample means with n = 2.

[Dist of Sample Means 3]

Note that its shape is not the same as that of the data. The data followed an Exponential distribution. The sample means do not.

Sampling Distribution, n = 1 → 30

Instead of doing this individually for a whole lot of histograms, I used R to create the following animation. Each frame is a histogram of the sample means, but for different sample sizes.

[CLT effect]

Note that the histograms start out looking like the data (n = 1). However, as the sample size increases, the histograms look much more like a Normal distribution. This is a powerful observation! It means that the shape of the data distribution has little effect on the shape of the distribution of the sample means, as long as the sample size is large enough.
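
The change in shape can also be tracked numerically. The sketch below computes the sample skewness of the sample means for a few sample sizes, using Exponential data with rate 1 as in Part II. A symmetric distribution such as the Normal has skewness 0. The helper names skewness and group_means, and the grouping into consecutive non-overlapping samples, are assumptions made for this illustration.

```r
set.seed(123456)                       # placeholder seed; use your own ID
x = rexp(300000, rate = 1)             # skewed data, as in Part II

# Sample skewness: third central moment over the cubed standard deviation
skewness = function(v) mean((v - mean(v))^3) / sd(v)^3

# Means of consecutive, non-overlapping groups of size n (an assumed grouping)
group_means = function(v, n) {
  k = length(v) %/% n
  colMeans(matrix(v[seq_len(k * n)], nrow = n))
}

sizes = c(1, 2, 3, 30)
skews = sapply(sizes, function(n) skewness(group_means(x, n)))
data.frame(n = sizes, skewness = round(skews, 2))
```

Compare the skewness values in the printed table with the shapes you see in the histograms and the animation.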

Compare the distributions (histograms) of the original data, the sample means with n=2, the sample means with n=3, etc. What happens to the distribution of the sample means as the sample size increases?

Finally, what happens to the observation interval for the sample means as the sample size increases?

Note that the “observation interval for the sample means” is the confidence interval.

The Post-Lab

That is all there is to this third laboratory activity. Go back over it and try to summarize what you learned in four sentences (or so) in your notes. This will help you optimize learning here. It will also help you by showing where you are unsure and (therefore) should return and relearn the material.

Education is expensive.

These are the three post-lab questions:

  1. Describe the Exponential distribution in terms of sample space and skew. Among observation, prediction, and confidence, which is narrowest?
  2. Repeat Part II, using a Uniform(0, 1) distribution. Is the ordering of interval width the same for this distribution, or are your conclusions based solely on the distribution type?
  3. As the sample size (number of values added to obtain the mean) increases, what distribution does the distribution of the sample means approach? Which of the following data distributions will have its sample means converge fastest: Uniform, Exponential, or Normal? Explain.

Note that two of the questions require additional experimentation.

Remember that the post-lab is based on correctness as well as your ability to express yourself well. Spend time making sure that there are no errors. Include actual values from your analysis to support each of your answers. This last is important. Without including the statistics you calculated, you are not grounding your answers in reality, and your grade will reflect that.

Make sure you include your script from this lab, properly commented. Your script should include what is in the lab as well as what you do to answer questions 2 and 3. No script = No points.

As always: your first page is the title page. Your title page should include your name, the lab title, the date of the lab, and your Knox ID. Start a new page and answer the post-lab questions. After you answer those questions, start a new page and start your code appendix.