SCA 07a: Distribution of the Sample Mean

Purpose

This SCA is slightly different from our previous activities. In the past, you have spent your time learning R functions and how to use them. If you have been following along, you now have the skills to generate random samples from many different distributions, calculate sample statistics from those samples, and calculate probabilities.

This activity has you examine the distribution of the sample mean. To do so, you will need to do the following four steps:

  1. generate a random sample from a distribution,
  2. calculate the sample mean of that sample,
  3. save that value into a variable for further use, and
  4. repeat these steps many, many, many times.

At the end, you will have a variable filled with means calculated on different samples.

This SCA complements Laboratory Activity C. The Central Limit Theorem is the most important theorem in statistics. Make sure you understand every single part of it.

Functions

In this SCA, we will be using the following functions in R. It is useful to keep track of where you were introduced to the functions. By the end of the SCA, you should be able to explain what these functions do. Be clear that these functions allow us to generate random samples, calculate statistics on them, and examine the resulting distributions. When performing the same analysis over and over again, requiring a new sample each time, these are very important!

    rnorm()    rexp()    runif()    rbinom()    mean()    var()
    numeric()    for()    hist()    overlay()*    hildebrand.rule()*    shapiroTest()*

Those functions with a * are only available if this line is run in your script:
source("http://rfs.kvasaheim.com/stat200.R")
Note that you should not include the * when using those functions.
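
If you would like to confirm that the starred functions actually loaded, one optional check (not part of the SCA itself) uses base R's exists() function, which reports whether an object with a given name is available in your session. The three functions checked below are the ones used later in this SCA that come from the sourced script rather than from base R.

    source("http://rfs.kvasaheim.com/stat200.R")
    exists("overlay")          # TRUE once the line above has been run
    exists("shapiroTest")      # TRUE once the line above has been run
    exists("hildebrand.rule")  # TRUE once the line above has been run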

 


The SCA Procedure

As usual, here is the procedure. Make sure you understand that the goal is not for you to get to the finish line. The goal is for these steps to help you better understand how to achieve the purposes and goals listed above. Racing through these without savoring them wastes your precious time.

Part O: The Preparations

The following are common start-up instructions. You will want to always follow them when starting analyses.

  1. Start R and open a new script in your SCA7 folder. Title this script “sca07a.R” or something like that.
  2. Now, since we will be using some special R functions that do not exist in the base R package, we will need to import them. Making sure you have an Internet connection, run this line:
    source("http://rfs.kvasaheim.com/stat200.R")
    When you run this line, R goes to the URL you specified and runs that code. Here, the code only imports several helpful functions. From this point forward, I will assume you run this line for every script in this course.
  3. There is no need to load any data. We will generate our own data using random number generators.
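
Putting these start-up steps together, the top of your sca07a.R script might look like the following minimal sketch. The set.seed() line is my own optional addition, not part of the SCA instructions; it makes your random samples repeatable, so leave it commented out if you want fresh samples on every run (as the later parts of this SCA assume).

    # sca07a.R -- Distribution of the Sample Mean
    # Import the extra course functions (requires an Internet connection).
    source("http://rfs.kvasaheim.com/stat200.R")

    # Optional (not required by the SCA): make the random samples repeatable.
    # set.seed(42)

    # No data file is loaded; all data are generated by random number generators.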

Part I: The Overview

As you have already discovered, statistics concerns more than just learning about the sample. More importantly, it concerns learning something about the population. To learn about the population, we calculate statistics on the sample and use our knowledge of the distribution of that statistic to better understand the population parameter.

This part shows how to better understand the distribution of the sample mean. The process can easily be generalized to different data-generating functions and different statistics. We will see this later in this SCA.

  1. The first step is to select a distribution (data-generating function) and a sample size that produce our data. We will know all aspects of this population so that we can see how our sample statistic compares to the population parameter. Without loss of generality, let us specify that the random variable is generated from a Normal process with mean 0 and standard deviation 1. That is, let \(X \sim N(0, 1)\). While we are here, let us also specify that the sample size is n = 10.
    x = rnorm(10, m=0, s=1)
  2. Next, let us calculate the sample mean from this sample and save it in the variable mn.
    mn = mean(x)
  3. Now, the variable mn holds the sample mean from this particular sample. To see the sample, enter x. To see the sample mean, enter mn.
  4. Since we want to explore the distribution of the sample means, we need more than one value. So, let us repeat the above two lines 10 times. What is the distribution of mn?
    Oops! Each time you run those two lines, the value in mn is overwritten by the new values. So, at the end of running those lines 10 times, you only have one value in mn. We really want R to save the new sample mean without losing the previous values.
    To accomplish this, we need to change the variable mn into a vector and specify which element of the vector the new sample mean will be stored in. Here are the two lines that accomplish this:
    mn = numeric()
    mn[i] = mean(x)
  5. We also need to tell R to loop through two of the lines multiple times. We will use the for function to accomplish that. The for function specifies the index and the range you want the index to sequence through.
  6. And so, without further ado, here is the entire code to generate means from B = 1000 samples, each of size n = 10, drawn from a Normal distribution with \(\mu=0\) and \(\sigma=1\):
    mn = numeric()
    for(i in 1:1000) {
      x     = rnorm(10, m=0, s=1)
      mn[i] = mean(x)
    }
    Since this code is so important, let us refer to this code block as THE RING. I will let you draw your own conclusions about my taste in movies.
  7. After you run those lines, the variable mn will contain 1000 sample means. Thus, we can view its distribution:
    hist(mn)
    or, if you think mn should follow a Normal distribution,
    overlay(mn)
    We can calculate its mean and variance,
    mean(mn)
    var(mn)
    We can determine whether it is “sufficiently” skewed,
    hildebrand.rule(mn)
    More importantly, we can test if the distribution of the sample means actually does follow a Normal distribution,
    shapiroTest(mn)
    Note that the ideas behind the Shapiro-Wilk test are beyond the course at this point. We will see them later.
    To determine if the distribution of the sample means is “sufficiently” Normal, just look at the “p-value” in the output. If the p-value is greater than 0.05, then the sample means are sufficiently Normal.
    Most likely, your p-value will be greater than 0.05. There is a chance that it is not (a 5% chance, in fact), so be aware of this. If you get a p-value that is too small, feel free to run The Ring again and re-check the results from the Shapiro-Wilk test. Most surely, the p-value will now be greater than 0.05.
    Since we are here, let me show you the default overlay plot. It provides the histogram of the measured sample means. It overlays a Normal curve on that histogram. And, in the Console window, it gives the mean and standard deviation.

    [normoverlay]

    As always, we could make the graphic spiffier, but this default really tells us the story of the distribution of the sample means: They really do look Normal. The average of those sample means is really close to \(\mu = 0\). The standard deviation of those sample means is close to \(1/\sqrt{n} \approx 0.32\).

Summary: And so, what this part has shown us is that the distribution of sample means appears Normal with expected value \(\mu\) and variance \(\sigma^2 / n\). This is a powerful result, especially if we are looking to understand the population mean… which we frequently do.
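
To see this summary in numbers, here is a small sketch that re-runs The Ring and compares the simulated mean and variance of mn to the theoretical values \(\mu = 0\) and \(\sigma^2/n = 1/10\). Your numbers will differ slightly from run to run because the samples are random.

    # Re-run The Ring: 1000 sample means, each from a N(0,1) sample of size n = 10.
    mn = numeric()
    for(i in 1:1000) {
      x     = rnorm(10, m=0, s=1)
      mn[i] = mean(x)
    }

    mean(mn)      # should be close to mu = 0
    var(mn)       # should be close to sigma^2/n = 1/10 = 0.1
    sd(mn)        # should be close to sigma/sqrt(n) = 1/sqrt(10), about 0.316
    1/sqrt(10)    # the theoretical standard deviation of the sample mean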

 

Part II: The First Extension

In the previous part, you discovered that the sample means are Normally distributed with mean equal to that of the population and standard deviation equal to that of the population divided by the square root of the sample size.

B’gosh and begorra! None of that is surprising because of the Central Limit Theorem and how we defined the mean and standard deviations. If the data are generated by a Normal process, then the sample means will be Normally distributed. BUT, what if the data do not come from a Normal distribution? What can we know about the distribution of the sample means? Well, let’s find out!

  1. Let’s start with The Ring and modify it to match our new data-generating process:
    mn = numeric()
    for(i in 1:1e3) {
      x     = rnorm(10, m=0, s=1)
      mn[i] = mean(x)
    }
    Recall that we used The Ring to better understand the distribution of the sample mean when the data are generated by a Normal process with mean \(\mu=0\) and standard deviation \(\sigma=1\). Find the place in The Ring where we specified that the data come from a N(0, 1) process.
    If we specify a different data-generating process, that will be the one and only line we need to alter. So, if the data come from an Exp(\(\lambda=1/2\)) process, this will be the changeable line:
    x = rexp(10, rate=1/2)
    If the data come from a Uniform distribution with minimum 4 and maximum 15, the changeable line will be
    x = runif(10, min=4, max=15)
    If the data come from a Binomial distribution with size 10 and success probability 0.50, the changeable line will be
    x = rbinom(10, size=10, prob=0.50)
  2. And so, let us examine the distribution of the sample mean when the data are generated from an Exponential process with mean 4. (Let us keep the sample size at 10.)
    Here is the new ring. Let us call it The Sandlot.
    mn = numeric()
    for(i in 1:1e3) {
      x     = rexp(10, rate=1/4)
      mn[i] = mean(x)
    }
  3. Run The Sandlot and determine the mean and standard deviation of the sample mean. Hopefully, you will not be surprised that the mean of those sample means is close to 4 (the mean of X) and that the standard deviation of those sample means is close to \(4/\sqrt{n}\). (A sketch that checks these values appears after this list.)
  4. You also may not be surprised that the distribution of the sample means is not sufficiently Normal. The Shapiro-Wilk test should return a p-value less than 0.05. What this means is the following:

    If the data are generated from an Exponential process and the sample size you are using to estimate the population mean is 10, then the sample means are not sufficiently Normal.

    So, if the sample means need to be Normally distributed for some analysis (in the future), and if the data are generated from an Exp(\(\lambda=0.25\)) process, then the sample size needs to be larger than n = 10.
    How large??? Let us examine that in the next part. ☺
  5. For the record, here is the approximate distribution of the sample means. You know how to create this histogram. Your distribution should be quite close.

    [exponential overlay]

    You should also note that the distribution of the sample means is much more Normal than the distribution of the data.
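
Here is the sketch promised in step 3. It re-runs The Sandlot and compares the simulated mean and standard deviation of the sample means to the values predicted for an Exp(\(\lambda = 1/4\)) population (mean 4, standard deviation 4) with samples of size n = 10. The last line uses base R's shapiro.test(), the standard Shapiro-Wilk test, standing in for the course's shapiroTest() function.

    # Re-run The Sandlot and check the simulated values against theory.
    mn = numeric()
    for(i in 1:1e3) {
      x     = rexp(10, rate=1/4)
      mn[i] = mean(x)
    }

    mean(mn)          # should be close to 4, the mean of an Exp(rate = 1/4) process
    sd(mn)            # should be close to 4/sqrt(10), about 1.265
    4/sqrt(10)
    shapiro.test(mn)  # Shapiro-Wilk test; expect a p-value well below 0.05 here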
[warning]
STOP! Take a deep breath and read back through the previous section. Review what you learned and what the activity suggests you will be doing in the next section. This is what you need to do to extract the maximum benefit from these SCAs. Again, the objective is to learn, not to finish.

 

Part III: The Second Extension

So, to ask again our previous question, what sample size is large enough so that the sample mean is sufficiently Normal when the data are generated from an Exp(λ=0.25) distribution? This section explores this question.

Basically, you will be running The Sandlot many times, varying the sample size each time, testing to see if the Shapiro-Wilk test consistently returns a p-value greater than 0.05. Here is the test for a sample size of 30:

mn = numeric()
for(i in 1:1e3) {
  x     = rexp(30, rate=1/4)
  mn[i] = mean(x)
}
  1. For the time that I ran this, the Shapiro-Wilk test indicated that the sample means are not sufficiently Normal (p = 0.0001844). Here is the histogram (with Normal overlay) for the distribution of sample means:

    [normoverlay]

    Note that the distribution is right-skewed.
  2. Because the samples are random, I reran The Sandlot about a dozen times. None of the p-values were greater than \(\alpha = 0.05\). So, a sample size of 30 is not sufficient to ensure that the sample means are Normally distributed.
  3. To check if n = 50 is sufficient for this Exponential distribution, I changed the sample size and reran The Sandlot about a dozen times. Out of those dozen times, only once was the p-value greater than \(\alpha = 0.05\). The other times, it was less. So, n = 50 is close, but it is not quite there. What about a sample size of 100? Will that be sufficient?
  4. Check. Change The Sandlot appropriately, run the Shapiro-Wilk test, and compare the p-value to 0.05. If you consistently get a p-value greater than \(\alpha = 0.05\) (not always — you should expect it to fall below 0.05 about 5% of the time), that is an acceptable sample size to ensure the sample means are close to having a Normal distribution. A sketch of one way to automate this check follows this list.
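
Here is one way to automate the check in step 4; this is only a sketch using my own (hypothetical) helper name, not part of the assigned code. The function clt.check() runs The Sandlot for a chosen sample size and returns the p-value from base R's shapiro.test(), which again stands in for the course's shapiroTest(). Repeating it a dozen times for n = 100 mimics the procedure described above.

    # Hypothetical helper (not part of the assigned code): run The Sandlot for a
    # given sample size n and return the Shapiro-Wilk p-value of the B sample means.
    clt.check = function(n, B=1e3, rate=1/4) {
      mn = numeric()
      for(i in 1:B) {
        x     = rexp(n, rate=rate)
        mn[i] = mean(x)
      }
      shapiro.test(mn)$p.value
    }

    # Repeat the check about a dozen times for a sample size of n = 100.
    replicate(12, clt.check(100))

If most of those twelve p-values are greater than 0.05, then n = 100 behaves the way the rule of thumb is supposed to: the sample means are close enough to Normal.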

 

 

Remember: The purpose of this SCA is to better understand the Central Limit Theorem. In many textbooks, like ours, there is a “Rule of Thumb” that a sample size of at least 30 is sufficiently large for the means to be “Normal enough.” When drawing from an Exp(λ=1/4) distribution, we saw that this was not the case. What sample size was needed in that case? It looks to be about n = 100.

 

 

This page was last modified on 2 January 2024.
All rights reserved by Ole J. Forsberg, PhD, ©2008–2024. No reproduction of any of this material is allowed without explicit written permission of the copyright holder.