The Lab
This lab starts in the same way as all do: start R, start a new script, set the working directory, and source the usual file from that script:
source("http://rfs.kvasaheim.com/stat200.R")
We will be using random numbers in this activity. So, to get you in the habit of doing this (for science to occur, the results must be replicable), you will need to set the pseudo-random number seed. Do this according to your student ID number. That is, if your ID number is 123456, you would next run in your script window
set.seed(123456)
The purpose of this is to allow me to check that you are reporting the numbers correctly. As such, before reporting your final numbers, run the entire script from the beginning. This will ensure that your numbers are the ones I see. This is very important, muy importante, très important, sehr wichtig, etc. If your numbers do not agree with mine, yours are wrong.
Overview
An estimator is a function that estimates a population parameter. An estimate is a value calculated from the data using the estimator. We have already dealt with estimates and estimators many times. We have used the sample mean as our estimator of the population mean. We have also used the sample proportion as our estimator of the population proportion and the sample variance as our estimator of the population variance. Why do you think we used these estimators? What makes a good estimator?
In this lab, we will define what we mean by a “good” estimator and then determine advantages in using different estimators. In all cases in this lab, we will be using a symmetric distribution to explore bias and precision.
Part I: The Bias
One way of comparing the quality of two estimators is to compare their biases. The bias is defined as the expected difference between the estimator and the parameter. In other words, it is the average of the estimation errors. All things being equal, one would hope for an estimator with no bias — an unbiased estimator.
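Written as a formula: if T is the estimator and θ is the parameter it targets, then

bias(T) = E[T − θ] = E[T] − θ

An unbiased estimator is one for which this quantity is exactly zero.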
Estimating the Population Mean
The first set of examples tries to answer the question:
- Why do we use the sample mean to estimate the population mean?
To try to answer this question, we focus on the bias of the estimators. The three estimators are the sample mean, the sample median, and the sample midpoint. Logically, all three should be good estimators of the population mean. But are they?
Sample Mean
Let us start investigating the sample mean as an estimator of the population mean. What do these two lines do?
x = rnorm(100, m=7, s=3)
mean(x)
If you run these two lines, you will discover that the mean of the sample is not the mean of the distribution. The sample mean I got was 7.050459. The population mean is 7. The error is defined as “observed minus expected,” 7.050459 − 7 = 0.050459.
That is the error for that one sample.
The bias is the average error.
To estimate the average error, we need to repeat the previous code many, many, many times so that we have many, many, many errors. That will allow us to better understand the error.
Here is an estimate of the bias of the sample mean estimator. Make sure you understand why this estimates the average error.
estimates = numeric()            # storage for the estimates
for(i in 1:1e4) {                # repeat the experiment 10,000 times
x = rnorm(100, m=7, s=3)         # draw a sample of size 100 from a Normal with mean 7, sd 3
estimates[i] = mean(x)           # record this sample's estimate (the sample mean)
}
mean(estimates)                  # the average of the 10,000 estimates; compare it to 7
Running this gives me a value of 6.998287. So, the estimated bias is that value minus the population mean. The estimated bias is -0.001713, which is quite small. For investigations into bias, a rule of thumb is to report two decimal places if using 10⁴ iterations and three if using 10⁶ iterations. Thus, the estimated bias to two decimal places is 0.00. That is, the estimator appears to be unbiased.
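A rough way to see where that rule of thumb comes from: the Monte Carlo uncertainty in the average of the estimates is about sd(estimates)/√(number of iterations). In this example, sd(estimates) is about 0.3, so with 10⁴ iterations the uncertainty is roughly 0.3/100 = 0.003 and digits beyond the second decimal place are mostly simulation noise; with 10⁶ iterations it drops to roughly 0.0003.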
Of course, there is a mathematical way of proving that the sample mean is an unbiased estimator. However, I am teaching you how to estimate bias under all circumstances. Mathematical proofs give exact results, but only for a limited subset of problems.
Sample Median
Since the Normal distribution is symmetric, we could also use the sample median as an estimator of the mean. Here is the code to do that.
estimates = numeric()
for(i in 1:1e4) {
x = rnorm(100, m=7, s=3)
estimates[i] = median(x)
}
mean(estimates)
Only one line is changed. When I ran this, I got an estimate of 7.002383, which leads to an estimated bias of +0.002383. Again, since there are 10⁴ iterations, we should only report to two decimal places… the estimated bias using the median is 0.00. It, too, is apparently unbiased. [Note: Again, this could be proven mathematically. However, we could not prove things mathematically if the data were Exponential or Uniform. These “Monte Carlo” techniques are much more general.]
Sample Midpoint
There is a third estimator of the population mean of a symmetric distribution that I can think of: the sample midpoint. The midpoint is the average of the maximum and minimum values in a data set. The function to calculate the maximum value is max, and the function to calculate the minimum value is min. Here is most of the code. Before you look at the completed estimator line, figure out what should go there.
estimates = numeric()
for(i in 1:1e4) {
x = rnorm(100, m=7, s=3)
estimates[i] = 0.5*(max(x) + min(x))
}
mean(estimates)
You should again conclude that the midpoint estimator is apparently unbiased (or almost unbiased).
Note, however, that this result cannot be proven mathematically. It seems rather easy to prove, but because the Normal distribution has a domain from -∞ to +∞, things that seem easy to prove are not.
Summary
And so, this section defined bias, showed how to estimate it, and provided three estimators of the population mean when the data are generated from a Normal distribution. This does not seem to answer our original question, however. In terms of bias, all three estimators are equally good. So, why do we use the sample mean?
In all of the above, the sample size for the mean was n=100. What would change in the results if we used a sample size of n=1,000,000?
Estimating the Population Proportion
The above looked at estimators of the population mean. Let us now examine two estimators of the population proportion. Note that for proportion estimation, we will draw from a Binomial distribution with success probability p = 0.55.
Sample Proportion
Let us examine the case of estimating the population proportion using the sample proportion, x/n. The procedure is quite similar. Check out what is different in the following code.
estimates = numeric()
for(i in 1:1e4) {
x = rbinom(1, size=10, prob=0.55)
estimates[i] = x/10
}
mean(estimates)
The average estimate is 0.54867. The population proportion is 0.55. Thus, the estimated bias in using the sample proportion is 0.54867 − 0.55 = -0.00133. Because we used 10⁴ iterations, we round this to the nearest hundredth to get an estimated bias of 0.00.
Again, mathematics can prove that this estimator is unbiased. [Actually, you have the ability to do this mathematically.] However, you are gaining skills for checking bias in general… regardless of the data’s distribution or the estimator.
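In fact, that proof takes only a line: if X ~ Binomial(n, p), then E[X/n] = E[X]/n = np/n = p, so the expected value of the sample proportion is exactly the population proportion.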
Agresti-Coull Estimator
The second estimator of the population proportion we will use is the Agresti-Coull estimator (1998). It is (x+1)/(n+2). Let’s see if this estimator is also unbiased.
estimates = numeric()
for(i in 1:1e4) {
x = rbinom(1, size=10, prob=0.55)
estimates[i] = (x+1)/(10+2)
}
mean(estimates)
For me, the average estimate was 0.5402917, which corresponds to an estimated bias of -0.0097083. This is likely not zero. Comparing this bias to the value of the parameter shows just how biased the estimator is: the relative bias is -0.0097083/0.55, which is approximately -1.8%! The Agresti-Coull estimator is a biased estimator (math can prove this, too).
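We can also work out the exact bias in this case: E[(X+1)/(n+2)] = (np+1)/(n+2) = (10 × 0.55 + 1)/12 = 6.5/12 ≈ 0.5417, so the true bias is 0.5417 − 0.55 ≈ -0.0083. Our simulated value of -0.0097 is a (noisy) estimate of this quantity.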
Why would we ever want to use a biased estimator?
In all of the above, the number of iterations was 10,000. What would change if we increased the number of iterations to 1,000,000?
Part II: The Stability
Part I looked at the bias of an estimator. This part looks at its stability (variability, uncertainty, precision). Estimators that are less stable are inherently more uncertain and undesirable. It does little good to have an unbiased estimator if half the time it is 5000 too high and the other half 5000 too low.
Remember that in practice, we have a single number as our estimate of the parameter. We know nothing about how it relates to the population parameter of interest except for what we know about the behavior of the estimator. Thus, it is important to understand this behavior.
Estimators of μ
This section looks at our three estimators of μ. Recall that all three were unbiased. Will it be true that all three have the same variability?
Sample Mean
This code estimates the variability of the sample mean estimator.
estimates = numeric()
for(i in 1:1e4) {
x = rnorm(100, m=7, s=3)
estimates[i] = mean(x) ## <- the estimator
}
var(estimates)
By now, you should be able to determine what this script does and why it estimates the variability of the sample mean estimator. Take your time and go over the lines of code. This is how you learn to think through problems step by step. This is problem solving!
When I ran this, I got 0.09090712. Note that this is close to σ²/n = 3²/100 = 0.09, which is exactly what we would expect in this case. We know the distribution of the sample mean when the variable is from a Normal distribution: the sample mean is itself Normal, with mean μ and variance σ²/n.
What did you get? It should be close to 0.09 or there is something wrong with your code.
Sample Median
This code estimates the variability of the sample median estimator.
estimates = numeric()
for(i in 1:1e4) {
x = rnorm(100, m=7, s=3)
estimates[i] = median(x) ## <- the estimator
}
var(estimates)
Again, slow down and work through the script by hand. What does each line do, and why does this script estimate the variance of the sample median estimator? You should also be able to create this code for other estimators.
This gave me 0.1409326. (What does it give you?) Thus, it appears as though the sample median estimator is about 50% more variable than the sample mean estimator. [It can be shown that, for Normal data, the variance of the sample median is asymptotically π/2 ≈ 1.57 times that of the sample mean.] So, while they are both unbiased, the sample mean is less variable, thus providing estimates that are more precise.
What did you get? It should be close to 0.14 or there is something wrong with your code.
Low variability is high precision.
Sample Midpoint
Here is your chance to determine whether the sample mean is better or worse than the sample midpoint estimator. This is also your chance to stretch and create the necessary code… well, you will copy previous code and make the small, necessary change. When you have done this, write out the code here.
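Once you have written your own version, you can compare it against this sketch; the only change from the sample-mean variability code is the estimator line:

estimates = numeric()
for(i in 1:1e4) {
x = rnorm(100, m=7, s=3)
estimates[i] = 0.5*(max(x) + min(x)) ## <- the estimator (the midpoint)
}
var(estimates)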
My run gave me a variance of 0.8244868. Thus, the sample midpoint is roughly nine times as variable as the sample mean. This is a great reason to use the sample mean instead of the sample midpoint to estimate μ.
Proportion Estimators
Which has a higher variability: the sample proportion or the Agresti-Coull estimator? Figure out the code, run it, and determine which we should use. For this experiment, use the same settings as above, a sample size of n = 10 and a population proportion of p = 0.55.
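If you get stuck, here is one possible setup; it reuses the bias code from earlier, changing only the final line:

## Variability of the sample proportion
estimates = numeric()
for(i in 1:1e4) {
x = rbinom(1, size=10, prob=0.55)
estimates[i] = x/10
}
var(estimates)

## Variability of the Agresti-Coull estimator
estimates = numeric()
for(i in 1:1e4) {
x = rbinom(1, size=10, prob=0.55)
estimates[i] = (x+1)/(10+2)
}
var(estimates)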
To check your work, you should have found that the Agresti-Coull estimator is noticeably less variable: the variance of the sample proportion is over 40% larger than that of the Agresti-Coull estimator. Thus, by this criterion, one should use the biased Agresti-Coull estimator instead of the unbiased sample proportion.
Part III: The Mean Square Error (MSE)
So far, we have looked at two different ways of determining estimator quality: bias and variance. We want low bias and low variance.
In the case of the population mean, our three estimators were all unbiased. The sample mean had the lowest variance. Thus, we probably should use the sample mean to estimate the population mean.
However, when comparing the sample proportion and Agresti-Coull estimators, one had a lower bias and the other had a lower variance. Which should we use? It would be nice if there were a measure of estimator quality that combined bias and variance into a single meaningful number.
There is. It is called the mean square error (MSE). It is calculated as the mean (average) of the squared errors. Since it is also equivalent to the sum of the variance and the square of the bias, this measure takes into account both the average error and the variance of the estimator. (The squaring is important as the bias has the same units as the data, but the variance has the square of the data units.)
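In symbols, with T the estimator and θ the parameter as before:

MSE(T) = E[(T − θ)²] = Var(T) + [bias(T)]²

You can check the decomposition with the results from Parts I and II: for the sample mean, a variance of about 0.09 plus a squared bias of 0.00² gives an MSE of about 0.09, which the code below confirms.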
Estimating the Population Mean
Estimating the MSE is just as easy as it sounds. Here is the estimated MSE of the sample mean estimator:
estimates = numeric()
for(i in 1:1e4) {
x = rnorm(100, m=7, s=3)
estimates[i] = mean(x) ## <- the estimator
}
mean( (estimates-7)^2 )
For me, the estimated MSE was 0.09082126. This is not surprising, since the MSE is the sum of the squared bias and the variance. The bias for the sample mean estimator is 0. Thus, for unbiased estimators, the MSE is just the variance.
I ran similar code for the median:
estimates = numeric()
for(i in 1:1e4) {
x = rnorm(100, m=7, s=3)
estimates[i] = median(x) ## <- the estimator
}
mean( (estimates-7)^2 )
and got 0.1390497. This is much higher than the MSE of the sample mean estimator. What did you get?
Finally, running similar code for the midpoint estimator gave me a MSE of 0.82. It is the highest of these three estimators of the population mean.
By the way, where does the 7 come from in the code for the MSE in the above two blocks of code?
The following table summarizes these results for me.
| Estimator | Bias | Variance | MSE |
| --- | --- | --- | --- |
| Sample Mean | 0.00 | 0.09 | 0.09 |
| Sample Median | 0.00 | 0.14 | 0.14 |
| Sample Midpoint | 0.00 | 0.82 | 0.82 |
The following table is for you to use to record your own values.

| Estimator | Bias | Variance | MSE |
| --- | --- | --- | --- |
| Sample Mean | | | |
| Sample Median | | | |
| Sample Midpoint | | | |
Thus, given the MSE values for the three estimators, one should prefer the sample mean… as long as the data do come from a Normal distribution.
These conclusions are based on the assumption that the data come from a Normal distribution. Different answers may arise for different data-generating processes.
Estimating the Population Proportion
Finally, which estimator of the population proportion is better under these conditions (sample size = 10, population proportion = 0.55)? Figure out the code, run it, and discover. Here is some space for you to write out the code before trying it in R.
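Once you have made your own attempt, here is one possible sketch to compare against; note that the true parameter being subtracted is now 0.55:

## MSE of the sample proportion
estimates = numeric()
for(i in 1:1e4) {
x = rbinom(1, size=10, prob=0.55)
estimates[i] = x/10
}
mean( (estimates-0.55)^2 )

## MSE of the Agresti-Coull estimator
estimates = numeric()
for(i in 1:1e4) {
x = rbinom(1, size=10, prob=0.55)
estimates[i] = (x+1)/(10+2)
}
mean( (estimates-0.55)^2 )

The bias and variance columns of the table can be filled in the same way, using mean(estimates) and var(estimates).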
The following table is for you to use to summarize your work with estimating the population proportion.

| Estimator | Bias | Variance | MSE |
| --- | --- | --- | --- |
| Sample Proportion | | | |
| Agresti-Coull | | | |
So, which estimator is preferred in terms of MSE?
Explain, using statistics.
The Post-Lab
Please answer the following post-lab questions as usual.
1. Which estimator is better in terms of bias, the sample proportion or the Agresti-Coull estimator?
2. Which estimator is better in terms of mean square error, the sample proportion or the Agresti-Coull estimator?
3. The following are histograms of two estimators of the population mean, when μ = 2. Which is the histogram for the sample mean and which is the histogram for the sample midpoint? Explain how you know. Be clear!
Remember that the post-lab is based on correctness as well as your ability to express yourself well. Spend time making sure that there are no errors. Include actual values from your analysis to support each of your answers. This last point is especially important for questions 1 and 2. Without including the statistics you calculated, you are not grounding your answers in reality, and your grade will reflect that.
Make sure you include your script from this lab— properly commented. Your script should include what is in the lab as well as what you do to answer question 2. No script = No points.
* * *
As always: your first page is the title page. Your title page should include your name, the lab title, the date of the lab, and your Knox ID. Start a new page and answer the post-lab questions. After you answer those questions, start a new page and start your code appendix.