Introductory Statistics

 

Lab B: Discrete Distributions and Sampling

The purpose of this laboratory activity is to help you better understand the relationships between probability distributions and the phenomena they describe. Here, we look more closely at random sampling and the effects that “small” decisions have on the estimators and their precision.

Remember that these activities are designed to help you master the material. However, mastery also requires you to think through the steps and seek connections. No assignment will force you to learn. Learning requires a conscious decision, and the resulting effort, on your part.

In the first laboratory activity, I provided the script for a variety of reasons. One of those was to allow you to see the relationship between the lab and the script. Since you will be creating your own scripts to do statistical analyses in the future, I am not providing the script for this lab (or for any of the following labs). Thus, you will have to think about how to structure your comments to make your script understandable. The metaphor to use is the essay. Structure your script like an essay with section headings.
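
One possible layout for such a script is sketched below. The heading wording, the placeholder ID 123456, and the placement of the comments are illustrative assumptions only, not a required template.

```r
############################################################
# Lab B: Discrete Distributions and Sampling
# Author: (your name)        Date: (date of the lab)
############################################################

### Preamble
# In your real script, source the usual course file here.
set.seed(123456)   # placeholder; use your own ID number

### Part I: Binomial (sampling with replacement)
# Question: What is the mean of the proportions?
# (the code that answers this question goes here)

### Part II: Hypergeometric (sampling without replacement)
# (code goes here, one commented block per question)

### Part III: Geometric (ask until the first sophomore)
# (code goes here, one commented block per question)
```

Each heading plays the role of an essay section, and each comment states which question the block beneath it answers.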

The Pre-Lab

All of these laboratory activities have pre-labs that must be completed prior to class on lab day (Moodle Quizzes). These pre-labs consist of a brief quiz covering the lab. You need to take the quiz before class on lab days. While it is possible to do well on the quiz without going through and thinking about the lab material, don’t do it. You are cheating yourself.

These pre-labs do not have to be done the night before the lab. I strongly recommend that you do them in the days leading up to the lab. There will frequently be a week between when we cover a topic in class and when there is a lab on it. Do yourself a favor and start early on the pre-labs.

Also, treat the pre-labs as being material to help you prepare. These are not busy work. You should be able to enter class on lab day and teach all of the material in the lab to us in the class. Learning takes time. Remember that you should be willing to spend at least 15 hours a week on the course material. Your primary job is to be a student. Do not cheat yourself.

✦•······················•✦•······················•✦

Moodle

And so, with no further ado, please complete the Moodle quiz and submit it before class begins.

The quiz is a partial check on your reading of this lab. While you may be able to pass the quiz without thoughtfully doing it, you are cheating yourself. Given how much you are spending on your education, be thoughtful.

The Lab

This lab starts the same way as all of them do: start R, start a new script, set the working directory, and source the usual file from that script:

source("http://rfs.kvasaheim.com/stat200.R")

We will be using random numbers in this activity. For science to occur, results must be replicable, so get in the habit of setting the pseudo-random number seed. Set it to your student ID number. That is, if your ID number is 123456, you would next run in your script window

set.seed(123456)
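
To see what the seed does, the following small sketch can be run in a separate scratch session (not in your lab script, since it would change where your random number stream stands). The variable names first, second, and third are hypothetical.

```r
# Same seed, same draws: the two vectors below are identical.
set.seed(123456)                        # placeholder ID from the text
first  = rbinom(5, size=200, prob=365/1324)

set.seed(123456)                        # reset to the same starting point
second = rbinom(5, size=200, prob=365/1324)

identical(first, second)                # TRUE

# Without resetting the seed, the stream simply continues,
# so these five draws are a new set of values.
third = rbinom(5, size=200, prob=365/1324)
```

This is why running the entire script from the beginning matters: the values you get depend on the seed and on every random draw made since it was set.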

The purpose of this is to allow me to check that you are reporting the numbers correctly. As such, before reporting your final numbers, run the entire script from the beginning. This will ensure that your numbers are the ones I see. This is very important, muy importante, très important, sehr wichtig, etc. If your numbers do not agree with mine, yours are wrong.

* * *

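For this activity, we will look at three different methods for collecting data using simple random sampling, noting that each method produces counts that follow a different distribution. The three methods are

  • sampling with replacement (Binomial),
  • sampling without replacement (Hypergeometric), and
  • sampling until the first sophomore is found (Geometric).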

We then estimate the proportion of students at Knox who are sophomore-level by dividing the number of sophomores in our sample by the number of students sampled.

Lastly, we look at how the three estimators behave, especially with respect to the true sophomore-proportion of 365/1324 ≈ 27.6%.

Part I: The Binomial Distribution

Let us (virtually) take a random sample from the target population. The sample is of size n = 200. If we allow for asking the same person twice (i.e., sample with replacement), then the number of sophomores in our sample follows a Binomial distribution.

Why is it Binomial? Check the five requirements.
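
As an aid to thinking about the requirements, here is a minimal sketch of what "asking 200 students with replacement" could look like if done by hand in R. The roster vector and the names asked and count are hypothetical. Try it in a scratch session rather than in your lab script, because it uses random numbers.

```r
# Hypothetical roster: 365 sophomores (TRUE) among 1324 students.
roster = rep(c(TRUE, FALSE), times=c(365, 1324-365))

# Draw 200 names with replacement; the same student may be asked twice.
asked = sample(roster, size=200, replace=TRUE)

# Count the sophomores in the sample, then convert to a proportion.
count = sum(asked)
count/200
```

Because the roster is never depleted, every one of the 200 draws has the same chance, 365/1324, of landing on a sophomore.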

Sampling a Binomial

The following R code will generate (simulate) the outcome of asking 200 students if they are sophomores and store it in the variable SophB.

SophB = rbinom(1, size=200, prob=365/1324)

To see the outcome, run SophB in the Console window. Note that this is not the proportion; it is the number of sophomores in our random sample from the population. To calculate a proportion, we simply divide the count by the size of our sample, n = 200. The following line draws a fresh sample and divides it by 200:

pSophB = rbinom(1, size=200, prob=365/1324)/200

Note that pSophB is called an estimator because it estimates a parameter of interest, here the proportion of sophomores in the population. Note also that the number of values in pSophB is 1. That observed value is called an estimate. We cannot understand the behavior of estimators unless we have many, many, many estimates with which to work.

To generate a million such estimates, we run

pSophsB = rbinom(1e6, size=200, prob=365/1324)/200

Note that 1e6 is equivalent to 1 × 10^6, which is a million (1,000,000).

After running this line, the variable pSophsB holds a million estimates, each of which is a proportion. Since we have so many values, we have a lot of information about the distribution of proportions. Answer the following questions:

  • What is the mean of the proportions? I got 0.276.
  • What is the standard deviation? I got 0.032.
  • Describe the histogram of the observed proportions.
  • Determine a central 95% observation interval. I got from 0.215 to 0.340.
  • Is the true value of 365/1324 in that interval?


Remember that the 95% observation interval can be calculated using this code:

quantile(pSophsB, c(0.025,0.975))
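
The simulated summaries can be compared against what Binomial theory predicts. The sketch below uses no random numbers, so it does not disturb your seed. The names meanTheory, sdTheory, and intTheory are hypothetical.

```r
p = 365/1324      # true proportion of sophomores
n = 200           # sample size

# Theoretical mean and standard deviation of the sample proportion
# when the count is Binomial(n, p).
meanTheory = p
sdTheory   = sqrt(p*(1-p)/n)

# Theoretical central 95% interval from the Binomial quantile function.
intTheory = qbinom(c(0.025, 0.975), size=n, prob=p)/n

meanTheory
sdTheory
intTheory
```

With a million estimates, mean(pSophsB), sd(pSophsB), and the quantile() result above should land close to these values.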

As an interesting aside, the observation interval and the confidence interval are identical in this case. This is because we are “observing” a measure of center, the sample proportion.

Part II: The Hypergeometric Distribution

The previous part assumed that individuals could be contacted more than once (i.e., with replacement). That seems to be a waste of energy. Why ask the same person more than once? To increase efficiency in our estimates, let us now specify that individuals cannot be asked more than once (i.e., without replacement).

Under this circumstance, the observed number of sophomores will follow a Hypergeometric distribution.

Why is this the case? What is it about this method that makes the random variable follow a Hypergeometric distribution?

In answering this, think about how the Hypergeometric distribution differs from the Binomial distribution. The difference is slight, but important!

As in most things in life:

The key to success is paying attention to the details.

Sampling a Hypergeometric

The following will generate a million proportions when the counts are drawn without replacement:

pSophsH = rhyper(1e6, m=365, n=1324-365, k=200)/200

Answer the following questions:

  • What is the mean of the proportions? I got 0.276.
  • What is the standard deviation?
  • In the function call, what do m, n, and k represent?
  • Describe the histogram of the proportions.
  • Determine a central 95% interval. I got from 0.220 to 0.335.
  • Is the true value of 365/1324 in that interval?
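
One way to quantify the "slight, but important" difference between the two designs is the finite population correction, which shrinks the Binomial standard deviation when sampling without replacement. The following is a sketch of that calculation, with hypothetical variable names. It uses no random numbers.

```r
N = 1324          # population size
p = 365/N         # true proportion of sophomores
n = 200           # sample size

sdBinom = sqrt(p*(1-p)/n)       # sample proportion, with replacement
fpc     = sqrt((N-n)/(N-1))     # finite population correction
sdHyper = sdBinom*fpc           # sample proportion, without replacement

c(sdBinom=sdBinom, fpc=fpc, sdHyper=sdHyper)
```

The correction is always below 1 when n > 1, so the without-replacement estimator is never less precise than the with-replacement one for the same sample size.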

Part III: The Geometric Distribution

The above two designs fix the number of students being asked. The following does not. Its random variable is the number of non-sophomores asked before the first sophomore is found. Thus, if the outcome of the Geometric random variable is 5, then 5 people were asked before the first success (before a sophomore was found); that is, six were asked, the last was a sophomore, and the estimate would be 1 / (5+1).

What are the differences between the Geometric distribution and the previous two distributions?

In answering this question, return to how the previous two distributions were defined. What were the requirements of the Binomial and the Hypergeometric? Knowing this will help you better identify the appropriate distribution.

Understanding how random variables are generated helps to better understand the differences between — and among — them.

Sampling a Geometric

The following code generates a million proportions when the data are generated using the “ask until your first success” method.

pSophsG = 1/(1+rgeom(1e6, prob=365/1324))

Double-check that you understand why the denominator is 1+rgeom(1e6, prob=365/1324). Understanding this point will help you better understand why pSophsG gives an appropriate proportion.
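
A small worked case may help with the denominator. The answers vector below is a made-up sequence of responses, and the names failures and asked are hypothetical.

```r
# Hypothetical answers, in the order asked: TRUE means "sophomore".
answers = c(FALSE, FALSE, FALSE, FALSE, FALSE, TRUE)

failures = which(answers)[1] - 1   # non-sophomores before the first sophomore
asked    = failures + 1            # everyone asked, including the sophomore

failures                           # 5, the kind of value rgeom() returns
1/asked                            # 1/6, the estimated proportion
```

Since rgeom() reports only the failures, adding 1 restores the sophomore who ended the sampling.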

Answer the following questions:

  • What is the mean of the proportions? I got 0.491.
  • What is the standard deviation?
  • Describe the histogram of the proportions.
  • Determine a central 95% observation interval. I got from 0.083 to 1.000.
  • Is the true value of 365/1324 in that interval?
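
The simulated mean can be checked against the theoretical expected value of 1/(1+X) when X is Geometric. The sketch below computes it two ways, as a truncated sum and in closed form, under the assumption that truncating at 2000 failures loses a negligible amount of probability. The variable names are hypothetical, and no random numbers are used.

```r
p = 365/1324                  # true proportion of sophomores
x = 0:2000                    # failures before the first success

# E[1/(1+X)] as a (truncated) sum over the Geometric probabilities.
expSum = sum(dgeom(x, prob=p)/(1+x))

# Closed form of the same expectation: -p*log(p)/(1-p).
expClosed = -p*log(p)/(1-p)

c(expSum=expSum, expClosed=expClosed, truth=p)
```

Both calculations give roughly 0.49, which is consistent with the simulated mean reported above and well away from 365/1324.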

The Post-Lab

That is all there is to this second laboratory activity. Note that the lab supports the post-lab, but does not echo it. If you understand the lab (are able to teach it to others), then the post-lab will be clearer. The post-lab tests your ability to think through (ask self-questions about) the lab, what it illustrates about randomness, and how it ties into the readings and lectures in the course. Make no mistake. The post-lab questions will be at the higher levels of Bloom’s Taxonomy. You will need to think about the lab activity you just completed, perhaps working through and contemplating it a few times.

Mastery takes effort. Aim for mastery.

These are the post-lab questions:

  1. Discuss the differences in the three data-collection methods, specifically their assumptions and how you would physically carry out each to estimate the proportion of students who are Sophomores at Knox.
  2. Compare the three methods in terms of how close the point estimates (means) were to the real value. Also, compare the methods in terms of the interval widths. (Is a wider or narrower interval better? Why?)
  3. If your job is to provide the best estimate and most-precise interval for the proportion of students who are sophomores at Knox College, which data collection/analysis method would you use? Why?

Remember that the post-lab is graded on correctness as well as your ability to express yourself well. Spend time making sure that there are no errors. Include actual values from your analysis to support each of your answers. This last point is important. Without including the statistics you calculated, you are not grounding your answers in reality, and your grade will reflect that.

Your first page is the title page. Your title page should include your name, the lab title, the date of the lab, and your Knox ID. After you answer the above three questions, start a new page and start your code appendix. In that appendix, include the script you used to answer the post-lab questions, properly commented. To be “properly commented,” I should know how each block of lines addresses a question or problem in the lab.