Introductory Statistics

 

Lab A: Exploring Sampling Schemes

The main purpose of this particular laboratory activity is to help you better understand the five main sampling schemes. As this is the first lab, I have written more on this than on a typical lab activity.

Remember that these are designed to help you master the material (better understand statistics). That means that this also requires you to think through the steps and seek connections. No assignment will force you to learn. Learning requires a conscious decision and resulting effort on your part.

To help you focus more on the thought processes behind each of the sampling schemes, I am providing the script. Please download this to your Lab folder. Start R from that folder and open the script. Once it is open in R, follow the lab directions below.


Download this Lab’s Script Here



 

The Pre-Lab

All of these laboratory activities have pre-labs that must be completed prior to class on lab day (Moodle Quizzes). These pre-labs consist of a brief quiz covering the lab. You need to take the quiz before class on lab days. While it is possible to do well on the quiz without going through and thinking about the lab material, don’t do it. You are cheating yourself.

These pre-labs do not have to be done the night before the lab. I strongly recommend that you do them in the days leading up to the lab. There will frequently be a week between when we cover a topic in class and when there is a lab on it. Do yourself a favor and start early on the pre-labs.

Also, treat the pre-labs as being material to help you prepare. These are not busy work. You should be able to enter class on lab day and teach all of the material in the lab to us in the class. Learning takes time. Remember that you should be willing to spend at least 15 hours a week on the course material. Your primary job is to be a student. Do not cheat yourself.

✦•······················•✦•······················•✦

 Moodle

And so, with no further ado, please complete the Moodle quiz and submit it before class begins.

The quiz is a partial check on your reading of this lab. While you may be able to pass the quiz without thoughtfully doing it, you are cheating yourself. Given how much you are spending on your education, be thoughtful.

The Lab

Part O

Remember that you downloaded the script for this lab above. Thus, you will not need to do typing. You will, however, need to understand the thoughts that went into writing the script.

This lab starts in the same way as all do: Start R, set the working directory, start a new script, and source the usual file from that script,

source("http://rfs.kvasaheim.com/stat200.R")

We will be using random numbers in this activity. So, to get you in the habit of doing this (for science to occur, the results must be replicable), you will need to set the pseudo-random number seed. Do this according to your student ID number. That is, if your ID number is 123456, you would next run

set.seed(123456)

The purpose of this is to allow me to check that you are reporting the numbers correctly. As such, before reporting your final numbers, run the entire script from the beginning. This will ensure that your numbers are the ones I see. This is very important, muy importante, très important, sehr wichtig, etc. If your numbers do not agree with mine, yours are wrong.
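To see why re-running from the beginning matters, here is a minimal sketch (using the example ID number 123456 from above, not your own) showing that the same seed followed by the same commands gives the same random draw. The names firstDraw and secondDraw are made up for this illustration.

```r
# Example ID number from the text, used only for illustration
set.seed(123456)
firstDraw = sample(1000, 5)

# Resetting the same seed restarts the pseudo-random sequence
set.seed(123456)
secondDraw = sample(1000, 5)

identical(firstDraw, secondDraw)   # TRUE: same seed, same commands, same sample
```

If any random command is run between set.seed and the line you care about, the draw changes. That is why the whole script should be run from the top before reporting numbers.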

As with all sciences:

Precision and accuracy are both important.

In this lab, we will be looking at the five sampling methods covered in class. Working through this lab helps you better understand how the methods are similar and how they differ.

As we are dealing with external data in this lab, you will need to import the data from the Internet to your computer. This is accomplished with the lines

dt = read.csv("http://rfs.kvasaheim.com/data/heightPop.csv")
attach(dt)

The first line tells R to go to the provided URL and read the entire csv data file into the variable dt. [Note: I called it dt simply from habit. Feel free to name it whatever you want, like “heightData”.] The variable dt is now a “dataframe.” A dataframe is a container that can hold several different types of variables in a meaningful way. In this case, the dataframe contains the variables height, gender, and zipCode. You can think of a dataframe as a table of values. Each column is a variable. Each row is a record. Here, each record corresponds to a different individual in the population.

In the above code, attach(dt) just makes it easier to access the variables in the dataframe. Without including the second line, one would need to preface all variables with dt$. Thus, the attach function in R simply makes it easier to access the variables in the data set.
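As a small sketch of what attach does, the following uses a made-up three-person dataframe called toy (an assumption for illustration, not the lab's data). It computes the same mean twice: once with the toy$ prefix and once after attaching.

```r
# Toy dataframe with made-up values (not the lab's data)
toy = data.frame(height  = c(64, 70, 68),
                 gender  = c("F", "M", "M"),
                 zipCode = c(61401, 61462, 61401))

viaDollar = mean(toy$height)   # column accessed through the dataframe

attach(toy)                    # make the columns visible by name
viaAttach = mean(height)       # no toy$ prefix needed now
detach(toy)                    # undo the attach when finished

identical(viaDollar, viaAttach)   # TRUE: both refer to the same column
```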

These Data

Now that the dataset is stored in your computer’s memory, you can do many things with it. Would you like to see the entire dataset? Enter the following in the Console window:

dt

Would you like to just see the names of the variables in the dataset? Enter the following in the Console window:

names(dt)

Would you like to see a basic summary of each variable in the dataset? Enter the following in the Console window:

summary(dt)

Enter these in the Console window, since you are not going to want to save the lines in your script. The first two are usually done to ensure that the dataset was correctly loaded. The last is usually done to make sure that R is viewing the data in the way you intend: the numeric variables are seen as numeric and the categorical variables are seen as categorical.

The Population

The dataset contains the population of people I owe money to in these two zip codes. While it would be trivial to determine the average height of this population (since I have the population), I would like to explore the five main sampling schemes: simple random sampling, stratified sampling, cluster sampling, systematic sampling, and convenience sampling.

In each of the following sampling schemes, I will sample a total of n = 50 people from the population.

Part I: Simple Random Sampling

The first part of this lab is to perform simple random sampling (SRS) to estimate the average height. Recall that SRS randomly selects the individuals from the entire sampling frame. Each individual has the same probability of being selected to be a part of the sample.

There are N = 1000 people in this population (records contained in the dataframe). Let us randomly select n = 50 people from the population and store their ID numbers in a variable called sampledPeople.

sampledPeople = sample(1000, 50)

Remember the set.seed function above? That line influences the line above. It helps ensure that your sample will differ from everyone else’s sample. It also ensures that I can determine your sample.

Do you want to see the ID numbers (row numbers) of the sample you are selecting? Run this line in the Console window:

sampledPeople

The numbers outputted are the ID numbers of the people in your sample (row/record numbers in the datafile).
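If you would like to convince yourself that this really is a simple random sample of row numbers, here is a short check. It is only a sketch: the seed shown is the example ID number from earlier, not yours.

```r
set.seed(123456)                  # example seed, for illustration only
sampledPeople = sample(1000, 50)  # 50 row numbers drawn without replacement

length(sampledPeople)             # 50 people were selected
anyDuplicated(sampledPeople)      # 0 means no person was selected twice
range(sampledPeople)              # smallest and largest row number drawn
```

By default, sample draws without replacement, which is why no row number can appear twice.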

Heights of Your Sample

Do you want to see the heights of the people in your sample? Run this line in the Console window:

height[sampledPeople]

This will give you the heights of the 50 people in your sample.

Why does that line give you the heights of the sampled people? Please answer this question in the box below.

In answering this question, you should be thinking about how you would do this by hand. You should then think about how to abstract that action. Finally, you should be able to see what R is doing in this single line of code.

Sample Statistics

Now that you have the randomly-selected records (people), you can calculate the mean height of those sampled people:

mean(height[sampledPeople])

When I ran this, I got 68.5. I am sure that you got a different number. This emphasizes the fact that the sample mean is dependent on the sample. Since the sample is random, so is the value of your sample mean.

Of course, if we wanted to, we could also calculate the variance, interquartile range, and median height of the sampled people:

var(height[sampledPeople])
IQR(height[sampledPeople])
median(height[sampledPeople])

In my sample, these are 25.23469, 7.75, and 68.5, respectively. Again, because the sample is random, your sample statistics will also be random and not necessarily agree with mine.

 

Note: In R, parentheses ( ) indicate that you are applying a function to some values (the values in the parentheses). The brackets [ ] indicate that you are subsetting the entire population and using only the indices of the sampled people.
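A tiny example may help with the difference between brackets and parentheses. The values below are made up (six hypothetical people), and the names toyHeight and toyRows are used only for this sketch.

```r
# Made-up heights for six people (not the lab's data)
toyHeight = c(61, 72, 66, 69, 75, 64)

# Suppose rows 2, 5, and 6 were the ones sampled
toyRows = c(2, 5, 6)

toyHeight[toyRows]         # 72 75 64: the heights in positions 2, 5, and 6
mean(toyHeight[toyRows])   # the function mean() applied to just that subset
```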

Part II: Cluster Sampling

Simple random sampling draws a random sample from the entire sampled population. Cluster sampling divides the population into groups that are similar to the population and samples from (at least) one cluster.

Which do you think is the better plan, given that you have sufficient resources? Should you sample from a single cluster or from multiple clusters? How come?

For this sampling scheme, we will cluster on Zip Code (town). In doing this, we are making the assumption that the average height in Galesburg is the same as in Monmouth. This assumption does seem reasonable to me. Later in this course, you will learn a method to test this assumption.

Clustering on Zip Code

In this example, there are two groups: Zip Codes 61401 and 61462, corresponding to the cities of Galesburg and Monmouth respectively. The following determines who is in each of those two Zip Codes:

galesburg = which(zipCode==61401)   ## Galesburg people
monmouth = which(zipCode==61462)   ## Monmouth people

These lines sample 10 people from Galesburg and 40 people from Monmouth.[1]

galSample = sample(galesburg, 10)
monSample = sample(monmouth, 40)

Calculating the Cluster Sample Estimate

The cluster sampling estimate is calculated from an unweighted average of the means in each cluster. Thus, the first step is to calculate the mean of each subsample,

galMean = mean(height[galSample])
monMean = mean(height[monSample])

then calculate the unweighted average of these separate means

( galMean + monMean ) / 2

For me, the cluster estimate is 70.2. Again, yours will likely differ.

Consequences of Clustering

This is an unweighted average. It does not matter how many people in my sample are from Galesburg or from Monmouth. We are simply averaging the two sub-means. This works because the grouping variable is not correlated with the measurement (dependent) variable… the population mean is the same for the two sub-populations.

At least, we assume the population mean is the same for the two sub-populations. If not, then this may not be the appropriate sampling method. Remember that cluster sampling requires that the grouping variable is independent of the variable of interest.
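To see what "unweighted" means in practice, consider this sketch with made-up numbers (two hypothetical groups of different sizes, not the lab's data). When the group means differ, averaging the two sub-means is not the same as averaging everyone together.

```r
# Made-up heights: 2 people in one group, 4 in the other
groupA = c(70, 72)
groupB = c(66, 67, 68, 69)

meanA = mean(groupA)   # 71
meanB = mean(groupB)   # 67.5

unweighted = (meanA + meanB) / 2      # 69.25: ignores the group sizes
pooled     = mean(c(groupA, groupB))  # about 68.67: every person counts once

unweighted - pooled    # not zero, because the group means differ
```

If the two groups truly share the same population mean, both calculations are estimating the same thing, which is the assumption cluster sampling relies on here.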

In the future (two-sample t-test), we will learn a method for testing this assumption. We are not there yet. We will be in the not-too-distant future, however.

Part III: Stratified Sampling

If the grouping variable is correlated with the dependent variable, then cluster sampling is not appropriate.

After rereading (and rethinking) the cluster sampling part, explain why cluster sampling is not appropriate when the grouping variable is correlated to what you are trying to estimate.

In such a case, stratified sampling is more appropriate. The process is largely the same as for cluster sampling. The only real difference is that the average at the end is an average weighted according to the population proportions.

For this part, we will group the population based on gender identification. Given this grouping variable, it does seem reasonable that cluster sampling will not be appropriate; there is a relationship between gender ID and height.

Here, we stratify on the gender ID of the person. Of all people in this particular population, all identify as either male or female.

mPeople = which(gender=="M")
fPeople = which(gender=="F")
mSample = sample(mPeople, 10)
fSample = sample(fPeople, 40)
mMean = mean(height[mSample])
fMean = mean(height[fSample])

Those lines should look familiar. You should be able to explain what each does.

Calculating the Stratified Sample Estimate

The stratified sampling estimate is calculated from a weighted average of the means in each stratum, weighted according to the proportion of each gender in the population. Thus, we need to know the proportion of the population (people I owe money to) that identify as male.

In general, it is entirely possible that we would know that proportion. The US Census would give that information to us for the population of the United States or of Illinois or of Galesburg. Unfortunately, they do not have that information for the population of people I owe money to. As such, we have to estimate it.

Guessing at a value, I estimate that proportion to be 75%.

propMale = 0.75
propMale*mMean + (1-propMale)*fMean

This estimates the average height, under the assumption that 75% of the people I owe money to identify as male. With this, the estimate I got is 69.5875 in. Again, yours may be different from mine. However, it should not be too different… by some definition of the term “too different.”

Consequences of the Stratified Sample Estimate

Note that the estimate depends on using the correct proportion. If that estimate is wrong, then the stratified sampling estimate is biased. Even worse, we will not know if the estimate will tend to be too high or too low.

Because of this, and in real life, a lot of resources are spent on estimating that proportion and on checking to see how “robust” the final estimate is to errors in the estimated proportion. For instance, if the real proportion of people I owe money to who identify as male is 25%, the final estimate is 67.5625 in.

propMale = 0.25
propMale*mMean + (1-propMale)*fMean

If the real proportion is 50%, then the estimated height is 68.575 in.

Is this estimated height highly dependent on the estimated male proportion? That is a decision for the scientist, not the statistician. The scientist must determine how precise the final estimate must be. For heights, a difference of only 2.0250 in. may not be that important (the difference between the estimate using 25% male and the estimate using 75% male).

However, it may be important. It all depends on what the researcher is ultimately doing with the numbers.
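The three estimates quoted above can be reproduced in one step. This sketch assumes the sub-sample means from my run (70.60 in. for those identifying as male and 66.55 in. for those identifying as female); your values of mMean and fMean will differ. The name estimates is made up for this illustration.

```r
# Sub-sample means from my run (yours will differ)
mMean = 70.60
fMean = 66.55

# Three candidate values for the proportion identifying as male
propMale = c(0.25, 0.50, 0.75)

# R applies the weighted-average formula to all three proportions at once
estimates = propMale*mMean + (1-propMale)*fMean
estimates                # 67.5625 68.5750 69.5875

diff(range(estimates))   # 2.025: the gap between the 25% and 75% estimates
```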

How much does it matter?

The following code creates the graphic showing how the estimated average height depends on the proportion of males to whom I owe money:

propMale = seq(0, 1, by=0.01)
avgHeight = propMale*mMean + (1-propMale)*fMean
plot(propMale, avgHeight, pch=20, col="orange",
     xlab="Proportion Male", ylab="Estimated Average Height")

[Figure: relationship between the estimated male proportion and the estimated average height]

From this graphic, we can clearly see that my estimated average height is going to be between 66.55 and 70.60 in., regardless of our estimate for the proportion of males to whom I owe money.

Again: Is that roughly 4-inch range significant? Again: That can only be answered by the scientist in us, not the mathematician. It ultimately depends on what we are using this mean for. Different uses will have different precision requirements.

We need to be aware of it, and it should help temper our research conclusions.

It is very interesting that we can determine the average male height in my sample is 70.60, and the average female height is 66.55. How can we determine this from the graphic?

Part IV: Systematic Sampling

Recall that systematic sampling starts with a randomly-selected person (record), then proceeds through the data in a regular fashion. Since the sample size is 50 and the population size is 1000, the proportion of records sampled is 50/1000 = 5%. Since 5% = 1/20, this means we sample 1 of every 20 records.

The following code accomplishes this. First, we select a random starting position

start = sample(20, 1)

In my case, I start at the 8th person on the list. Then, we create a sequence of values from that starting value until the end of the population

popSample = seq(start, 1000, 20)

The following questions are things you should be asking yourself whenever you see code. This is a good habit to get into, because it forces you to abstract what you are doing.

  • What does the start indicate?
  • Where did the 1000 come from?
  • Where did the 20 come from?

Again, knowing this helps you understand the seq function.
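One way to check your understanding of seq is to build the same sequence from the population and sample sizes. This is only a sketch: the names N, n, and step are made up for this illustration, and the starting value of 8 is the one from my run.

```r
N = 1000          # population size
n = 50            # sample size
step = N / n      # 20: one sampled record out of every 20

start = 8                         # starting record from my run (yours will differ)
popSample = seq(start, N, step)   # start, start+20, start+40, ... up to N

length(popSample)    # 50 records, as required
head(popSample, 3)   # 8 28 48
tail(popSample, 2)   # 968 988
```

Because start is between 1 and 20, the sequence always contains exactly 50 records, whichever start is drawn.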

Calculating the Systematic Sampling Estimate

My variable popSample now contains the values 8, 28, 48, 68, …, 968, 988. Next, we calculate the mean height of that sample

mean(height[popSample])

The estimate I get is 68.86 in. Again, your value will likely differ.

How much does it matter?

The following is a graphic of the estimated average height, as a function of the starting position. Graphics like this help you determine how sensitive the conclusions are to the starting point.

Here is the code I used:

estHeight = numeric()
for( start in 1:20) {
  popSample = seq(start, 1000, 20)
  estHeight[start] = mean(height[popSample])
}
plot(1:20, estHeight, pch=21, bg="tomato",
     xlab="Starting Person (Record)",
     ylab="Estimated Average Height [in.]" )

Here is the resulting graphic (tidied up a bit):

[Figure: effect plot of the estimated average height against the starting person (record)]

Again, there is a “wide” variation in the estimated average height. The estimated averages range from 67.94 to 69.96 in. (for me).

Is that 2-inch range significant? Again: That can only be answered by the scientist in us — not the statistician in us. We need to be aware of it.

This knowledge and understanding of uncertainty should always temper our research conclusions. Failure to do this will result in failing to find important relationships and succeeding in finding unimportant relationships.

Epistemology asks the fundamental question:

What is knowable?

As scientists, as mathematicians, and as statisticians, we should contemplate this essential question.

Part V: Convenience Sampling

The final sampling method is convenience sampling. As I have shared in class, this method borders on evil; unscientific results are presented as being scientific. Even if there is a disclaimer, the damage is done: The published results provide the most enduring image… an image that is a lie.

To perform this sampling method, ask the five people around you and find the mean of their heights. I did this and got 67.2 inches.

Part VI: Summary of My Results

In summary, the above results are

Method                   My Estimates [in.]
Simple Random            68.50
Cluster                  70.20
Stratified (75% Male)    69.59
Systematic               68.86
Convenience              67.20

And so, which is the correct answer?

None of them are exactly correct.

Since we have the entire population, we can determine the population mean, μ = 69.054 in. This is one of few times when we can compare our estimates with reality.
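To make that comparison concrete, the following sketch computes how far each of my estimates in the table above falls from μ. The names myEstimates, mu, and estError are made up for this illustration, and the numbers are from my single run only.

```r
# My estimates from the summary table, in inches
myEstimates = c(SimpleRandom = 68.50, Cluster = 70.20,
                Stratified75 = 69.59, Systematic = 68.86,
                Convenience = 67.20)
mu = 69.054   # population mean height

estError = myEstimates - mu   # signed error: negative means the estimate is too low
round(estError, 3)

names(which.min(abs(estError)))   # the method that landed closest in this one run
```

Keep in mind that this compares one random sample per method. A different seed would give different errors, so being closest in a single run does not make a method the best one.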

Of course, this raises a very interesting question: Why are my estimates all wrong???

Finally

In the future, we will return to this lab and the estimates. We will ask questions about the distribution of sample means, about bias, and about stability. And so, to help that discussion, please post your completed table to this Google Form. For the stratified sampling estimate, use the one you got from supposing the proportion of males is 75%.

Remember to re-run your entire script before completing (and submitting) the above table. This will ensure that your numbers match what they are supposed to match.

The Post-Lab

That is all there is to this first laboratory activity. Note that the lab supports the post-lab, but does not echo it. If you understand the lab (are able to teach it to others), then the post-lab will be clearer. The post-lab tests your ability to think through (ask self-questions about) the lab, what it illustrates about randomness, and how it ties into the readings and lectures in the course. Make no mistake. The post-lab questions will be at the higher levels of Bloom’s Taxonomy. You will need to think through the lab activity you just completed, perhaps working through and contemplating it a few times.

This is key:

Learning takes multiple attempts… and a lot of effort.

The post-lab consists of three questions that focus on the concepts of the lab. If you fully understand the lab, then you will be able to completely answer the post-lab. If the questions on the post-lab cause you to pause, then you should go back over the lab and think again about what you are doing in the lab. Think of the post-lab as a chance for you to test your real understanding of the material. As a student, you should be focused on understanding.

And so, for this post-lab, please answer the following questions. Each numbered part should be a separate paragraph (at least one paragraph). Do not just answer the questions. Make the paragraphs coherent and provide proof in the form of statistics.

  1. What are the differences between cluster sampling and stratified sampling? Be sure to include when each should be used, how the final estimate is made, and how the two estimates differed in your sample.
  2. The population mean height is μ = 69.054 in. Which method estimated the population parameter best? Were the others “close enough”? Explain how you decided this. If you answer “it depends,” then you need to be clear about what it depends on.
  3. In Laboratory Activity D, you will be examining the behavior of estimators. Keep aware of this. By the end of that activity, you will be able to determine which of these five sampling methods is best under conditions that you specify. And so, before you get to that lab, which of the five sampling methods do you think is best? Explain fully.

Remember that the post-lab is based on correctness as well as your ability to express yourself well. Spend time making sure that there are no errors. Include actual values from your analysis to support each of your answers. This last is important. Without including the statistics you calculated, you are not grounding your answers in reality, and your grade will reflect that.

Your first page is the title page. Your title page should include your name, the lab title, the date of the lab, and your Knox ID. After you answer the above three questions, start a new page and start your code appendix. In that appendix, include your script from this lab, properly commented.

College is not high school. You need to take the next step in terms of presentation and explanation. It is better to explain things too much than not enough. Take this comment seriously. You have invested a lot of time and money in your education. Do not cheat yourself. Put in the effort to make it all worthwhile.

Finally, if you have not already done so, please post your completed table to this Google Form. This will allow us to compare our results and better understand the randomness inherent in our choice of sampling scheme. For the stratified sampling estimate, use the one you got from supposing the proportion of those identifying as male is 75%.

Footnotes



[1] There is absolutely no reason for me to select 10 and 40 unless I am very sure about the average height of those in Galesburg and relatively uncertain about those in Monmouth. With no prior information, I would/should sample 25 from each sub-population (group). Here, I chose 10 and 40 to help you see where those subsample sizes are implemented in the code.