
SCA 07c: Bootstrapping the Data

Purpose

This SCA is also a bit different from our past activities. Previously, you spent your time being exposed to R functions and how to manipulate them. If you have been following along, you now have the skills to generate random samples from data (pretending it is the population), calculate sample statistics from those samples, and calculate probabilities (both approximately and exactly).

This activity has you examine the distribution of the sample mean from data. That means you will need to…

generate a random sample from the observations (with replacement),
calculate the sample mean for that sample,
save that value into a variable for further use, and
repeat these steps many, many, many times.

At the end, you will have a variable filled with means calculated on different samples.

That process sure sounds familiar!

Functions

In this SCA, we will be using the following functions in R. It is useful to keep track of where you were introduced to the functions. By the end of the SCA, you should be able to explain what these functions do. Be clear that these functions allow us to modify the most minute details of graphics produced by R. When performing the same analyses over and over again, requiring a new graphic each time, these are very important!

for
sample
dir
numeric
overlay*
names
read.csv
mean
shapiroTest*

Those functions with a * are only available if this line is run in your script:
source("http://rfs.kvasaheim.com/stat200.R")
Note that you should not include the * when using those functions.

The SCA Procedure

As usual, here is the procedure. Make sure you understand that the goal is not for you to get to the finish line. The goal is for these steps to help you better understand how to achieve the purposes and goals listed above.

Racing through these without savoring them wastes your precious time.

Part O: Importing the Data

As you have already discovered, statistics is not just learning about the sample. More importantly, it concerns learning something about the population. To learn about the population, we calculate statistics on the sample and use our knowledge of the distribution of that statistic to better understand the population parameter.

This part examines the distribution of the sample mean. The process can easily be generalized to different data-generating functions and different statistics. We will see this later in this SCA.

Start R and open a new script in your SCA7 folder. Title this script “sca07c.R” or something like that.
Now, since we will be using some special R functions that do not exist in the base R package, we will need to import them. Making sure you have an Internet connection, run this line:

source("http://rfs.kvasaheim.com/stat200.R")

When you run this line, R goes to the URL you specified and runs that code. Here, the code only imports several helpful functions. From now forward, I will assume you run this line for every script in this course.
Next, since we will be working with data, let’s load the data:


dt = read.csv("http://rfs.kvasaheim.com/data/geography.csv")
attach(dt)

These data are the results of a previous class taking a geography quiz. Let’s learn about the distribution of the sample mean. Think back to Statistical Laboratory Activity C. Regardless of the distribution of the data, what distribution will the sample means approach as the sample size increases?
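One way to explore that question before touching the real data is a small simulation. The sketch below is only an illustration under stated assumptions: pop, sample_means, and skewness are names invented here, the skewed population is made up (it is not the geography data), and the skewness measure is a simple moment-based one.

```r
# Hypothetical right-skewed "population" (NOT the geography data)
set.seed(200)
pop = rexp(500, rate = 1)

# Draw `reps` sample means, each from a sample of size n (with replacement)
sample_means = function(n, reps = 2000) {
  replicate(reps, mean(sample(pop, size = n, replace = TRUE)))
}

# Simple moment-based skewness; close to 0 for a symmetric distribution
skewness = function(v) mean((v - mean(v))^3) / sd(v)^3

# Skewness of the sample means for three sample sizes
sizes = c(2, 10, 50)
skews = sapply(sizes, function(n) skewness(sample_means(n)))
names(skews) = paste0("n=", sizes)
skews
```

Compare the three values. What happens to the skewness of the sample means as the sample size grows, and what does that suggest about the shape of their distribution?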

Part I: Getting to Know Your Data

If you would like to see the entire data set, run

dt

If you want to see just the variable names, run


names(dt)

If you want a summary of each variable, run


summary(dt)

If you just want a summary of the Score variable, run


summary(Score)

If you want some sample statistics for the center, run


modal(Score)
mean(Score)
median(Score)

If you want to know if the data are too skewed, run


hildebrand.rule(Score)

If you want some measures of spread, run


sd(Score)
IQR(Score)

Since the possible values are few, we could tabulate the outcomes:


table(Score)

We can also get a graphical look:


hist(Score)

Or, since the possible outcomes are few and equally spaced, we could use a bar chart

barplot(table(Score))

That bar chart represents the data better (and more easily) than the histogram. Why?
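As a hint toward that question, here is a minimal sketch of how histogram bins treat whole-number data. The vector demo_scores is made up for illustration (it is not the quiz data), and the two sets of breaks are choices made here, not something the course functions do for you.

```r
# Hypothetical whole-number scores (NOT the geography data)
demo_scores = c(0, 0, 1, 1, 1, 2, 2, 3, 4, 5)

# Bins whose edges sit ON the whole numbers: 0 and 1 share the first bin
h_edges = hist(demo_scores, breaks = 0:5, plot = FALSE)
h_edges$counts

# Bins centered on the whole numbers: one bin per possible score
h_centered = hist(demo_scores, breaks = seq(-0.5, 5.5, by = 1), plot = FALSE)
h_centered$counts

# The tabulated counts that a bar chart would display
table(demo_scores)
```

Which set of counts matches the table? Remove plot = FALSE to see the corresponding pictures.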

Part II: The Distribution of the Sample Mean

The Sandlot from SCA 07a looks like this:


st = numeric()

for(i in 1:1e3) {
  x = rexp(10, rate=3)
  st[i] = mean(x)
}

By now, you should be able to state what each line does and why it is important to estimating the distribution of the sample mean.
This code will help you see the distribution of the sample mean of size n = 10, where the data come from an Exponential distribution with rate parameter λ = 3.
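If you want to check that simulation against theory: an Exponential distribution with rate λ has mean 1/λ and standard deviation 1/λ, so the mean of n independent draws has mean 1/λ and standard deviation 1/(λ√n). A minimal sketch of that comparison follows. The name st_check, the seed, and the choice of 10,000 repetitions are assumptions made here and are not part of the Sandlot.

```r
set.seed(7)
n = 10          # sample size
lambda = 3      # rate parameter of the Exponential distribution

st_check = numeric(1e4)            # room for 10,000 sample means
for(i in 1:1e4) {
  st_check[i] = mean(rexp(n, rate = lambda))
}

# Simulated values next to the theoretical ones
c(simulated = mean(st_check), theory = 1/lambda)
c(simulated = sd(st_check),   theory = 1/(lambda*sqrt(n)))
```

The simulated and theoretical values should be close, and they should get closer as the number of repetitions grows.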
This is not exactly what we want to do, however. We want to draw our samples from the scores on the Geography Pop-Quiz. Everything else in that code will remain the same. All that changes is how we draw our sample.


st = numeric()

for(i in 1:1e3) {
  x = sample(Score, size=18, replace=TRUE)
  st[i] = mean(x)
}

What is different between this listing and the previous one? What does size=18 mean? (Check out what is stored in x.) Why is replace=TRUE? What happens if we use replace=FALSE? Try it out to see. Think about the meaning of the word replace. Ask if you are not sure.
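If you would like to see the effect of replace in a setting small enough to inspect by hand, here is a sketch on made-up data. Everything named demo_scores, n_demo, means_with, and means_without is hypothetical. The sample size is deliberately set equal to the number of observations, and whether the real data behave the same way depends on how many scores the file contains.

```r
set.seed(18)
demo_scores = c(0, 0, 1, 1, 1, 2, 2, 3, 4, 5)   # hypothetical, NOT the quiz data
n_demo = length(demo_scores)

means_with    = numeric(1000)
means_without = numeric(1000)

for(i in 1:1000) {
  # With replacement: a value may be drawn more than once
  means_with[i]    = mean(sample(demo_scores, size = n_demo, replace = TRUE))
  # Without replacement: each value is used exactly once (a reshuffle)
  means_without[i] = mean(sample(demo_scores, size = n_demo, replace = FALSE))
}

sd(means_with)      # the means vary from sample to sample
sd(means_without)   # essentially zero: every "sample" is the same ten values
```

Note also that sample gives an error if you ask for more values than the data contain while using replace=FALSE.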
Now that we have the sample means stored in the variable st, we can analyze them in the same ways we analyzed the sample means last week:


# The raw sample means
st

# The distributions
overlay(Score)                    # of data
overlay(st)                       # of sample means

# Tests of Normality
shapiroTest(Score)                # of data
shapiroTest(st)                   # of sample means

# Two intervals
quantile(Score, c(0.025,0.975))   # Observation interval
quantile(st, c(0.025,0.975))      # Confidence interval

Pay close attention to the last two lines, the lines calculating the observation interval and the confidence interval. How are they different? This gets at the heart of the difference between an observation interval and a confidence interval.

Interpretation 1: Here is the histogram of the scores overlaid with a Normal curve.

[normoverlay]

Note that the scores do not look Normal at all. The 95% observation interval is from 0.0 to 4.6. This means that 95% of the observed values are between 0.0 and 4.6 (inclusive). That is what an observation interval tells us.

Interpretation 2: Here is the histogram of those sample means overlaid with a Normal curve.

[normoverlay]

Note that the sample means look much more Normal than do the original measurements (Score). This is a consequence of the Central Limit Theorem.

In addition to this observation, what else do we know from this analysis? We also are 95% confident that the population mean is between 1.2 and 2.8 (confidence interval).

Key: The “observation interval” concerns values already observed. The “confidence interval” concerns the population parameter.
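To see that distinction in numbers, here is a minimal sketch on made-up data. The vector demo_scores and the names boot_means, obs_int, and conf_int are hypothetical, so the intervals it prints are not the ones reported above for the geography quiz.

```r
set.seed(95)
demo_scores = c(0, 0, 1, 1, 1, 2, 2, 3, 4, 5)   # hypothetical, NOT the quiz data

# Bootstrap the sample mean, as in Part II
boot_means = numeric(1000)
for(i in 1:1000) {
  boot_means[i] = mean(sample(demo_scores, size = length(demo_scores), replace = TRUE))
}

# Observation interval: where the middle 95% of the observed values lie
obs_int  = quantile(demo_scores, c(0.025, 0.975))

# Confidence interval: where the middle 95% of the bootstrapped means lie
conf_int = quantile(boot_means, c(0.025, 0.975))

obs_int
conf_int

# Compare the widths of the two intervals
unname(diff(obs_int)) > unname(diff(conf_int))
```

The observation interval describes the spread of individual values, while the confidence interval describes the spread of the sample mean, so expect the second to be narrower.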


Ole J. Forsberg, PhD, Associate Professor, Chair of Data Sciences, Office: SMC E-219 (and OM 105), Knox College, 2 East South Street, Campus Box K-6, Galesburg, IL, USA, 61401-4999