IS‘25: IS: SCA 02c

The SCA Procedure

Doing real statistics requires actually doing statistics on real data. That means using a computer and a statistical program. After trying many different programs, I have found that R matches my needs as an analyst most closely. It also allows me to easily check your understanding of statistical techniques because it requires you to provide the script (that is, you show your work).

Part 0: The Preparations

The following are common start-up instructions. You will want to always follow them when starting analyses.

Start R and open a new script.
Now, since we will be using some special R functions that do not exist in the base R package, we will need to import them. Making sure you have an Internet connection, run this line:
source("http://rfs.kvasaheim.com/stat200.R")
When you run this line, R goes to the URL you specified and runs that code. Here, the code only imports several helpful functions. From now forward, I will assume you run this line for every script in this course.
Load the “crime data set” using the following two lines.
dt = read.csv("http://rfs.kvasaheim.com/data/crime.csv")
attach(dt)
The first line loads the data into the variable dt. The second line “attaches the data,” which makes it easier for us to access the variables in the data file.

That is the end of the zeroth part. All analyses in this course will start in a similar way. The source line is run to give R more functionality, the data are loaded into memory using the read.csv function, and the data are attached to make it easier to access the variables in the data file.
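To see what attaching does without needing an Internet connection, here is a minimal sketch using a small made-up data frame. The name toy and its values are assumptions for illustration only; they are not part of the crime data.

```r
# A small, made-up data frame standing in for a loaded data file
toy = data.frame(state = c("A", "B", "C"),
                 rate  = c(10, 20, 60))

# Before attaching, a column is reached through the data frame
mean(toy$rate)

# After attaching, the column name can be used directly
attach(toy)
mean(rate)

# Detach when finished so the names do not linger on the search path
detach(toy)
```

Both calls to mean work on the same column; attaching only changes how the column is named in the script.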

Part I: Measures of Spread (of Variation, of Uncertainty)

In the previous part, we looked at how to calculate the three usual measures of center. In this section, we will look at how to calculate some measures of spread.

When the data are numeric, we can calculate the following measures of spread: IQR, standard deviation, and variance. Note that the standard deviation is always the square root of the variance. We will verify that here.
var(vcrime90)
This tells us that the variance of the violent crime rate in 1990 is 151635.9. Knowing the variance is not too useful because its units (the square of the measurement units) are difficult to interpret. Because the standard deviation has the same units as the measurements, it is preferred when presenting results to people. Thankfully, there are several ways to calculate the standard deviation of data. Here are three:
- sd(vcrime90)
- var(vcrime90)^0.5
- sqrt(var(vcrime90))
They are all equivalent in terms of calculations. I prefer the first because it is easier for me to type. Plus, it is clear that one is calculating the standard deviation.
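The relationship between the variance and the standard deviation can be checked on any numeric vector. The sketch below uses a short made-up vector x (an assumption for illustration, not the crime data) and also computes the sample variance from its definition, the sum of squared deviations divided by n - 1.

```r
# Made-up values for illustration only (not from the crime data)
x = c(2, 4, 4, 4, 5, 5, 7, 9)

# Sample variance from its definition: squared deviations divided by n - 1
n = length(x)
v_manual = sum((x - mean(x))^2) / (n - 1)

# The built-in functions
v_builtin = var(x)
s_builtin = sd(x)

# Check that the hand calculation agrees with var()
all.equal(v_manual, v_builtin)

# Check that sd() is the square root of var()
all.equal(s_builtin, sqrt(v_builtin))
```

Using all.equal rather than == avoids false alarms from tiny floating-point differences.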
Another measure of spread is the interquartile range (IQR). The IQR is defined as the difference between the third and first quartiles in the data. It measures the width of the middle 50% of the data. Here are two ways of having R calculate the IQR of a data set:
- quartiles(vcrime90,3)-quartiles(vcrime90,1)
- IQR(vcrime90)
The first illustrates the actual calculations of the IQR (third quartile minus the first). However, I use the second… again because it is easier for me to type. And so, the IQR of the violent crime rate in 1990 is 422.05.
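The same "third quartile minus first quartile" calculation can be sketched with base R alone. The example below uses the base function quantile in place of quartiles (which I assume comes from the sourced course file) and a made-up vector x. Different quartile definitions can give slightly different answers, so this is an illustration of the idea rather than a reproduction of the course function.

```r
# Made-up values for illustration only (not from the crime data)
x = c(2, 4, 4, 4, 5, 5, 7, 9)

# First and third quartiles using base R's default quantile definition
q = quantile(x, probs = c(0.25, 0.75))

# IQR by hand: third quartile minus first quartile
iqr_manual = unname(q[2] - q[1])

# IQR using the built-in function
iqr_builtin = IQR(x)

# Check that the two agree
all.equal(iqr_manual, iqr_builtin)
```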

Part II: Mean or Median?

There is a tendency to calculate and report as few statistics as possible. Personally, I find this limiting, because statistics tell us something about the variable. However, to support this expectation, I provide this section.

How do we know when we should use the mean and when we should use the median? It all depends on how skewed the data are. If the data are too skewed, the median (and IQR) should be used. If the data are not too skewed then the mean (and standard deviation) can be used. Remember that the mean is easier to work with in terms of mathematics and probabilities. Thus, we should use it when we can. However, if the data are “too skewed,” it does not truly represent the “typical data value” and the median should be used.
So, how do we know if the data are too skewed? We can use the Hildebrand rule. In 1986, Prof. Hildebrand created this rule of thumb to distinguish a variable that is too skewed from one that is not.
The Hildebrand rule is based on the ratio of the difference between the mean and the median to the standard deviation. It is a scaled measure of the difference between the mean and the median:
$$ H = \frac{\bar{x} - \tilde{x}}{s} $$
According to Hildebrand, if this ratio is less than -0.20, then the variable is “too skewed to the left.” If this ratio is greater than 0.20, then the variable is “too skewed to the right.” If the magnitude of H is less than 0.20, that is if $|H| < 0.20$, then the variable is sufficiently symmetric.
For instance, the median household income in 1990 is not too skewed:
hildebrand.rule(medhhd90)
Its ratio is +0.036, which is between -0.20 and +0.20.

On the other hand, the initiative use in the 1990s is skewed:
hildebrand.rule(inituse)
It has a significant positive skew with a ratio of +0.561.
Thus, one “should” use the mean and standard deviation to describe the 1990 median household income. One “should” use the median and IQR to describe the number of citizen initiatives in the 1990s.
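To make the rule concrete, here is a minimal sketch that computes the ratio by hand, using the sign convention in which a positive ratio indicates a right skew (as in the interpretation above). The helper hildebrand_ratio and both vectors are hypothetical and made up for illustration; this is not the course's hildebrand.rule function, whose output may be formatted differently.

```r
# Hypothetical helper (not the course's hildebrand.rule function):
# (mean - median) / standard deviation, so right skew gives a positive ratio
hildebrand_ratio = function(x) {
  (mean(x) - median(x)) / sd(x)
}

# Made-up data: one symmetric vector and one with a long right tail
symmetric    = c(1, 2, 3, 4, 5, 6, 7)
right_skewed = c(1, 1, 2, 2, 3, 4, 30)

h_sym  = hildebrand_ratio(symmetric)
h_skew = hildebrand_ratio(right_skewed)

# Apply the 0.20 cutoff
abs(h_sym) < 0.20   # is the symmetric vector within the cutoff?
h_skew > 0.20       # is the long-tailed vector too skewed to the right?
```

For the symmetric vector the mean and median coincide, so the mean and standard deviation describe it well; for the long-tailed vector the single large value pulls the mean away from the median, so the median and IQR are the safer summary.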

Ole J. Forsberg, PhD, Associate Professor, Chair of Data Sciences, Knox College. Office: SMC E-219 (and OM 105). 2 East South Street, Campus Box K-6, Galesburg, IL, USA, 61401-4999