
SCA 02a: Measures of Center

Purpose

The purpose of this activity is to show you how to have R calculate the sample statistics for you. Remember that the computer is useful for doing the calculations. You are useful for deciding which statistics to use and how to interpret them. FOCUS ON THAT.

Functions

In this SCA, we will be using the following functions in R. It is useful to keep track of where you were introduced to the functions. By the end of the SCA, you should be able to explain what these functions do.

Those functions with a * are only available if this line is run in your script:
source("http://rfs.kvasaheim.com/stat200.R")
Note that you should not include the * when using those functions.

 


The SCA Procedure

Doing real statistics requires actually doing statistics on real data. That means using a computer and a statistical program. After trying many different programs, R matches my needs as an analyst most closely. It also allows me to easily check your understanding of statistical techniques because it requires you to provide the script (a.k.a. you show your work).

Part 0: The Preparations

The following are common start-up instructions. You will want to always follow them when starting analyses.

  1. Start R and open a new script.
  2. Now, since we will be using some special R functions that do not exist in the base R package, we will need to import them. Making sure you have an Internet connection, run this line:
    source("http://rfs.kvasaheim.com/stat200.R")
    When you run this line, R goes to the URL you specified and runs that code. Here, the code only imports several helpful functions. From now forward, I will assume you run this line for every script in this course.
  3. Load the “crime data set” using the following two lines.
    dt = read.csv("http://rfs.kvasaheim.com/data/crime.csv")
    attach(dt)
    The first line loads the data into the variable dt. The second line “attaches the data,” which makes it easier for us to access the variables in the data file.

That is the end of the zeroth part. All analyses in this course will start in a similar way: the source line is run to give R more functionality, the data are loaded into memory using the read.csv function, and the data are attached to make it easier to access the variables in the data file.
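To see what “attaching” does without needing an Internet connection, here is a minimal sketch using a made-up data frame. The data frame toy and its values are hypothetical stand-ins, not the crime data set.

```r
# Made-up stand-in for a data file; these values are for illustration only.
toy = data.frame(
  state  = c("A", "B", "C"),
  enroll = c(90, 92, 95)
)

# Before attaching, a variable is reached through the data frame.
before = toy$enroll

# After attaching, the variable name works on its own.
attach(toy)
after = enroll

# detach() undoes attach(); handy before switching to another data set.
detach(toy)

identical(before, after)
```

Both ways of reaching the variable return the same values. Attaching only saves you from typing the data frame name each time.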

Part I: Measures of Center (of Central Tendency, of Typicality)

This first part looks at how to calculate several measures of center.

  1. The first measure of center we will calculate is the mode. It is the correct measure of center for categorical variables. Recall that the mode is the level with the highest frequency. As such, we can simply use the table function from the last activity. Alternatively, you can use the modal function.
  2. The mode of the census4 variable is South, as both of these will tell you:
    table(census4)
    modal(census4)
  3. The difference between these two functions is that table produces a frequency distribution for all levels in the variable. The modal function only returns the modal category.

    Note that the function to calculate the mode is not mode. The function mode returns the variable’s computational type according to R. The type returned has to do with how the computer sees the variable, not how we see it. Feel free to ignore the mode function. We will never use it.

  4. The mode is excellent for categorical variables. But, what about numeric variables? Well, we have two main options: the median and the mean. Here is the median school enrollment at the state level in 2000:
    median(enroll00)
    Here is the mean of the same variable:
    mean(enroll00)
    The output should tell you that the median is 91 and the mean is 91.74706 percent enrollment.
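The steps above can be tried on a small example using only base R, without the course functions. This is a sketch on made-up data. The vectors region and pct are hypothetical stand-ins for census4 and enroll00, and picking the top level with which.max is one way to find a mode, not necessarily how modal is written.

```r
# Made-up categorical data standing in for a variable such as census4.
region = c("South", "West", "South", "Midwest", "Northeast", "South", "West")

# table() gives the frequency of every level.
freq = table(region)
freq

# The mode is the level with the highest frequency.
# which.max() returns the first maximum, so ties are not reported.
modal_level = names(freq)[which.max(freq)]
modal_level

# mode() reports how R stores the variable, not the statistical mode.
storage = mode(region)
storage

# Made-up numeric data standing in for a variable such as enroll00.
pct = c(88, 90, 91, 91, 93, 99)
med_pct = median(pct)
mean_pct = mean(pct)
c(median = med_pct, mean = mean_pct)
```

Here the modal level is South, mode(region) returns "character", and the median and mean are close but not equal (91 and 92).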

Part II: Mean or Median?

There is a tendency to calculate and report as few statistics as possible. Personally, I find this limiting, because statistics tell us something about the variable. However, to support this expectation, I provide this section.

Please note that this section assumes knowledge not provided at this point in the course. As such, it is light on explanation. The actual discussion on this takes place in the “Measures of Dispersion” SCA.

  1. How do we know when we should use the mean and when we should use the median? It all depends on how skewed the data are. If the data are too skewed, the median (and IQR) should be used. If the data are not too skewed then the mean (and standard deviation) can be used. Remember that the mean is easier to work with in terms of mathematics and probabilities. Thus, we should use it when we can. However, if the data are “too skewed,” it does not truly represent the “typical data value” and the median should be used.
  2. So, how do we know if the data are too skewed? We can use the Hildebrand rule. Prof. Hildebrand created this rule of thumb in 1986 to distinguish a variable that is too skewed from one that is not.
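  3. The Hildebrand rule is based on the ratio of the difference between the mean and the median to the standard deviation, \(H = (\bar{x} - \tilde{x})/s\), where \(\bar{x}\) is the sample mean, \(\tilde{x}\) is the sample median, and \(s\) is the sample standard deviation. It is a scaled measure of the difference between the mean and the median, and it is positive when the mean lies above the median. If the magnitude of H is less than 0.20, that is, if \(|H| < 0.20\), then the variable is sufficiently symmetric.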
  4. For instance, the median household income in 1990 is not too skewed:
    hildebrand.rule(medhhd90)
    Its ratio is +0.036, which is between -0.20 and +0.20.

    On the other hand, the initiative use in the 1990s is skewed:
    hildebrand.rule(inituse)
    It has a significant positive skew with a ratio of +0.561.
  5. Thus, one “should” use the mean and standard deviation to describe the 1990 median household income. One “should” use the median and IQR to describe the number of citizen initiatives in the 1990s.
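To make the rule concrete, here is a minimal base-R sketch of the ratio. It assumes the ratio is (mean - median) / standard deviation, which matches the positive sign reported for a positive skew above. The helper name hildebrand_ratio and the two data vectors are hypothetical, and the course function hildebrand.rule may be implemented differently and may report more than this.

```r
# Assumed form of the ratio: (mean - median) / standard deviation.
# hildebrand_ratio is a hypothetical helper, not the course function.
hildebrand_ratio = function(x) {
  x = x[!is.na(x)]                  # drop missing values first
  (mean(x) - median(x)) / sd(x)
}

# Made-up data: one symmetric variable, one with a long right tail.
symmetric = c(1, 2, 3, 4, 5)
skewed    = c(1, 1, 1, 2, 10)

h_sym  = hildebrand_ratio(symmetric)
h_skew = hildebrand_ratio(skewed)

# TRUE means "sufficiently symmetric" under the |H| < 0.20 rule of thumb.
abs(h_sym)  < 0.20
abs(h_skew) < 0.20
```

For the symmetric variable the mean and median coincide, so the ratio is 0 and the mean and standard deviation are reasonable summaries. For the skewed variable the single large value pulls the mean above the median, the ratio exceeds 0.20, and the median and IQR are the better choice.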

Conclusion

That’s the SCA. Review the objectives and the R functions listed at the top of this SCA.

This page was last modified on 2 January 2024.
All rights reserved by Ole J. Forsberg, PhD, ©2008–2024. No reproduction of any of this material is allowed without explicit written permission of the copyright holder.