IS‘25: IS: SCA 03a

The SCA Procedure

Doing real statistics requires actually doing statistics on real data. That means using a computer and a statistical program. After trying many different programs, I have found that R matches my needs as an analyst most closely. It also allows me to easily check your understanding of statistical techniques because it requires you to provide the script (a.k.a. to show your work).

Part O: The Preparations

The following are common start-up instructions. Follow them every time you start an analysis.

Start R and open a new script in your SCA2 folder.
Now, since we will be using some special R functions that do not exist in the base R package, we will need to import them. Making sure you have an Internet connection, run this line:
source("http://rfs.kvasaheim.com/stat200.R")
When you run this line, R goes to the URL you specified and runs the code it finds there. Here, the code only imports several helpful functions. From now on, I will assume you run this line for every script in this course.
Load the “crime data set” using the following two lines.
dt = read.csv("http://rfs.kvasaheim.com/data/crime.csv")
attach(dt)
The first line loads the data into the variable dt. The second line “attaches the data,” which makes it easier for us to access the variables in the data file.

That is the end of the zeroth part. All analyses in this course will start in a similar way. The source line is run to give R more functionality, the data are loaded into memory using the read.csv function, and the data are attached to make it easier to access the variables in the data file.
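To see what read.csv and attach each do without needing the course server, here is a minimal sketch. The inline CSV text and the names csvText, toy, state, and region are made up for illustration; they are not part of the crime data set.

```r
# A tiny made-up CSV, typed inline so no Internet connection is needed.
# (Hypothetical stand-in; the real course data come from crime.csv.)
csvText = "state,region
A,South
B,West
C,South"

# read.csv turns the text into a data frame, just as it does with a URL
toy = read.csv(text = csvText)

# Before attaching, a column must be reached through the data frame
toy$region

# After attaching, the column name alone is enough
attach(toy)
region

# Detach when finished so the names do not linger in later work
detach(toy)
```

Attaching is a convenience. If two attached data sets share a variable name, the one attached later hides the earlier one, so it is safest to attach only one data set at a time.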

Part I: Frequencies

The first thing we will do is tabulate one of the categorical variables. Let us select the variable census4. This variable measures the location of the state in terms of the four census regions.

To create a frequency tabulation, type:
table(census4)
That is all. Once you run that, you should get the following output:
census4
  Midwest Northeast     South      West
       12         9        17        13
Thus, from this tabulation, there are 12 states in the Midwest, 9 in the Northeast, 17 in the South, and 13 in the West.
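The same counts can also be read as proportions. The sketch below rebuilds the variable from the counts printed above, so it runs without the data file, and passes the tabulation to prop.table, a base R function not otherwise used in this activity. The name census4Demo is a hypothetical stand-in chosen so that it does not hide the attached census4.

```r
# Rebuild the variable from the counts shown above
# (census4Demo is a stand-in for the attached census4 column)
census4Demo = rep(c("Midwest", "Northeast", "South", "West"),
                  times = c(12, 9, 17, 13))

# Frequency tabulation, as in the text
tab = table(census4Demo)
tab

# Total number of observations: 12 + 9 + 17 + 13 = 51
sum(tab)

# Relative frequencies: each count divided by the total
round(prop.table(tab), 3)
```

With the real data attached, the equivalent line would be prop.table(table(census4)).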
Now, let’s create a tabulation of the categorical variable domPolCulture. This variable classifies each state by its dominant political culture. There are three: individualistic, moralistic, and traditionalistic. According to Salazar (1976), these categories help to explain why certain states pass certain types of laws.

If you are correct, you should get this output:
domPolCulture
 Individualistic       Moralistic Traditionalistic
              17               17               17
What does this output tell us? Of the states, Salazar classified 17 as individualistic, 17 as moralistic, and 17 as traditionalistic.

Part II: Univariate Graphics for Categorical Variables

The first part showed how to tabulate a categorical variable. There are many options available for performing different types of tabulations. However, the basic table function serves our needs for now. I do, however, encourage you to delve into the help file on this powerful function by running the code ?"table" and its close relative ?"tabulate".

In this part, we will use the table function in conjunction with a couple of other functions to produce some nice (default) graphics that illustrate the categorical variable.

Let us now create a basic bar chart of the categorical variable census4.
barplot(table(census4))
This bar plot is a graphical representation of the tabulation you did previously. If you would like to change the colors from grey to something else, all you have to do is specify those colors:
barplot( table(census4), col=c("red","navy","darkgreen","honeydew2") )
To do a pie chart, you use the function pie.
pie( table(census4), col=c("red","navy","darkgreen","honeydew2") )
Note that there are four colors specified, because there are four categories (levels) in the variable census4. What would you run to create a simple bar plot of the domPolCulture variable? If you ran the right line, your bar plot should look like the one here.

With additional work, you can make it look like this:
Note that the heights of the bars represent the frequency of that level.
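As one possible version of that "additional work," here is a sketch that adds a title, axis labels, and the frequency above each bar. It uses the census4 counts from Part I rather than domPolCulture, so the exercise is left for you. The name census4Demo, the title, the color, and the axis limit are my own choices, not the settings behind the figure above.

```r
# Rebuild the census4 counts from Part I (stand-in for the attached data)
census4Demo = rep(c("Midwest", "Northeast", "South", "West"),
                  times = c(12, 9, 17, 13))
tab = table(census4Demo)

# barplot returns the horizontal midpoints of the bars,
# which are handy for placing labels
mids = barplot(tab,
               col  = "steelblue",
               main = "States by Census Region",
               xlab = "Census region",
               ylab = "Number of states",
               ylim = c(0, 20),   # leave room above the tallest bar
               las  = 1)          # horizontal axis labels

# Write each frequency just above its bar
text(x = mids, y = as.vector(tab), labels = as.vector(tab), pos = 3)
```

Each of these arguments is optional. Start from the plain barplot call and add one argument at a time to see what it changes.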
Write the code to get a pie chart of this variable. If you are right, here is the pie chart you will see:

Again, with some additional work, you can make it look like this:

or like this:
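A pie chart can be dressed up in the same spirit. The sketch below labels each slice with its level name and percentage, again using the census4 counts from Part I as a stand-in. The names census4Demo, pct, and sliceLabels are hypothetical, and the styling is only one of many possibilities.

```r
# Rebuild the census4 counts from Part I (stand-in for the attached data)
census4Demo = rep(c("Midwest", "Northeast", "South", "West"),
                  times = c(12, 9, 17, 13))
tab = table(census4Demo)

# Percentage of states in each region, rounded to whole numbers
pct = as.vector(round(100 * prop.table(tab)))

# Build labels such as "South (33%)"
sliceLabels = paste0(names(tab), " (", pct, "%)")

pie(tab,
    labels = sliceLabels,
    col    = c("red", "navy", "darkgreen", "honeydew2"),
    main   = "States by Census Region")
```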

Part III: Univariate Graphics for Numeric Variables

In the previous section, we covered some of the graphics available for categorical data. There are many, many more available in R. We covered only the more important ones discussed in the text. In this part, we cover two graphics for numeric data: the histogram and the box-and-whiskers plot (boxplot).

To illustrate the histogram, we need to examine a numeric variable. So, let us use the vcrime90 variable, which represents the violent crime rate in each state in 1990. Here is the code to produce a basic histogram:
hist(vcrime90)
Here is the output:
Note that the histogram does tell us some information about the distribution of the violent crime rates. Perhaps it would be better to have more categories:
hist(vcrime90, breaks=11)
Here is the output:
Note that both tell the story that there is an outlier (far right bar). However, the second gives a better “feel” for the actual distribution of the data.
Here, we see what happens when the number of categories is too high. While we do see the individual data points, it is important to note that graphics are supposed to give a summary of the data.
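One detail worth knowing: when breaks is a single number, R treats it as a suggestion and picks "pretty" class boundaries near that count. The sketch below shows how to check what R actually chose. It uses the Assault column of R's built-in USArrests data as a stand-in numeric variable; it is not the course's vcrime90, and the break points in the last call are my own choice.

```r
# Stand-in numeric variable: assault arrests per 100,000 residents
# from R's built-in USArrests data, NOT the course's vcrime90
x = USArrests$Assault

# Ask for about 11 classes without drawing, then inspect what R chose
h = hist(x, breaks = 11, plot = FALSE)
h$breaks           # the class boundaries R actually used
length(h$counts)   # the actual number of classes

# Exact boundaries can be forced by giving a vector of break points
hist(x,
     breaks = seq(from = 0, to = 350, by = 25),
     main   = "Assault arrests (USArrests)",
     xlab   = "Arrests per 100,000")
```

If the number of bars you get differs from the number you asked for, this is why. Supplying the vector of break points gives you full control.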
Now, create a histogram for the unemp2000 variable. Create it using about 10 classes. If you did it right, here is the histogram you got:
If you have some graphing skills, then you got:

SCA 03a

SCA 03a: Basic `R` Graphics

Purpose

Functions

The SCA Procedure

Part O: The Preparations

Part I: Frequencies

Part II: Univariate Graphics for Categorical Variables

Part III: Univariate Graphics for Numeric Variables

Conclusion


Ole J. Forsberg, PhD
Associate Professor, Chair of Data Sciences
Office: SMC E-219 (and OM 105)
Knox College
2 East South Street, Campus Box K-6
Galesburg, IL, USA, 61401-4999
