The SCA Procedure
Doing real statistics requires actually doing statistics on real data. That means using a computer and a statistical program. After trying many different programs, I have found that R matches my needs as an analyst most closely. It also allows me to easily check your understanding of statistical techniques because it requires you to provide the script (that is, to show your work).
Part 0: The Preparations
The following are common start-up instructions. You will want to always follow them when starting analyses.
- Start R and open a new script in your SCA2 folder.
- Now, since we will be using some special R functions that do not exist in the base R package, we will need to import them. Making sure you have an Internet connection, run this line:
source("http://rfs.kvasaheim.com/stat200.R")
When you run this line, R goes to the URL you specified and runs that code. Here, the code only imports several helpful functions. From now forward, I will assume you run this line for every script in this course.
- Load the “crime data set” using the following two lines.
dt = read.csv("http://rfs.kvasaheim.com/data/crime.csv")
attach(dt)
The first line loads the data into the variable dt. The second line “attaches the data,” which makes it easier for us to access the variables in the data file.
That is the end of the zeroth part. All analyses in this course will start in a similar way: the source line is run to give R more functionality, the data are loaded into memory using the read.csv function, and the data are attached to make it easier to access the variables in the data file.
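If you would like to see these three steps work without an Internet connection, here is a minimal sketch. It is not the course script: the temporary file, the helper function half, the data frame toy, and all of its values are invented for illustration.

```r
# Step 1: source() runs every line of an R script. A temporary file stands in
# for the course script here; 'half' is a made-up helper function.
scriptFile = tempfile(fileext = ".R")
writeLines("half = function(x) x / 2", scriptFile)
source(scriptFile)

# Step 2: read.csv() turns comma-separated text into a data frame.
# The 'text' argument stands in for a file or URL; the values are invented.
toy = read.csv(text = "region,rate
South,5.1
West,3.2
South,4.8
Midwest,2.9")

# Step 3: attach() lets us use the column names directly.
mean(toy$rate)   # without attaching: go through the data frame
attach(toy)
mean(rate)       # after attaching: the column name alone works
detach(toy)      # undo the attach when finished

half(10)         # the sourced function is now available
```

Both calls to mean give the same answer. Attaching only saves typing; it does not change the data.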
Part I: Frequencies
The first thing we will do is tabulate one of the categorical variables. Let us select the variable census4. This variable measures the location of the state in terms of the four census regions.
- To create a frequency tabulation, type:
table(census4)
That is all. Once you run that, you should get the following output:
census4
  Midwest Northeast     South      West
       12         9        17        13
Thus, from this tabulation, there are 12 states in the Midwest, 9 in the Northeast, 17 in the South, and 13 in the West.
- Now, let’s create a tabulation of the categorical variable domPolCulture. This variable classifies each state by its dominant political culture. There are three: individualistic, moralistic, and traditionalistic. According to Salazar (1976), these categories help to explain why certain states pass certain types of laws.
If you are correct, you should get this output:
domPolCulture
 Individualistic       Moralistic Traditionalistic
              17               17               17
What does this output tell us? Of the states, Salazar classified 17 as individualistic, 17 as moralistic, and 17 as traditionalistic.
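To practice reading a tabulation on something smaller, here is a short sketch. The vector region and its seven values are invented; they are not part of the crime data set.

```r
# Invented region labels for seven hypothetical states.
region = c("South", "West", "South", "Midwest", "Northeast", "South", "West")

counts = table(region)   # one count per category, in alphabetical order
counts

counts["South"]          # pull out a single count by its category name
sum(counts)              # the counts add up to the number of observations
```

Storing the tabulation in a variable (here counts) is optional, but it lets you reuse the counts later without tabulating again.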
Part II: Univariate Graphics for Categorical Variables
The first part shows how to tabulate a categorical variable. There are a lot of options available to perform different types of tabulations. However, the basic table function serves our needs for now. I do, however, encourage you to delve into the help file on this powerful function by running the code ?"table" and its close relative ?"tabulate".
In this part, we will use the table function in conjunction with a couple other functions to produce some nice (default) graphics that illustrate the categorical variable.
- Let us now create a basic bar chart of the categorical variable census4.
barplot(table(census4))
This bar plot is a graphical representation of the tabulation you did previously. If you would like to change the colors from grey to something else, all you have to do is specify those colors:
barplot( table(census4), col=c("red","navy","darkgreen","honeydew2") )
- To do a pie chart, you use the function pie.
pie( table(census4), col=c("red","navy","darkgreen","honeydew2") )
Note that there are four colors specified, because there are four categories (levels) in the variable census4.
- What would I run if I wanted to create a simple bar plot of the domPolCulture variable?
If you ran the right line instructing R what to do, the bar plot here should look like the one you got.
With additional work, you can make it look like this:
- Write the code to get a pie chart of this variable.
If you are right, here is the pie chart you will see:
Again, with some additional work, you can make it look like this:
or like this:
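As one illustration of what “additional work” can mean, here is a sketch that adds a title and axis labels through the standard graphics arguments main, xlab, and ylab. The variable party and its values are invented, and this is not necessarily the code behind the graphics on this page.

```r
# Invented party labels; not taken from the crime data set.
party = c("A", "B", "A", "C", "B", "A")
counts = table(party)

# One color per category: three categories, so three colors.
barColors = c("red", "navy", "darkgreen")

# barplot() draws the chart and returns the horizontal midpoint of each bar.
mids = barplot(counts, col = barColors,
               main = "Party membership (invented data)",
               xlab = "Party", ylab = "Frequency")

# pie() accepts the same tabulation and the same colors.
pie(counts, col = barColors, main = "Party membership (invented data)")
```

Keeping the colors in their own variable makes it easy to use the same color for the same category in both graphics.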
Part III: Univariate Graphics for Numeric Variables
In the previous section, we covered some of the graphics available for categorical data. There are many, many more available in R. We only covered the more important ones discussed in the text. In this part, we cover two graphics for numeric data: the histogram and the box-and-whiskers plot (boxplot).
- To illustrate the histogram, we need to examine a numeric variable. So, let us use the vcrime90 variable, which represents the violent crime rate in each state in 1990. Here is the code to produce a basic histogram:
hist(vcrime90)
Here is the output:
To suggest a different number of classes, add the breaks argument:
hist(vcrime90, breaks=11)
Here is the output
- Now, create a histogram for the unemp2000 variable. Create it using about 10 classes.
If you did it right, here is the histogram you got:
If you have some graphing skills, then you got:
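When breaks is a single number, R treats it as a suggestion and picks “nice” class boundaries near it, so the number of classes you get may differ from the number you asked for. Here is a small sketch for checking what R actually chose. The vector x is invented, and plot = FALSE asks hist to return its numbers without drawing anything.

```r
# Invented values; not the violent crime rates from the course data.
x = c(2, 3, 5, 7, 8, 12, 13, 14, 18, 21, 22, 25, 29, 31, 40)

# plot = FALSE returns the histogram's numbers without drawing it.
h = hist(x, breaks = 4, plot = FALSE)

h$breaks   # the class boundaries R actually chose
h$counts   # how many values fall in each class
```

There is always one more boundary than there are classes, and the counts always add up to the number of observations.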