SCA 03a: Basic R Graphics

Purpose

The purpose of this activity is to show you how to create tables and graphics in R. The R Statistical Environment has a myriad of possible graphics available. These graphics are customizable in every detail. The fact that you can overlay different types of graphics makes R one of the preferred graphics programs for statisticians. This quick activity is not designed to have you create such presentation-worthy graphics. It is only to introduce you to making the basics.

Functions

In this SCA, we will be using the following functions in R. It is useful to keep track of where you were introduced to the functions. By the end of the SCA, you should be able to explain what these functions do.

 


The SCA Procedure

Doing real statistics requires actually doing statistics on real data. That means using a computer and a statistical program. After trying many different programs, R matches my needs as an analyst most closely. It also allows me to easily check your understanding of statistical techniques because it requires you to provide the script (a.k.a. to show your work).

Part 0: The Preparations

The following are common start-up instructions. You will want to always follow them when starting analyses.

  1. Start R and open a new script in your SCA2 folder.
  2. Now, since we will be using some special R functions that do not exist in the base R package, we will need to import them. Making sure you have an Internet connection, run this line:
    source("http://rfs.kvasaheim.com/stat200.R")
    When you run this line, R goes to the URL you specified and runs that code. Here, the code only imports several helpful functions. From now forward, I will assume you run this line for every script in this course.
  3. Load the “crime data set” using the following two lines.
    dt = read.csv("http://rfs.kvasaheim.com/data/crime.csv")
    attach(dt)
    The first line loads the data into the variable dt. The second line “attaches the data,” which makes it easier for us to access the variables in the data file.

That is the end of the zeroth part. All analyses in this course will start in a similar way. The source line is run to give R more functionality, the data are loaded into memory using the read.csv function, and the data are attached to make it easier to access the variables in the data file.
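
The load-and-attach pattern can also be tried without an Internet connection. The following is a minimal sketch, assuming a tiny made-up data set held in a string; the names csvText and toy and the columns state, region, and rate are hypothetical and are not the contents of crime.csv.

```r
# A tiny, made-up CSV held in a string (hypothetical data, not crime.csv)
csvText = "state,region,rate
A,North,410
B,South,655
C,South,530"

# read.csv can read from a string through its text argument
toy = read.csv(text = csvText)

# Before attaching, a column is reached through the data frame
mean(toy$rate)

# After attaching, the column name alone is enough
attach(toy)
mean(rate)

# Detach when finished so the names do not linger on the search path
detach(toy)
```

Attaching is a convenience for short scripts like the ones in this course. The toy$rate form always works, whether or not the data are attached.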

Part I: Frequencies

The first thing we will do is tabulate one of the categorical variables. Let us select the variable census4. This variable measures the location of the state in terms of the four census regions.

  1. To create a frequency tabulation, type:
    table(census4)
    That is all. Once you run that, you should get the following output:
    census4
      Midwest Northeast     South      West
           12         9        17        13
    Thus, from this tabulation, there are 12 states in the Midwest, 9 in the Northeast, 17 in the South, and 13 in the West.
  2. Now, let’s create a tabulation of the categorical variable domPolCulture. This variable classifies each state by its dominant political culture. There are three: individualistic, moralistic, and traditionalistic. According to Salazar (1976), these categories help to explain why certain states pass certain types of laws.

    If you are correct, you should get this output:
    domPolCulture
     Individualistic       Moralistic Traditionalistic
                  17               17               17
    What does this output tell us? Of the states, Salazar classified 17 as individualistic, 17 as moralistic, and 17 as traditionalistic.
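
To see what table gives you beyond the printed output, here is a minimal sketch on a made-up vector. The variable names culture and counts and the values themselves are hypothetical stand-ins, not the course data. The prop.table function is a base R function that is not otherwise used in this SCA.

```r
# Made-up categorical data (hypothetical; not from crime.csv)
culture = c("Moralistic", "Individualistic", "Traditionalistic",
            "Moralistic", "Traditionalistic", "Moralistic")

# Count how often each level occurs; levels are listed alphabetically
counts = table(culture)
counts

# The counts can be used like a named vector
names(counts)           # the level labels
counts["Moralistic"]    # the count for a single level
sum(counts)             # the total number of observations

# Relative frequencies: each count divided by the total
prop.table(counts)
```

Saving the tabulation in a variable, as done here with counts, lets you reuse it later without retyping the table call.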

Part II: Univariate Graphics for Categorical Variables

The first part shows how to tabulate a categorical variable. There are a lot of options available to perform different types of tabulations. However, the basic table function serves our needs for now. I do, however, encourage you to delve into the help file on this powerful function by running the code ?"table" and its close relative ?"tabulate".

In this part, we will use the table function in conjunction with a couple of other functions to produce some nice (default) graphics that illustrate the categorical variable.

  1. Let us now create a basic bar chart of the categorical variable census4.
    barplot(table(census4))
    This bar plot is a graphical representation of the tabulation you did previously. If you would like to change the colors from grey to something else, all you have to do is specify those colors:
    barplot( table(census4), col=c("red","navy","darkgreen","honeydew2") )
  2. To do a pie chart, you use the function pie.
    pie( table(census4), col=c("red","navy","darkgreen","honeydew2") )
  3. Note that there are four colors specified, because there are four categories (levels) in the variable census4. What would I run if I wanted to create a simple bar plot of the domPolCulture variable? If you ran the right line instructing R what to do, the bar plot here should look like the one you got.
    [barplot]


    With additional work, you can make it look like this:
    [barplot]
    Note that the heights of the bars represent the frequency of that level.
  4. Write the code to get a pie chart of this variable. If you are right, here is the pie chart you will see:
    [piechart]


    Again, with some additional work, you can make it look like this:
    [piechart]


    or like this:
    [piechart]
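
The "additional work" mentioned in this part usually amounts to passing a few extra arguments to barplot and pie. Here is a minimal sketch, assuming a made-up variable named region (hypothetical, not one of the crime.csv variables). The arguments main, xlab, ylab, las, and col are standard in base R graphics. The particular titles and colors are only illustrations, not the settings used for the figures above.

```r
# Made-up categorical data (hypothetical; not from crime.csv)
region = c("East", "West", "West", "North", "East", "West")
counts = table(region)

# One color per level, in the same order as the levels of the table
barColors = c("red", "navy", "darkgreen")

# Bar chart with a title and axis labels
barplot(counts,
        col  = barColors,
        main = "Observations by region",
        xlab = "Region",
        ylab = "Frequency",
        las  = 1)               # las=1 prints the axis numbers upright

# Pie chart of the same tabulation, using the same colors
pie(counts,
    col  = barColors,
    main = "Observations by region")
```

Because both functions take the same tabulation, storing it once in counts keeps the two graphics consistent with each other.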

Part III: Univariate Graphics for Numeric Variables

In the previous section, we covered some of the graphics available for categorical data. There are many, many more available in R; we covered only the more important ones from the text. In this part, we cover two graphics for numeric data: the histogram and the box-and-whiskers plot (boxplot).

  1. To illustrate the histogram, we need to examine a numeric variable. So, let us use the vcrime90 variable, which represents the violent crime rate in each state in 1990. Here is the code to produce a basic histogram:
    hist(vcrime90)
    Here is the output:
    [histogram]
    Note that the histogram does tell us some information about the distribution of the violent crime rates. Perhaps it would be better to have more categories:
    hist(vcrime90, breaks=11)
    Here is the output:
    [histogram]
    Note that both tell the story that there is an outlier (far right bar). However, the second gives a better “feel” for the actual distribution of the data.
    [histogram]
    Here, we see what happens when the number of categories is too high. While we do see the individual data points, it is important to note that graphics are supposed to give a summary of the data.
  2. Now, create a histogram for the unemp2000 variable. Create it using about 10 classes. If you did it right, here is the histogram you got:
    [histogram]

    If you have some graphing skills, then you got
    [histogram]
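
When breaks is given as a single number, hist treats it as a suggestion and chooses "pretty" break points, so the number of classes drawn may differ from the number requested. The following is a minimal sketch on made-up values; the vector rates is hypothetical and is not vcrime90. It also draws the box-and-whiskers plot named at the start of this part, using the base R function boxplot.

```r
# Made-up numeric data with one large value (hypothetical; not vcrime90)
rates = c(210, 250, 280, 300, 320, 350, 360, 400, 420, 450,
          480, 500, 530, 560, 600, 640, 700, 760, 820, 2400)

# Default histogram, then one asking for about 10 classes
hist(rates)
h = hist(rates, breaks = 10, main = "Made-up rates", xlab = "Rate")

# hist also returns what it drew: the break points and the count in each class
h$breaks
h$counts
length(h$counts)        # classes actually used; may differ from 10

# Box-and-whiskers plot of the same values; the large value is drawn
# as a separate point beyond the whisker
boxplot(rates, horizontal = TRUE, xlab = "Rate")
```

Looking at h$breaks is a quick way to check which classes R actually used before deciding whether to ask for more or fewer.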

Conclusion

That’s the SCA. Review the objectives and the list of R functions at the top of this SCA. In addition, let me provide one additional lesson:

This is the reason for my emphasis on graphics in this course. You need them to better understand what the data are telling you. You also need them to get your point across to your audience.



Know your audience!

 

This page was last modified on 2 January 2024.
All rights reserved by Ole J. Forsberg, PhD, ©2008–2024. No reproduction of any of this material is allowed without explicit written permission of the copyright holder.