IS‘25: IS: SCA 03b

SCA 03b: Intermediate `R` Graphics

Purpose

The purpose of this activity is to show you how to create presentation-quality graphics in R. The R Statistical Environment has a myriad of graphics available, and these graphics are customizable in every detail. The fact that you can overlay different types of graphics makes R one of the most common graphics programs among statisticians.

Functions

In this SCA, we will be using the following functions in R. It is useful to keep track of where you were introduced to the functions. By the end of the SCA, you should be able to explain what these functions do. Be clear that these functions allow us to modify the most minute details of graphics produced by R. When performing the same analyses over and over again, requiring a new graphic each time, these are very important!

c
seq
cor.test
plot
hist
histogram*
par
plot.new
plot.window
axis
title
points
dev.copy
dev.off

Those functions with a * are only available if this line is run in your script:
source("http://rfs.kvasaheim.com/stat200.R")
Note that you should not include the * when using those functions.

The SCA Procedure

As said before, doing real statistics requires actually doing statistics on real data. That means using a computer and a statistical program. After trying many different programs, R matches my needs as an analyst most closely. It also allows me to easily check your understanding of statistical techniques because it requires you to provide the script (a.k.a. to show your work).

Part O: The Preparations

The following are common start-up instructions. You will want to always follow them when starting analyses.

Start R and open a new script in your SCA3 folder.
Now, since we will be using some special R functions that do not exist in the base R package, we will need to import them. Making sure you have an Internet connection, run this line:
source("http://rfs.kvasaheim.com/stat200.R")
When you run this line, R goes to the URL you specified and runs that code. Here, the code only imports several helpful functions. From now forward, I will assume you run this line for every script in this course.
Load the “crime data set” using the following two lines.
dt = read.csv("http://rfs.kvasaheim.com/data/crime.csv")
attach(dt)
The first line loads the data into the variable dt. The second line “attaches the data,” which makes it easier for us to access the variables in the data file.

That is the end of the zeroth part. All analyses in this course will start in a similar way. The source line is run to give R more functionality, the data are loaded into memory using the read.csv function, and the data are attached to make it easier to access the variables in the data file.
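If you ever want to check a variable without relying on attach(), the following is a small sketch of two other ways to reach a column. The data frame here is a made-up stand-in (its values are not from crime.csv), so treat it only as an illustration of the syntax.

```r
# Hypothetical stand-in for the crime data; these values are invented
# for illustration and do not come from crime.csv.
dt <- data.frame(
  vcrime90 = c(410, 525, 652, 734, 2458),
  enroll90 = c(91.2, 88.7, 93.5, 90.1, 85.4)
)

# See which variables the data frame holds.
names(dt)

# Option 1: the $ operator names the data frame explicitly.
mean_dollar <- mean(dt$vcrime90)

# Option 2: with() evaluates an expression inside the data frame.
mean_with <- with(dt, mean(vcrime90))

# Both routes reach the same column, so the two results agree.
identical(mean_dollar, mean_with)
```

Either form gives the same answer as mean(vcrime90) would after attach(dt); they simply make the source of the variable explicit.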

Part I: Histograms

In SCA 3a, we saw that the hist() function allowed us to create basic, yet useful, histograms. They were sufficient to allow us to quickly explore the distribution of a numeric variable. There is a more advanced function, the histogram() function, that allows us to create presentation-quality histograms. This part examines that function. By the end of this part, you will create a histogram that looks like this

To get to this histogram, you will learn about the following R functions and graphics parameters.

First, here is the first histogram we created in the previous SCA:

Notice a few things about it that detract from the story it tells.

First, the horizontal axis label is the code name of the variable, not the actual name of the variable. The reader does not necessarily know what “vcrime90” means. A better axis label is “Violent Crime Rate (1990).”
Second, the main title of the histogram is redundant. Clearly, this is a histogram. Clearly, this histogram is of whatever the horizontal label is.
Third, the vertical axis is also not necessary. As this is a histogram, the bar heights are either frequencies or relative frequencies. From our readings, we know that both will give the exact same histogram shape. Thus, this information is not needed. Some will make a good argument that the vertical axis gives a hint to the sample size and is thus needed. However, I would argue that it only gives a hint to the sample size. As such, the sample size needs to be given in the text (and caption) anyway.
Fourth, the bars do not reach to the x-axis. While this detail is rather minor, it does bring into question how well the bars reflect the actual counts in each class.
Finally, and probably least importantly, the bars are grey. While this is a minor point, one may want to specify the colors to match other aspects of the analysis.

Here is the code that will produce the opening histogram.


par(mar=c(4,1,1,1) )
par(family="serif")
par(font.lab=2, cex.lab=1.1, cex.axis=0.9)

histogram(vcrime90, col="green4", breaks=seq(0,2500,125), xlim=c(0,2500) )

axis(1)
title(xlab="Violent Crime Rate (1990)", line=2.5)

Note that this code is split into three sections: defining, creating, adding. The following looks at these sections.

Plotting Section A: Defining the Plot

In the above code, “defining the plot” consists of the lines


par(mar=c(4,1,1,1))
par(family="serif")
par(font.lab=2, cex.lab=1.1, cex.axis=0.9)

Note that these are applications of the par() function. All code that goes into this plotting section will be uses of the par() function, because it defines the parameters of the graphic that you will be creating in the future.

The first use here sets the margins of the plot. The margins are defined in units of “lines” and the sides are numbered 1, 2, 3, and 4, corresponding to the bottom, left, top, and right edges of the graphic. Thus, par(mar=c(4,1,1,1)) tells R to give the next graphic a margin of 4 lines at the bottom and 1 line on the other three sides. This will allow the graphic to fill more of the plotting window than the default margins do.

The second use specifies the font family to use. By default, R uses a sans-serif font. Since I am writing my paper in a serif font (like Times New Roman), I should have my graphics use the same font family. This is how to set it.

Useful values for family include serif, sans, and mono. The value serif provides a serif font to complement body text that is also a serif font (this paragraph is in serif). The value sans provides a sans-serif font for a body text that is like Calibri or Arial (the other paragraphs on this page are in a sans-serif font). Finally, the value mono provides a monospaced font similar to Courier (the code lines are in a monospaced font). Choose the option that matches your body text.

The third line actually sets three parameters at once (you can set as many in a single line as you wish, with each separated by a comma). It sets the font for the labels to be 2 (bold). It sets the size of the labels to be 10% larger than normal. It sets the size of the values along the axes to be 10% less than normal. The options for font.lab are 1 (normal), 2 (bold), 3 (italic), and 4 (bold italic).
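Because par() settings stay in effect until they are changed again, it can help to know how to inspect and undo them. The following is a minimal sketch, assuming a throwaway device opened with pdf(NULL) so that nothing is written to disk; the variable names old, new_mar, and restored_mar are just for illustration.

```r
pdf(NULL)  # throwaway graphics device: no window and no file

# par() returns the previous values of whatever it changes,
# so they can be saved for later.
old <- par(mar = c(4, 1, 1, 1), family = "serif",
           font.lab = 2, cex.lab = 1.1, cex.axis = 0.9)

new_mar <- par("mar")   # the margins now in effect: 4 1 1 1
par("font.lab")         # 2, that is, bold axis labels

# Restore the earlier settings once the graphic is finished.
par(old)
restored_mar <- par("mar")  # back to the defaults: 5.1 4.1 4.1 2.1

dev.off()
```

Calling par("mar") with a quoted name only asks for the current value; it changes nothing.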

Plotting Section B: Creating the Plot

The second plotting section actually creates the plot. There are only a few functions that can be used in this section. The graphing functions covered in SCA 1a are all included. So, too, are the histogram() and the plot() functions.

The functions that go here actually create the basic plot, given the parameters you set above. In the above code, the line


histogram(vcrime90, col="green4", breaks=seq(0,2500,125), xlim=c(0,2500))

produces a histogram of the variable vcrime90 (the violent crime rate in 1990). It specifies the bar colors to be dark green (green4). The histogram breaks, which define the classes, are at 0, 125, 250, 375, 500, …, 2375, 2500 (i.e., starting at 0, ending at 2500, and of width 125). The histogram only displays from 0 to 2500.

Note that the breaks must include all of the data… must. On the other hand, the xlim parameter does not. Thus, one can easily exclude outlier Washington, DC, from the graphic by changing the last part to

xlim = c(0,1500)
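To see the difference between breaks and xlim in action, here is a small sketch. It uses the base hist() function rather than histogram() (so it runs without the course script), and the rates vector is invented for illustration, with its last value standing in for the outlier.

```r
pdf(NULL)  # throwaway graphics device: no window and no file

# Hypothetical violent crime rates; the last value plays the outlier.
rates <- c(120, 340, 410, 525, 652, 734, 980, 1210, 2458)

breaks <- seq(0, 2500, 125)
length(breaks)   # 21 break points, hence 20 classes

# The breaks span all of the data; xlim merely crops what is displayed.
h <- hist(rates, breaks = breaks, xlim = c(0, 1500), col = "green4")
sum(h$counts)    # 9: every observation is still counted

# Breaks that stop short of the largest value make hist() raise an error.
too_short <- tryCatch(
  hist(rates, breaks = seq(0, 1500, 125)),
  error = function(e) "error"
)
too_short

dev.off()
```

The outlier is counted in the first histogram even though it is not displayed, which is why narrowing xlim is safe while narrowing breaks is not.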

Plotting Section C: Adding to the Plot (Annotation)

The functions that go here add important information to the plot you created in Plotting Section B. Some important functions here include the axis and title functions, as above


axis(1)
title(xlab="Violent Crime Rate (1990)", line=2.5)

The axis function plots the axis. The only required parameter is the side. In this example, it is 1, the bottom axis (x-axis). One can also specify where to put the tick marks (values) and what to put at those values. Thus, we could also have specified the axis function as


axis(1, at=seq(0,2500,500), labels=seq(0,2500,500))

If we wanted more precision in our graphic (more values along the x-axis), we could have written


axis(1, at=seq(0,2500,250), labels=seq(0,2500,250))

or even


axis(1, at=seq(0,2500,125), labels=seq(0,2500,125))

With that said, I am content with values every 500 points. You should explore possible alternatives to see which tell the story best. The graphics you create reflect your decisions, as well as your data. Be aware of that. Just as you judge graphics, so too do others.

The second line in the above code provides the label for the x-axis. Specifying it here allows you to tell the reader the variable, not its code. It also allows you to specify how far from the axis to place it (line=, measured in margin lines outward from the plot). The defaults for the line parameter are not bad. I just personally prefer to space the labels in a more aesthetically pleasing place (to me). For the x-axis, the line is 2.5. For the y-axis, it is 2.75.
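The tick labels do not have to be the same numbers as the tick positions. As a sketch of that idea, the following formats the labels with thousands separators. It again uses base hist() with invented data, and the names ticks and tick_labels are only illustrative.

```r
pdf(NULL)  # throwaway graphics device: no window and no file

# Hypothetical violent crime rates, invented for illustration.
rates <- c(120, 340, 410, 525, 652, 734, 980, 1210, 2458)

par(mar = c(4, 1, 1, 1))

# Draw the bars only; axes and labels are added afterwards.
hist(rates, breaks = seq(0, 2500, 125), col = "green4",
     main = "", xlab = "", ylab = "", axes = FALSE)

# Tick positions (at) and the text printed at them (labels) are separate.
ticks <- seq(0, 2500, 500)
tick_labels <- format(ticks, big.mark = ",", trim = TRUE)
tick_labels   # "0" "500" "1,000" "1,500" "2,000" "2,500"

axis(1, at = ticks, labels = tick_labels)
title(xlab = "Violent Crime Rate (1990)", line = 2.5)

dev.off()
```

Whether the separators help is a judgment call; the point is only that at= controls where the marks go and labels= controls what is written there.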

And so, putting all of this together gives this histogram

Part II: Scatter Plot

A second important graphic to learn is the scatter plot. You have seen scatter plots for many years already. They are very useful for seeing (and illustrating) the relationship between two numeric variables. In this section, we will create basic scatter plots and modify them to make them suitable for your readers.

Note that I am using this section to make sure that you are working through these statistical computing activities and thinking about them. I am introducing a couple functions that you will need for the first practicum activity.

In this section, let us look at the relationship between the violent crime rate in 2000 and the school enrollment in 1990. Since the enrollment is measured in 1990 and the violent crime rate is measured in 2000, the enrollment will be the independent variable and the violent crime rate will be the dependent variable.

Here is how to create a basic scatter plot


plot(enroll90, vcrime00)

Note that this is in the form of plot(x, y). The following is the graphic produced by that one line.

Again, this is excellent for the researcher to get a feel for the data. Again, let us note some things that could be improved upon for the reader.

First, the axis labels are code and not actual variable names.
Second, the axis values are parallel to the axis, as opposed to parallel to the body text (the y-axis values should be turned).
Finally, we may want to make the plotting characters something other than bare circles, ○.

Here is the final scatter plot I made

You can decide how it is better than the initial plot. You can also decide what you would like to change on it to tell the story of the data better.

By the way, note that the year is in parentheses and the units are in brackets. This is the expected notation in scatter plots… not just for me but for your professional readers. This is an interesting graphic. It does illustrate the correlation between the two variables, which we can have R calculate with


cor.test(enroll90, vcrime00)

We get the correlation between the two numeric variables as


0.1152993

This is a very weak level of correlation (relationship) between the school enrollment in 1990 and the violent crime rate in 2000. In the future, you will learn its statistical meaning, but the output also tells us that there is no evidence that there actually is a relationship between these two variables:


p-value = 0.4204

Since that p-value is greater than 0.05, we do not have significant evidence that the correlation differs from zero (no linear relationship). By the way: When we move through the fourth learning module of the course, you should be able to explain what this last sentence means and what the p-value means. For now, simply accept that a p-value greater than 0.05 means there is no real evidence of a relationship between these two variables.
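The two numbers quoted above can also be pulled out of the cor.test result directly instead of being read off the printed output. Here is a minimal sketch using simulated data (not the crime data); the seed and the variable names enroll, crime, r_hat, and p_val are arbitrary choices for illustration.

```r
set.seed(200)  # arbitrary seed so the simulated data are reproducible

# Simulated stand-ins for two numeric variables; not the crime data.
enroll <- runif(50, min = 80, max = 100)
crime  <- rnorm(50, mean = 500, sd = 200)

# cor.test() returns a list; store it instead of only printing it.
result <- cor.test(enroll, crime)

r_hat <- unname(result$estimate)  # the sample correlation
p_val <- result$p.value           # the p-value of the test
ci    <- result$conf.int          # 95% confidence interval for the correlation

c(correlation = r_hat, p.value = p_val)
```

With the attached crime data, the same pattern would be result <- cor.test(enroll90, vcrime00), followed by result$estimate and result$p.value.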

How did I get the above graphic? Here is the code. Again, note that I have put it into three separate sections:


par(mar=c(4,4,1,1))
par(family="serif")
par(font.lab=2, cex.lab=1.25, cex.axis=0.9)
par(las=1)
par(xaxs="i", yaxs="i")

plot.new()
plot.window( xlim=c(80,105), ylim=c(0,1750) )

axis(1)
axis(2, at=seq(0,2000,250))
title(xlab="School Enrollment (1990) [%]", line=2.75)
title(ylab="Violent Crime Rate (2000)", line=3)
points(enroll90, vcrime00, pch=21, bg="tomato", cex=1.25)

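There are some lines in the first plotting section that are new. The parameter values xaxs="i" and yaxs="i" specify that there should be no internal spacing in the plot. By default, R pads each axis range by 4% at both ends; the value "i" removes that padding. This allows the axes to actually meet at the lower limits given to plot.window (here, 80 on the x-axis and 0 on the y-axis) instead of slightly beyond them.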

The second new parameter line forces las=1. This causes the values on the axes to be horizontal. This makes it (slightly) easier on your reader; there is no need for them to adjust their orientations.

The “Creating the Plot” plotting section consists of two generic functions. The first just tells R to create the plot (plot.new). The second specifies the x- and y-axis limits (plot.window).

When dealing with scatter plots, it is best to produce the graphic in this method, as it allows you to more easily modify every part of the graphic.
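One way to see what xaxs and yaxs do is to ask R for the coordinates of the plotting region, par("usr"), after calling plot.window. The following sketch compares the default style "r" with the style "i" on a throwaway device; the names usr_padded and usr_exact are just for illustration.

```r
pdf(NULL)  # throwaway graphics device: no window and no file

# Default style "r": R pads each axis range by 4% at both ends.
par(xaxs = "r", yaxs = "r")
plot.new()
plot.window(xlim = c(80, 105), ylim = c(0, 1750))
usr_padded <- par("usr")   # 79 106 -70 1820

# Style "i": the plotting region matches the requested limits exactly.
par(xaxs = "i", yaxs = "i")
plot.new()
plot.window(xlim = c(80, 105), ylim = c(0, 1750))
usr_exact <- par("usr")    # 80 105 0 1750

dev.off()
```

The four values are the left, right, bottom, and top edges of the plotting region. With "i", the bottom edge sits exactly at 0, so the x-axis is drawn at a violent crime rate of zero.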

The “Adding to the Plot” section consists of the final five lines. The first two draw the axes (with the y-axis having marks from 0 to 2000 in steps of 250). The next two label the axes. The last one draws the points. The points are dots filled with the color “tomato.” Many other colors are possible, like “chocolate,” “springgreen,” and “thistle.” For an entire list of named colors, just run the line


colors()

to see all 657 color names available.
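Scrolling through the full list is tedious, so here is a short sketch of ways to search it. All of the functions used are part of base R; the variable name greens is arbitrary.

```r
# The number of named colors.
length(colors())                       # 657

# Check that the color names used in this SCA are valid.
c("green4", "tomato") %in% colors()    # TRUE TRUE

# Find every color name that contains "green".
greens <- grep("green", colors(), value = TRUE)
head(greens)

# Look up the red, green, and blue components of a named color.
tomato_rgb <- col2rgb("tomato")
tomato_rgb                             # red 255, green 99, blue 71
```

Replacing "green" with "blue", "red", or "gr[ae]y" in the grep line searches for other families of colors.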

The End

That is all for this SCA. I strongly encourage you to read the following to make your graphics look more uniform.

PS: Here is a quick note on the philosophy behind R’s graphing. R sees plots the way a painter sees a painting. The idea for the drawing is visualized first. The parameters of the painting are specified (size, margin, what font to use, etc.). The painting is started. The first layer is laid, then the second, then the third, etc. Happy little trees are placed next to the mighty river. An eagle is put on the tree. All things are added on top of what already exists. And, if there is an error, the canvas is thrown out and the painter starts over.

PPS: A natural question is “How do we get the graphics from R to Word?” There are a couple ways. I find the following most helpful/useful, especially when I am working on my Mac:


dev.copy(png, filename="plot.png", width=6, height=6, units="in", res=600)
dev.off()

The first line opens a file on your computer called plot.png. It is a png file (a type of image). This image is 6 inches wide by 6 inches high with a resolution of 600 dpi (a nice size for printing graphics). The second line closes/saves that file and makes it available for you to import into Word. Both lines are necessary to create an image file to import into Word (or the like).

Note that you should change the name of the file to something descriptive and not keep it “plot.png”. You should leave the resolution alone as well as the file type. You may want to change the width and height to match your needs. The width and height for the green histogram were 8 and 4 to help it better fit the place I was putting it.
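Another route to the same kind of file, shown here only as an alternative sketch, is to open the png device first, draw the graphic into it, and then close it. This assumes your R installation has png support (the code checks with capabilities("png")). The data and the file name are invented for illustration, and the file is written to a temporary folder so nothing in your SCA3 folder is overwritten.

```r
# Hypothetical violent crime rates, invented for illustration.
rates <- c(120, 340, 410, 525, 652, 734, 980, 1210, 2458)

# A descriptive file name, placed in R's temporary folder for this sketch.
out_file <- file.path(tempdir(), "violent-crime-histogram.png")

if (capabilities("png")) {
  # Open the file first: 8 inches wide, 4 inches high, 600 dpi.
  png(filename = out_file, width = 8, height = 4, units = "in", res = 600)

  # Everything drawn now goes into the file instead of onto the screen.
  par(mar = c(4, 1, 1, 1))
  hist(rates, breaks = seq(0, 2500, 125), col = "green4",
       main = "", xlab = "", ylab = "", axes = FALSE)
  axis(1)
  title(xlab = "Violent Crime Rate (1990)", line = 2.5)

  # Close the device; the file is not complete until this runs.
  dev.off()
}

file.exists(out_file)
```

The dev.copy approach lets you look at the graphic on screen before saving it, which is often more convenient while you are still adjusting it.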

PPPS: Note that there are three lines common to the graphics above (and a fourth that is common for scatter plots). To make things easier, I created the theme function. So, in place of typing


par(xaxs="i", yaxs="i")
par(family="serif", las=1)
par(mar=c(3,3,0,1)+1)
par(cex=1, cex.lab=1.2, cex.axis=0.9, cex.sub=0.9, cex.main=1.2 )
par(font=1, font.lab=2, font.axis=1, font.sub=3, font.main=2)
par(bg="transparent", xpd=NA, pch=20, col=1)

in the parameter section of your graphic, you would just type


theme()

So, the following are two options for creating the scatter plot above


par(mar=c(4,4,1,1))
par(family="serif")
par(font.lab=2, cex.lab=1.1, cex.axis=0.9)
par(las=1)
par(xaxs="i", yaxs="i")

plot.new()
plot.window( xlim=c(80,105), ylim=c(0,1750) )

axis(1)
axis(2, at=seq(0,2000,250))
title(xlab="School Enrollment (1990) [%]", line=2.75)
title(ylab="Violent Crime Rate (2000)", line=3)
points(enroll90, vcrime00, pch=21, bg="tomato", cex=1.25)

or


theme()
par(mar=c(4,4,1,1))

plot.new()
plot.window( xlim=c(80,105), ylim=c(0,1750) )

axis(1)
axis(2, at=seq(0,2000,250))
title(xlab="School Enrollment (1990) [%]", line=2.75)
title(ylab="Violent Crime Rate (2000)", line=3)
points(enroll90, vcrime00, pch=21, bg="tomato", cex=1.25)

… your choice.
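If you are curious how a function like theme() can be put together, the following is a rough sketch that simply wraps the six par() lines listed above in a function. The name my_theme is hypothetical, and the actual theme() function in the course script may differ in its details.

```r
# Hypothetical stand-in for theme(); the course's own function may differ.
my_theme <- function() {
  par(xaxs = "i", yaxs = "i")
  par(family = "serif", las = 1)
  par(mar = c(3, 3, 0, 1) + 1)
  par(cex = 1, cex.lab = 1.2, cex.axis = 0.9, cex.sub = 0.9, cex.main = 1.2)
  par(font = 1, font.lab = 2, font.axis = 1, font.sub = 3, font.main = 2)
  par(bg = "transparent", xpd = NA, pch = 20, col = 1)
  invisible(NULL)  # the function is called for its side effects only
}

pdf(NULL)  # throwaway graphics device: no window and no file

my_theme()
theme_mar <- par("mar")   # c(3,3,0,1)+1 gives 4 4 1 2
theme_las <- par("las")   # 1: axis values are horizontal

# Any par() call made afterwards overrides the themed value.
par(mar = c(4, 4, 1, 1))
final_mar <- par("mar")   # 4 4 1 1

dev.off()
```

This also shows why the second option above still includes par(mar=c(4,4,1,1)) after theme(): the later call replaces the margins that the theme set.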

The Graphics Checklist

At long last, here is a checklist for you to follow when creating your graphics. Since presentation of ideas is an important aspect of statistics, graphics are graded. In the interest of transparency, here is what I will be checking when grading your graphics:

Theme used
Color used
No main title on the graphic
Axis label(s) are descriptive

This last checkpoint means that the variable name should not be included. Instead of vcrime90, you should use “Violent Crime Rate (1990).” Similarly, instead of census4, you should use “Census Region.” Why? The graphic is much more readable.

FINAL NOTE

I recommend revisiting this SCA each time you need to create a graphic.


Ole J. Forsberg, PhD, Associate Professor, Chair of Data Sciences. Office: SMC E-219 (and OM 105). Knox College, 2 East South Street, Campus Box K-6, Galesburg, IL, USA, 61401-4999
