R Plots Using plot()

In this post, an example of creating multiple plots on a page, using the plot() function in R, is presented. Four plots are generated using data from government websites, including the data from last month's post.

To start, the library ‘TeachingDemos’ is loaded. Then the layout for the plots is entered.

library("TeachingDemos")
layout(matrix(c(1,2,3,4),2,2, byrow=T))

‘TeachingDemos’ contains the function shadowtext(), which is used in the third and fourth plots. In layout(), the matrix() function creates a two by two matrix with 1 and 2 in the first row and 3 and 4 in the second row. By default, matrix() reads data into the matrix by columns rather than by rows, so the argument ‘byrow=T’ is necessary to read the data in by rows. The four plots will be created one at a time below. The first plot will be put into the location with value 1, the second with value 2, and so forth.
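As a quick check of the fill order described above, the following sketch builds the same matrix with and without ‘byrow’ and prints both. The object names ‘by_row’ and ‘by_col’ are made up for the illustration.

```r
# Layout matrix as used in the post: filled row by row.
by_row <- matrix(c(1, 2, 3, 4), 2, 2, byrow = TRUE)
# Default behaviour of matrix(): filled column by column.
by_col <- matrix(c(1, 2, 3, 4), 2, 2)

print(by_row)  # first row holds 1 and 2, second row holds 3 and 4
print(by_col)  # first column holds 1 and 2, second column holds 3 and 4
```

If ‘by_col’ were passed to layout(), the second plot would land below the first rather than beside it.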

The first plot is of bankruptcy data, from the US courts website http://www.uscourts.gov/statistics-reports/caseload-statistics-data-tables, and gives types of bankruptcies in Iowa from 2001 to 2015. Most bankruptcies caused by medical expenses are chapter 7 nonbusiness bankruptcies, and most chapter 7 nonbusiness bankruptcies have medical expenses as a contributing factor. The data is given first.

> bankruptcies_ia
Year overalltotal total7 total11 total12 total13 bustotal bus7 bus11
1 2001 11076 10459 28 7 582 289 243 28
2 2002 11808 11186 26 9 587 354 309 25
3 2003 12582 11895 28 12 647 323 264 27
4 2004 13082 12430 23 1 628 360 324 23
5 2005 18709 17771 18 6 914 455 412 18
6 2006 4891 4316 11 3 561 208 182 9
7 2007 7036 6275 11 6 744 243 213 11
8 2008 8125 7383 14 2 726 342 317 14
9 2009 10171 9354 21 5 789 384 351 21
10 2010 9829 9013 36 10 770 381 329 33
11 2011 7965 7231 24 10 700 356 314 24
12 2012 6411 5803 19 9 578 255 223 17
13 2013 5747 5240 10 4 493 230 209 10
14 2014 5079 4607 20 4 448 170 139 19
15 2015 4535 4054 8 14 459 187 154 8
bus12 bus13 nonbustotal nonbus7 nonbus11 nonbus13
1 7 11 10787 10216 0 571
2 9 11 11454 10877 1 576
3 12 20 12259 11631 1 627
4 1 12 12722 12106 0 616
5 6 19 18254 17359 0 895
6 3 14 4683 4134 2 547
7 6 13 6793 6062 0 731
8 2 9 7783 7066 0 717
9 5 5 9787 9003 0 784
10 10 9 9448 8684 3 761
11 10 8 7609 6917 0 692
12 9 5 6156 5580 2 573
13 4 7 5517 5031 0 486
14 4 8 4909 4468 1 440
15 14 11 4348 3900 0 448

Next the plot is created. In the function plot(), the first argument is the vector of x values and the second the vector of y values. In the object ‘bankruptcies_ia’, the variable ‘Year’ is in the first column and chapter 7 nonbusiness bankruptcies are in the thirteenth column. The bankruptcy numbers are divided by 1000 for clarity in the y axis labels. The type of plot is set to a line plot with the argument ‘type="l"’. The x and y labels are set by ‘xlab’ and ‘ylab’. The argument ‘main’ gives the heading of the plot. Because the heading is split over two lines in the call, it also appears on two lines in the plot. The color and width of the plotted line are given by ‘col="red4"’ and ‘lwd="2"’. The color of the main heading is given by ‘col.main="darkred"’. The colors of the labels and axes are similarly set to "darkred" with ‘col.lab’ and ‘col.axis’. The box around the plot and the tick marks are set to grey with the argument ‘fg="grey"’. The font for the heading is set to italic with the argument ‘font.main=3’. The limits of the y axis are set to zero and twenty with ‘ylim=c(0,20)’.

plot(bankruptcies_ia[,1], bankruptcies_ia[,13]/1000, type="l",
xlab= "Year",
ylab= "Thousand", main = "Number of Chapter 7 Nonbusiness
Bankruptcies in IA 2001 to 2015", col="red4", lwd="2",
col.main="darkred", col.lab="darkred", col.axis="darkred",
fg="grey", font.main=3, ylim=c(0,20))
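The column can also be picked out by name instead of by position, which may be easier to read. The sketch below retypes only the first two rows of the table above into a small data frame, here called ‘bk’ (a name made up for this illustration), and checks that the thirteenth column is ‘nonbus7’.

```r
# First two rows of the bankruptcy table shown above, retyped by hand.
bk <- data.frame(
  Year = c(2001, 2002), overalltotal = c(11076, 11808),
  total7 = c(10459, 11186), total11 = c(28, 26), total12 = c(7, 9),
  total13 = c(582, 587), bustotal = c(289, 354), bus7 = c(243, 309),
  bus11 = c(28, 25), bus12 = c(7, 9), bus13 = c(11, 11),
  nonbustotal = c(10787, 11454), nonbus7 = c(10216, 10877),
  nonbus11 = c(0, 1), nonbus13 = c(571, 576)
)

which(names(bk) == "nonbus7")     # position of the chapter 7 nonbusiness column
identical(bk[, 13], bk$nonbus7)   # both expressions select the same column
bk$nonbus7 / 1000                 # values in thousands, as plotted
```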

The plot is given below. Note, the plot is very plain.

plot 1

To add interest to the plot, the area below the line is filled with a polygon of diagonal lines, done with a call to the function polygon(). In R, new plotting commands can continue to add to a plot until a new plot() function is called (unless R is told not to refresh on a new plot() call). The first argument to polygon() is the vector of x values and the second argument is the vector of y values, giving the vertices of the polygon. In the bankruptcy plot, the polygon starts and ends at the lower left corner. The ‘density’ argument gives the density of the diagonal lines. The color of the lines is set to ‘"red4"’ and the line width to ‘1’.

polygon(c(2001,2001:2015, 2015),
c(0, bankruptcies_ia[,13]/1000, 0),
density=8, col="red4", lwd=1)
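To see how the vertices are put together, here is a small sketch with a made-up series of five values (the vectors ‘x’ and ‘y’ are not from the data above). The polygon starts at zero at the first year, follows the line, and drops to zero again at the last year. The call to pdf(NULL) only keeps the sketch from opening a window or writing a file.

```r
x <- 2001:2005
y <- c(3, 5, 4, 6, 2)             # made-up values, for illustration only

# Vertices: zero at the first x, along the line, zero at the last x.
px <- c(x[1], x, x[length(x)])
py <- c(0, y, 0)

pdf(NULL)                          # null device: no file is written
plot(x, y, type = "l", ylim = c(0, 7))
polygon(px, py, density = 8, col = "red4", lwd = 1)
invisible(dev.off())
```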

The plot is given below. Note, the plot is easier to evaluate now.

plot 2

The second plot is of property taxes in Iowa for the years 2004 to 2014. The source of the data is at https://www2.census.gov/govs/local/. See the previous blog post for more information on the individual, yearly files. There is nothing new in the code for the second plot so the code is given without comment, starting with the data.

> deflated.property.tax
[1] 3566090 3573975 3557993 3696317 3720224 4043835 4167853 4275320
[9] 4302292 4341761 4344914

plot(2004:2014, deflated.property.tax/1000000, ylim=c(0,6),
ylab="Billion Dollars", xlab="Year", col="red4",
lwd="2", main="Total IA Property Taxes\n2004 to 2014",
col.main="darkred",
col.lab="darkred", col.axis="darkred",
font.main=3, fg="grey", type="l")
polygon(c(2004,2004:2014, 2014),
c(0, deflated.property.tax/1000000, 0),
density=8, col="red4", lwd=1)

The third plot is of non-capital expense spending on hospitals in Iowa from 2004 to 2014. The source of the data is the same as for the previous plot.

The plot is a little more complex. The call to plot() is straightforward, but it is followed by a call to lines(), which plots a second line, and then by two calls to polygon(), one for each of two polygons. The data is given below.

> deflated.hospital.expend
year sl s l
[1,] 2004 1644823 734475.9 910347.6
[2,] 2005 1720450 725427.8 995022.7
[3,] 2006 1816380 805646.4 1010733.1
[4,] 2007 1896899 818614.6 1078284.9
[5,] 2008 2040809 854933.6 1185875.8
[6,] 2009 2311303 956767.1 1354536.3
[7,] 2010 2365028 955301.7 1409726.0
[8,] 2011 2470684 1030070.0 1440613.7
[9,] 2012 2643324 1254339.0 1388984.7
[10,] 2013 2676588 1270130.9 1406456.6
[11,] 2014 2796166 1342094.1 1454072.3
>
>
> deflated.hospital.capital.expend
year sl s l
[1,] 2004 125433 76164 49269
[2,] 2005 155842 76703 79139
[3,] 2006 128063 73105 54958
[4,] 2007 170249 96844 73405
[5,] 2008 172931 86148 86783
[6,] 2009 226999 103459 123540
[7,] 2010 223355 59911 163444
[8,] 2011 254702 87990 166712
[9,] 2012 317178 140569 176609
[10,] 2013 315689 126562 189127
[11,] 2014 322188 146256 175932
>

First, two lines are plotted, one for state plus local and one for local. See the plot below the code.

plot(2004:2014, deflated.hospital.expend[,2]/1000000 -
deflated.hospital.capital.expend[,2]/1000000,
ylim=c(0,3.5), ylab="Billion Dollars", xlab="Year",
col=c("red4"), col.main="darkred",
col.lab="darkred", col.axis="darkred",
main="IA Government Hospital Expenditures\n2004 to 2014",
lwd="2", font.main=3, type="l", fg="grey")
lines(2004:2014, deflated.hospital.expend[,4]/1000000-
deflated.hospital.capital.expend[,4]/1000000, lwd="2",
col=c("red1"))

plot 3

Next the two polygons are plotted. The first call to polygon() plots the top polygon. Note that the angle of the lines is 135 degrees rather than the default 45 degrees used in the first two plots. The second call to polygon() plots the bottom polygon. For the second polygon, the angle is 45 degrees and the color is ‘red1’ rather than ‘red4’. By plotting the lower polygon second, the color of the lower line is still ‘red1’.

polygon(c(2004,2004:2014, 2014:2004),
c(deflated.hospital.expend[1,4]/1000000-
deflated.hospital.capital.expend[1,4]/1000000,
deflated.hospital.expend[,2]/1000000-
deflated.hospital.capital.expend[,2]/1000000,
deflated.hospital.expend[11:1,4]/1000000-
deflated.hospital.capital.expend[11:1,4]/1000000),
density=8, col="red4", lwd=1, angle=135)
polygon(c(2004,2004:2014, 2014),
c(0, deflated.hospital.expend[,4]/1000000-
deflated.hospital.capital.expend[,4]/1000000,
0), density=8, col="red1", lwd=1)
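The idea behind the upper polygon is to walk forward along the upper line and then back along the lower line. The sketch below does this with made-up ‘upper’ and ‘lower’ vectors and uses rev() in place of the 11:1 indexing. Unlike the call above, it does not repeat the starting vertex, since polygon() joins the last point to the first on its own.

```r
x <- 2004:2008
upper <- c(1.5, 1.6, 1.7, 1.8, 1.9)   # made-up values for the top line
lower <- c(0.8, 0.9, 0.9, 1.0, 1.1)   # made-up values for the bottom line

# Forward along the upper line, then backward along the lower line.
band_x <- c(x, rev(x))
band_y <- c(upper, rev(lower))

pdf(NULL)                              # null device: no file is written
plot(x, upper, type = "l", ylim = c(0, 2), col = "red4")
lines(x, lower, col = "red1")
polygon(band_x, band_y, density = 8, angle = 135, col = "red4")
polygon(c(x[1], x, x[length(x)]), c(0, lower, 0), density = 8, col = "red1")
invisible(dev.off())
```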

plot 4

Last, some text is added to the plot. The function shadowtext() puts a shadow around text and is found in the package ‘TeachingDemos’. The first argument of shadowtext() is the placement of the text on the x axis, the second the placement on the y axis, and the third the text itself. The argument ‘bg="white"’ sets a white shadow. The argument ‘r=.3’ sets the size of the shadow as a proportion of the size of the text. The function text() is similar to shadowtext(), except that it draws no shadow. See the plot below the code.

shadowtext(2011, .75,"Local", col="red1", bg="white", r=.3, font=1.5)
shadowtext(2009, 1.7,"State", col="red4", bg="white", r=.3, font=1.5)
text(2007.5, 2.9,"Minus Capital Outlays", col="darkblue", font=1.5)
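For readers without ‘TeachingDemos’ installed, a similar effect can be roughed out in base R by drawing the label several times in the background color at small offsets and then once in the foreground color. The function ‘halo_text’ below is a made-up name and only an approximation of the idea; it is not the package's code.

```r
# Rough base-R stand-in for a shadowed label (hypothetical helper).
halo_text <- function(x, y, labels, col = "black", bg = "white", r = 0.3) {
  xo <- r * strwidth("A")            # horizontal offset, relative to text size
  yo <- r * strheight("A")           # vertical offset, relative to text size
  # Eight copies of the label in the background color, placed around (x, y).
  for (theta in seq(0, 2 * pi, length.out = 9)[-9]) {
    text(x + cos(theta) * xo, y + sin(theta) * yo, labels, col = bg)
  }
  text(x, y, labels, col = col)      # the label itself, drawn last
}

pdf(NULL)                            # null device: no file is written
plot(2004:2014, seq(0, 3, length.out = 11), type = "l")
halo_text(2011, 0.75, "Local", col = "red1")
invisible(dev.off())
```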

plot 5

The fourth plot is of health expenditures by the government in Iowa.  The code does not use anything new, so no comments are made on the code. The data is from the same source as the last two plots and is given below.

> deflated.health.expend
Year StateLocal State Local
[1,] 2004 416871.7 104597.30 312274.4
[2,] 2005 418761.0 98568.17 320192.9
[3,] 2006 466162.7 105526.68 360636.0
[4,] 2007 488350.4 130628.21 357722.2
[5,] 2008 509933.5 135971.63 373961.9
[6,] 2009 538308.7 145292.52 393016.1
[7,] 2010 530603.7 135314.36 395289.3
[8,] 2011 525645.7 129333.71 396312.0
[9,] 2012 568904.7 132340.68 436564.1
[10,] 2013 410256.4 131329.61 278926.8
[11,] 2014 391885.7 124200.98 267684.7
>

The code follows.  Below the code is the final figure, with the four plots.


plot(2004:2014, deflated.health.expend[,2]/1000000, ylim=c(0,.8),
ylab="Billion Dollars", xlab="Year",
col="red4", font.main=3, fg="grey",
col.main="darkred",
col.lab="darkred", col.axis="darkred", type="l",
main="IA Government Health Expenditures\n2004 to 2014",
lwd="2")
lines(2004:2014, deflated.health.expend[,4]/1000000, lwd="2", col="red1")
polygon(c(2004,2004:2014, 2014:2004),
c(deflated.health.expend[1,4]/1000000,
deflated.health.expend[,2]/1000000,
deflated.health.expend[11:1,4]/1000000),
density=8, col="red4", angle=135, lwd="1")
polygon(c(2004, 2004:2014, 2014),
c(0, deflated.health.expend[,4]/1000000, 0),
density=8, col="red1", lwd=1)
shadowtext(2011, .2,"Local", col="red1", bg="white", r=.3, font=1.5)
shadowtext(2008.5, .45,"State", col="red4", bg="white", r=.3, font=1.5)

 

plot 6

That’s it!!


Pulling Data Out of Census Spreadsheets Using R

In this post, I show a method for extracting small amounts of data from somewhat large Census Bureau Excel spreadsheets, using R.  The objects of interest are expenditures of state and local governments on hospital capital in Iowa for the years 2004 to 2014. The data can be found at http://www2.census.gov/govs/local/. The files at the site are yearly files.

The files to be used are those named ‘yrslsstab1a.xls’, where ‘yr’ is replaced by the two digits of the year for a given year, for example, ‘04’ or ‘11’. The individual yearly files contain data for the whole country and for all of the states, over all classes of state and local government revenue and expenditures. The task is to extract three data points from each file – state and local expenditures, state expenditures, and local expenditures – for the state of Iowa.

The structure of the files varies from year to year, so first reviewing the files is important. I found two patterns for the expenditure data – data with and data without margins of error. The program locates the columns for Iowa and the row for hospital capital expenditures. Then, the data are extracted and put in a matrix for outputting.

First, character strings of the years are created, to be used in referencing the data sets, and a data frame is created to contain the final result.

years = c(paste("0", 4:9, sep=""), paste(10:14))
hospital.capital.expend <- data.frame(NA,NA,NA)

Second, the library ‘gdata’ is loaded. The library ‘gdata’ contains functions useful for manipulating data in R and provides for reading data into R from a URL containing an Excel file.

library(gdata)

Third, a loop is run through the eleven years to fill in the ‘hospital.capital.expend’ data frame with the data from each year. The object ‘fn’ contains the URL of the Excel file for a given year. The function ‘paste’ concatenates the three parts of the URL. Note that ‘sep’ must be set to “” in the function.
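As an aside, the two-digit year strings can also be built in one step with sprintf(), and paste0() is a shorthand for paste() with ‘sep’ set to "". The sketch below is an alternative to the code in this post, not a change to it; ‘years2’ and ‘urls’ are names made up here.

```r
# Zero-padded two-digit years "04" through "14".
years2 <- sprintf("%02d", 4:14)

# One URL per year; paste0() is vectorized over years2.
urls <- paste0("http://www2.census.gov/govs/local/", years2, "slsstab1a.xls")

years2[1:3]
urls[1]
```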


for (i in 1:11)
{
fn = paste("http://www2.census.gov/govs/local/",years[i],
"slsstab1a.xls", sep="")

Next, the Excel file is read into the object ‘ex’. The function used to read the spreadsheet is ‘read.xls’ in the package ‘gdata’. The argument ‘header’ is set to ‘F’ so that all of the rows are input as data. Since all of the columns contain some character data, every column is read as character; setting ‘stringsAsFactors’ to ‘F’ keeps the columns as character strings rather than converting them to factors.


ex = read.xls(fn, sheet=1, header=F, stringsAsFactors=F)

Next, the row and column indices of the data are found using the functions ‘grepl’ and ‘which’. The first argument in ‘grepl’ is a pattern to be matched. For a data frame, the ‘grepl’ function returns a logical vector of ‘T’s and ‘F’s of length equal to the number of columns in the data frame, giving ‘T’ if the column contains the pattern and ‘F’ if not. Note that the pattern is a regular expression and can match anywhere in a string. In a regular expression, ‘*’ is a quantifier (zero or more of the preceding item) rather than a shell-style wild card, so a plain pattern such as ‘Hospital’ is sufficient. For a character vector, ‘grepl’ returns ‘T’ if an element of the vector matches the pattern and ‘F’ otherwise.

The ‘which’ function returns the indices of a logical vector which have the value ‘T’. So, ‘ssi1’ contains the index of the column containing ‘Hospital’ and ‘ssi2’ contains the index of the column containing ‘Iowa’. Since ‘ex[,ssi1]’ is a character vector instead of a data frame, the second ‘grepl’ call finds the rows containing ‘Hospital’, and ‘ssi4’ keeps the second of those row indices. For all of the eleven years, the second occurrence of ‘Hospital’ in the ‘Hospital’ column is the row for hospital expenditures.


ssi1 = which(grepl("*Hospital*", ex, ignore.case=T))
ssi2 = which(grepl("Iowa", ex, ignore.case=T))
ssi4 = which(grepl("Hospital",ex[,ssi1], ignore.case=T))[2]
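The difference between the two uses of ‘grepl’ can be tried out on a toy data frame. The data frame ‘toy’ below is invented for the illustration and is far smaller than a Census sheet, but it has a similar shape: a label column, and one column per area with the area name in the first row.

```r
toy <- data.frame(
  V1 = c("Description", "Hospitals", "Capital outlay",
         "Health", "Hospitals", "Capital outlay"),
  V2 = c("United States", "1", "2", "3", "4", "5"),
  V3 = c("Iowa", "10", "20", "30", "40", "50"),
  stringsAsFactors = FALSE
)

# On a data frame: one TRUE/FALSE per column.
grepl("Iowa", toy, ignore.case = TRUE)
col_label <- which(grepl("Hospital", toy, ignore.case = TRUE))
col_iowa  <- which(grepl("Iowa", toy, ignore.case = TRUE))

# On a character vector: one TRUE/FALSE per element; keep the second match.
row_hosp <- which(grepl("Hospital", toy[, col_label], ignore.case = TRUE))[2]

toy[row_hosp + 1, col_iowa]   # the cell one row below the second "Hospitals"
```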

Next, the data are extracted, and the temporary objects are removed. If the column index of ‘Iowa’ is less than 80, no margin of error was included and the data points are in the column of ‘Iowa’ and in the next two columns. If the column index of ‘Iowa’ is larger than 79, a margin of error was included and the data are in the column of ‘Iowa’ and the second and third columns to the right.

The capital expenditures are found one row below the ‘Hospital’ row, so one is added to ‘ssi4’ to get the correct row index. The data are put in the data frame ‘df.1’, which is row bound to the data frame ‘hospital.capital.expend’. The names of the columns in ‘df.1’ are set to ‘NA’ so that the row bind will work. Then the temporary objects are removed and the loop ends.


if (ssi2<80) ssi5=ssi2+0:2
else ssi5 = ssi2 + c(0,2,3)
df.1 = data.frame(ex[ssi4+1, ssi5], stringsAsFactors = F)
names(df.1)=c(NA,NA,NA)
hospital.capital.expend = rbind(hospital.capital.expend, df.1)
rm(fn, ex, df.1, ssi1, ssi2, ssi4, ssi5)
}
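The column choice can be written as a small function and checked on its own. ‘value_cols’ is a made-up helper, and the two column indices passed to it are arbitrary test values, not positions taken from any Census file.

```r
# Columns holding state+local, state, and local values, given Iowa's column.
value_cols <- function(iowa_col) {
  if (iowa_col < 80) {
    iowa_col + 0:2          # no margins of error: three adjacent columns
  } else {
    iowa_col + c(0, 2, 3)   # with margins of error: offsets used in the loop
  }
}

value_cols(50)
value_cols(120)
```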

There are just a few steps left to clean things up. The first row of ‘hospital.capital.expend’, which just contains ‘NA’s, is removed. Then, the commas within the numbers, as extracted from the census file, are removed from the character strings using the function ‘gsub’ and the data frame is converted to a numeric matrix. Next, the eleven years are column bound to the matrix. Last, the columns are given names and the matrix is printed out.


hospital.capital.expend = as.matrix(hospital.capital.expend[-1,])
hospital.capital.expend = matrix(as.numeric(gsub(",","",hospital.capital.expend)),ncol=3)
hospital.capital.expend = cbind(2004:2014,hospital.capital.expend)
colnames(hospital.capital.expend) = c("Year", "State.Local", "State", "Local")
print(hospital.capital.expend)
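The comma removal can be tried on a few strings. The values in ‘raw’ below are typed in for the illustration, as character strings with commas; they are not read from a file.

```r
raw <- c("125,433", "76,164", "49,269")   # character strings with commas

clean <- as.numeric(gsub(",", "", raw))   # drop the commas, then convert
m <- matrix(clean, ncol = 3)              # one row, three columns
colnames(m) <- c("State.Local", "State", "Local")
m
```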

That’s it!!!