Are Atlantic Storms Increasing?

I fit three simple linear regression models to the logarithm of (the number of storms plus one), one each for tropical storms, hurricanes, and major hurricanes, for the years 1851 to 2014.  The data can be found at

https://www.nhc.noaa.gov/climo/images/AtlanticStormTotalsTable.pdf

The results of the regressions – done in R – are below, as well as a plot of the models and data, where the fitted lines are found by taking the exponential of the fitted values and subtracting one from the result.  The blue lines are the fits and the green lines are 96.6% prediction intervals.

[Plot: tropical storms, hurricanes, and major hurricanes per year, 1851 to 2014, with fitted trends (blue) and 96.6% prediction intervals (green)]

The models fit (to the log of the number plus one) are below.  The data indicate increases of about 0.39% per year in the number of tropical storms, 0.19% per year in the number of hurricanes, and 0.47% per year in the number of major hurricanes over the period 1851 to 2014.
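As a check on those percentages, a slope on the log scale converts to a percentage change per year by exponentiating and subtracting one (using the coefficients from the model summaries below):

slopes <- c(0.003885, 0.0019150, 0.0046922)   # slopes from the three model summaries below
round(100 * (exp(slopes) - 1), 2)             # gives 0.39, 0.19, and 0.47 percent per year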

The Three Models:

Tropical Storms over time:

Call:
lm(formula = lds[[i]] ~ ds[[1]])
Residuals:
Min         1Q          Median         3Q              Max
-1.50491    -0.24411    0.02969        0.22772         0.90256
Coefficients:
            Estimate    Std. Error  t value   Pr(>|t|)
(Intercept) -5.237098    1.117248   -4.687    5.83e-06 ***
ds[[1]]      0.003885    0.000578    6.721    2.92e-10 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.3504 on 162 degrees of freedom

Multiple R-squared: 0.2181, Adjusted R-squared: 0.2132

F-statistic: 45.17 on 1 and 162 DF,   p-value: 2.922e-10

Hurricanes over time:

Call:
lm(formula = lds[[i]] ~ ds[[1]])
Residuals:
Min        1Q         Median    3Q         Max
-1.74129   -0.24892   0.01169   0.27125    0.85703
Coefficients:
            Estimate     Std. Error   t value   Pr(>|t|)
(Intercept) -1.9239955    1.3290087   -1.448   0.14964
ds[[1]]      0.0019150    0.0006875    2.785   0.00598 **
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.4168 on 162 degrees of freedom

Multiple R-squared: 0.0457, Adjusted R-squared: 0.03981

F-statistic: 7.758 on 1 and 162 DF,   p-value: 0.005984

Major Hurricanes over time:

Call:
lm(formula = lds[[i]] ~ ds[[1]])
Residuals:
Min        1Q         Median    3Q       Max
-1.2755    -0.3994    0.0403    0.3947   1.0793
Coefficients:
            Estimate   Std. Error  t value  Pr(>|t|)
(Intercept) -8.1698419 1.6829089   -4.855   2.82e-06 ***
ds[[1]]      0.0046922 0.0008706    5.390   2.45e-07 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.5278 on 162 degrees of freedom

Multiple R-squared: 0.152, Adjusted R-squared: 0.1468

F-statistic: 29.05 on 1 and 162 DF,   p-value: 2.454e-07

Comments:

It is known that storms are counted more completely now than in the 19th century, since some storms never make landfall and so would not necessarily have been observed in the 1800s or early 1900s.  However, the largest increase above is for major hurricanes – which should have been measured quite well over the full time period of the data.

The R Code:

atl.strm.plot.fun <- function(ds = atl.strm) {

  print(ds[1:10,])

  # log of the storm counts plus one, for the three storm types
  lds = log(ds[,2:4]+1)
  mod.log.ds = list(1,2,3)
  pred.ds = list(1,2,3)

  # fit the three regressions on year and find 96.6% prediction intervals
  for (i in 1:3) {
    mod.log.ds[[i]] = lm(lds[[i]]~ds[[1]])
    pred.ds[[i]] = predict(mod.log.ds[[i]], interval="predict", level=0.966)
  }

  # back-transform the fits and intervals to the count scale
  pred.ds = lapply(pred.ds, exp)
  pred.ds = lapply(pred.ds, "-", 1)

  par(mfrow=c(3,1), mar=c(4,4,2,1), oma=c(1,1,3,1))

  # plot each series with the fitted line (blue) and prediction interval (green)
  for (i in 1:3) {
    plot(ds[[1]], ds[[i+1]], ylab=colnames(ds)[i+1], xlab="Year", cex=1)
    lines(ds[[1]], pred.ds[[i]][,1], col="blue")
    lines(ds[[1]], pred.ds[[i]][,2], col="green")
    lines(ds[[1]], pred.ds[[i]][,3], col="green")
  }

  mtext("Atlantic Storms", side=3, font=1, outer=T, line=1)

  lapply(mod.log.ds, summary)

}

The First Ten Lines of the Data Set:

   Year Tropical_Storms  Hurricanes  Major_Hurricanes
1  1851       6              3             1
2  1852       5              5             1
3  1853       8              4             2
4  1854       5              3             1
5  1855       5              4             1
6  1856       6              4             2
7  1857       4              3             0
8  1858       6              6             0
9  1859       8              7             1
10 1860       7              6             1

I Am Back

Sorry for my long absence. I had some health problems. I will get back to the economic problem soon.

Storms

Here is a plot of the costs of weather disasters in the USA since 1980 – data from the NOAA website, https://www.ncdc.noaa.gov/billions/time-series . These are the disasters costing more than $1,000,000,000. I fit the exponential trend by fitting a linear trend to the log of the cost data plus 1 (since there was a year with no billion dollar storms, and the log of zero is minus infinity while the log of 1 is zero), then taking the exponential of the fitted line and subtracting 1 from the result. The costs are inflation adjusted. I find the result scary, particularly given the costs in 2005 and 2017.
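A minimal sketch of that fitting procedure, using a made-up stand-in for the cost series (not the actual NOAA data):

# 'costs' is a made-up stand-in data frame, with Cost in billion dollars
costs <- data.frame(Year = 1980:1989,
                    Cost = c(2, 0, 5, 3, 0, 7, 4, 9, 6, 12))
fit <- lm(log(Cost + 1) ~ Year, data = costs)  # linear trend on the log of the cost plus 1
trend <- exp(predict(fit)) - 1                 # back-transform to the original scale
plot(costs$Year, costs$Cost, xlab = "Year", ylab = "Billion Dollars")
lines(costs$Year, trend, col = "blue")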

The Greenhouse Effect

Here is a plot that I did last year of a time series model, where I fit temperature anomaly to sunspot numbers, carbon dioxide levels, and the ENSO index.

Here is the R output.

Call:
arima(x = temp.anomaly.7.71.to.11.13, order = c(3, 0, 0), seasonal = c(2, 0, 0),
      xreg = xmat)

Coefficients:
         ar1     ar2     ar3    sar1    sar2  intercept    enso     co2  sunspot
      0.4738  0.1751  0.0954  0.1187  0.1391    -2.9108  0.0563  0.0089    5e-04
s.e.  0.0241  0.0265  0.0242  0.0246  0.0243     0.1573  0.0074  0.0005    2e-04

sigma^2 estimated as 0.01281:  log likelihood = 1297.64,  aic = -2575.27

The temperature anomaly is the Berkeley Earth monthly global average temperature anomaly for land and ocean, http://berkeleyearth.org/data. The temperature series goes from July 1871 to November 2013.

The data for the sunspots come from the WDC-SILSO at the Royal Observatory of Belgium, Brussels, http://sidc.oma.be/silso/datafiles. I used the version of the data posted before the series was revised in July of 2015. The sunspot series goes from April 1871 to August 2013.

The data for the CO2 level are from the NOAA website. The CO2 level has been smoothed by a twelve-month moving average, and some work went into combining the two series; see https://vanwardstat.wordpress.com/2015/08/14/fitting-sunspot-numbers-and-co2-concentrations-to-temperature-anomalies/ The data go from July 1871 to November 2013.

The index for the ENSO, at NOAA, has been averaged over the years for which the two data sets overlap, and the resulting series has been shifted to the right by one half month using the average of two consecutive values. The data are at https://www.esrl.noaa.gov/psd/enso/mei.ext/table.ext.html and https://www.esrl.noaa.gov/psd/enso/mei/table.html. The seasonal component of the ENSO index was removed with a twelve-month moving average. The data run from July 1871 to November 2013.
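A hedged sketch of how a model of this form might be fit – the series below are made-up stand-ins, not the data described above; in practice the real series would be assembled and aligned first:

# Sketch only: made-up monthly series, given frequency 12 so the seasonal AR terms make sense.
set.seed(1)
n <- 240
enso    <- rnorm(n)
co2     <- 290 + (1:n) / 12
sunspot <- abs(rnorm(n, 50, 20))
temp.anomaly <- ts(0.009 * co2 + 0.05 * enso + 5e-04 * sunspot +
                   0.1 * as.numeric(arima.sim(list(ar = 0.5), n)),
                   start = c(1871, 7), frequency = 12)

xmat <- cbind(enso = enso, co2 = co2, sunspot = sunspot)   # regressors for xreg
fit  <- arima(temp.anomaly, order = c(3, 0, 0), seasonal = c(2, 0, 0), xreg = xmat)
fit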

This is a quick and dirty regression. That the model fits so well is remarkable. Maybe someone who is getting paid for the work will do a more careful job.

Wages and Inflation 1964 to 2017

Many years ago, in the mid 1970's, I was living through a time of high inflation and energy price shocks.  At that time, I thought that the energy shocks were causing real wage income to decrease, by way of inflation, and that the distribution of wealth was shifting in such a way that a person's control over their own income picked the winners and losers.

I did some research on the topic for my creative component at Iowa State in 1985 – for my master's degree – but the subject has lain fallow in my mind since then, with only occasional forays into the data.  This is the beginning of looking at the problem again. Below are the CPI-All Urban Consumers (Current Series), https://data.bls.gov/timeseries/CUUR0000SA0 , and three sets of data from https://www.bls.gov/webapps/legacy/cesbtab8.htm – bullet list 8 – giving average hourly wages for all non-agricultural private workers, goods producers, and service providers.  All the data sets contain monthly figures from January 1964 to October 2017 and are on the Bureau of Labor Statistics website.
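The adjustment itself is simple: deflate the nominal hourly wage by the CPI relative to a base month. A small sketch with made-up numbers (not the BLS series):

cpi  <- c(31.0, 31.5, 32.0, 32.9)   # made-up CPI-U values for four months
wage <- c(2.50, 2.55, 2.60, 2.62)   # made-up average hourly earnings, same months
real.wage <- wage * cpi[1] / cpi    # real wages in first-month dollars
round(real.wage, 2)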

[Plot: inflation and average adjusted wages, 1964 to 2017]

From the plots, real income was increasing strongly up until around the 1973 Oil Embargo, then dropped, rose, and fell until the mid to late 1990's, and has been on an upward trend since then, except for the last few years.  Also, goods producers make more money than service providers, on average.

More on this next month.

R Plots Using plot()

In this post, an example of creating multiple plots on a page, using the plot() function in R, is presented. Four plots are generated using data from some government websites, including the data found in the blog of last month.

To start, the library 'TeachingDemos' is loaded. Then the layout for the plots is entered.

library("TeachingTools")
layout(matrix(c(1,2,3,4),2,2, byrow=T))

'TeachingDemos' contains the function shadowtext(), which is used in the third and fourth plots. In layout(), the matrix() function creates a two by two matrix with 1 and 2 in the first row and 3 and 4 in the second row. By default, matrix() reads data into the matrix by columns rather than by rows, so the argument 'byrow=T' is necessary to read the data in by rows; see the quick check below. The four plots will be created one at a time. The first plot will be put into the location with value 1, the second with value 2, and so forth.
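For example, the two ways of filling the matrix give:

matrix(c(1, 2, 3, 4), 2, 2)             # default: filled by columns
#      [,1] [,2]
# [1,]    1    3
# [2,]    2    4
matrix(c(1, 2, 3, 4), 2, 2, byrow = T)  # filled by rows, as used in layout()
#      [,1] [,2]
# [1,]    1    2
# [2,]    3    4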

The first plot is of bankruptcy data, from the US courts website http://www.uscourts.gov/statistics-reports/caseload-statistics-data-tables, and gives types of bankruptcies in Iowa from 2001 to 2015. Most bankruptcies caused by medical expenses are chapter 7 nonbusiness bankruptcies, and most chapter 7 nonbusiness bankruptcies have medical expenses as a contributing factor. The data is given first.

> bankruptcies_ia
   Year overalltotal total7 total11 total12 total13 bustotal bus7 bus11
1  2001        11076  10459      28       7     582      289  243    28
2  2002        11808  11186      26       9     587      354  309    25
3  2003        12582  11895      28      12     647      323  264    27
4  2004        13082  12430      23       1     628      360  324    23
5  2005        18709  17771      18       6     914      455  412    18
6  2006         4891   4316      11       3     561      208  182     9
7  2007         7036   6275      11       6     744      243  213    11
8  2008         8125   7383      14       2     726      342  317    14
9  2009        10171   9354      21       5     789      384  351    21
10 2010         9829   9013      36      10     770      381  329    33
11 2011         7965   7231      24      10     700      356  314    24
12 2012         6411   5803      19       9     578      255  223    17
13 2013         5747   5240      10       4     493      230  209    10
14 2014         5079   4607      20       4     448      170  139    19
15 2015         4535   4054       8      14     459      187  154     8
   bus12 bus13 nonbustotal nonbus7 nonbus11 nonbus13
1      7    11       10787   10216        0      571
2      9    11       11454   10877        1      576
3     12    20       12259   11631        1      627
4      1    12       12722   12106        0      616
5      6    19       18254   17359        0      895
6      3    14        4683    4134        2      547
7      6    13        6793    6062        0      731
8      2     9        7783    7066        0      717
9      5     5        9787    9003        0      784
10    10     9        9448    8684        3      761
11    10     8        7609    6917        0      692
12     9     5        6156    5580        2      573
13     4     7        5517    5031        0      486
14     4     8        4909    4468        1      440
15    14    11        4348    3900        0      448

Next the plot is created. In the function plot(), the first argument is the vector of x values and the second the vector of y values. In the object 'bankruptcies_ia', the variable 'Year' is in the first column and chapter 7 nonbusiness bankruptcies are in the thirteenth column. The bankruptcy numbers are divided by 1000 for clarity in the y axis labels. The type of plot is set to a line plot with the argument 'type="l"'. The x and y labels are set by 'xlab' and 'ylab'. The argument 'main' gives the heading of the plot; by putting the heading on two lines in the call, the heading plots on two lines in the plot. The color and width of the plotted line are given by 'col="red4"' and 'lwd="2"'. The color of the main heading is given by 'col.main="darkred"'. The colors of the labels and axes are also set to "darkred". The box around the plot and the tick marks are set to grey with the argument 'fg="grey"'. The font for the heading is set to italic with the argument 'font.main=3'. The limits of the y axis are set to zero and twenty with 'ylim=c(0,20)'.

plot(bankruptcies_ia[,1], bankruptcies_ia[,13]/1000, type="l",
xlab= "Year",
ylab= "Thousand", main = "Number of Chapter 7 Nonbusiness
Bankruptcies in IA 2001 to 2015", col="red4", lwd="2",
col.main="darkred", col.lab="darkred", col.axis="darkred",
fg="grey", font.main=3, ylim=c(0,20))

The plot is given below. Note, the plot is very plain.

[Plot 1: Chapter 7 nonbusiness bankruptcies in IA, 2001 to 2015, line only]

To add interest to the plot, the area below the line is filled with a polygon of diagonal lines, done with a call to the function polygon(). In R, new plotting commands continue to add to a plot until a new plot() function is called (unless R is told not to refresh on a new plot() call). The first argument to polygon() is the vector of x values and the second argument is the vector of y values, giving the vertices of the polygon. In the bankruptcy plot, the polygon starts and ends at the lower left corner. The 'density' argument gives the density of the diagonal lines. The color of the lines is set to "red4" and the line width to 1.

polygon(c(2001,2001:2015, 2015),
c(0, bankruptcies_ia[,13]/1000, 0),
density=8, col="red4", lwd=1)

The plot is given below. Note, the plot is easier to evaluate now.

[Plot 2: the same plot, with the area under the line filled with diagonal lines]

The second plot is of property taxes in Iowa for the years 2004 to 2014. The source of the data is at https://www2.census.gov/govs/local/. See the previous blog post for more information on the individual, yearly files. There is nothing new in the code for the second plot, so the code is given without comment, starting with the data.

deflated.property.tax
[1] 3566090 3573975 3557993 3696317 3720224 4043835 4167853 4275320
[9] 4302292 4341761 4344914

plot(2004:2014, deflated.property.tax/1000000, ylim=c(0,6),
ylab="Billion Dollars", xlab="Year", col="red4",
lwd="2", main="Total IA Property Taxes\n2004 to 2014",
col.main="darkred",
col.lab="darkred", col.axis="darkred",
font.main=3, fg="grey", type="l")
polygon(c(2004,2004:2014, 2014),
c(0, deflated.property.tax/1000000, 0),
density=8, col="red4", lwd=1)

The third plot is of non-capital expense spending on hospitals in Iowa from 2004 to 2014. The source of the data is the same as for the previous plot.

The plot is a little more complex. The call to plot() is straightforward. However, the call to plot() is followed by a call to lines(), which plots a second line, and then by two calls to polygon(), one for each of the two polygons. The data is given below.

> deflated.hospital.expend
year sl s l
[1,] 2004 1644823 734475.9 910347.6
[2,] 2005 1720450 725427.8 995022.7
[3,] 2006 1816380 805646.4 1010733.1
[4,] 2007 1896899 818614.6 1078284.9
[5,] 2008 2040809 854933.6 1185875.8
[6,] 2009 2311303 956767.1 1354536.3
[7,] 2010 2365028 955301.7 1409726.0
[8,] 2011 2470684 1030070.0 1440613.7
[9,] 2012 2643324 1254339.0 1388984.7
[10,] 2013 2676588 1270130.9 1406456.6
[11,] 2014 2796166 1342094.1 1454072.3
>
>
> deflated.hospital.capital.expend
year sl s l
[1,] 2004 125433 76164 49269
[2,] 2005 155842 76703 79139
[3,] 2006 128063 73105 54958
[4,] 2007 170249 96844 73405
[5,] 2008 172931 86148 86783
[6,] 2009 226999 103459 123540
[7,] 2010 223355 59911 163444
[8,] 2011 254702 87990 166712
[9,] 2012 317178 140569 176609
[10,] 2013 315689 126562 189127
[11,] 2014 322188 146256 175932
>

First, two lines, one for state plus local and one for local, are plotted. See the plot below the code.

plot(2004:2014, deflated.hospital.expend[,2]/1000000 -
deflated.hospital.capital.expend[,2]/1000000,
ylim=c(0,3.5), ylab="Billion Dollars", xlab="Year",
col=c("red4"), col.main="darkred",
col.lab="darkred", col.axis="darkred",
main="IA Government Hospital Expenditures\n2004 to 2014",
lwd="2", font.main=3, type="l", fg="grey")
lines(2004:2014, deflated.hospital.expend[,4]/1000000-
deflated.hospital.capital.expend[,4]/1000000, lwd="2",
col=c("red1"))

[Plot 3: IA government hospital expenditures, the two lines only]

Next the two polygons are plotted. The first call to polygon() plots the top polygon. Note that the angle of the lines is 135 degrees rather than the default 45 degrees used in the first two plots. The second call to polygon() plots the bottom polygon. For the second polygon, the angle is 45 degrees and the color is 'red1' rather than 'red4'. By plotting the lower polygon second, the color of the lower line remains 'red1'.

polygon(c(2004,2004:2014, 2014:2004),
c(deflated.hospital.expend[1,4]/1000000-
deflated.hospital.capital.expend[1,4]/1000000,
deflated.hospital.expend[,2]/1000000-
deflated.hospital.capital.expend[,2]/1000000,
deflated.hospital.expend[11:1,4]/1000000-
deflated.hospital.capital.expend[11:1,4]/1000000),
density=8, col="red4", lwd=1, angle=135)
polygon(c(2004,2004:2014, 2014),
c(0, deflated.hospital.expend[,4]/1000000-
deflated.hospital.capital.expend[,4]/1000000,
0), density=8, col="red1", lwd=1)

[Plot 4: the hospital expenditure plot with both polygons added]

Last, some text is added to the plot. The function shadowtext() puts a shadow around text and is found in the package 'TeachingDemos'. The first argument of shadowtext() is the placement of the text on the x axis, the second the placement on the y axis, and the third the text itself. The argument 'bg="white"' sets a white shadow. The argument 'r=.3' sets the size of the shadow as a proportion of the size of the text. The function text() is similar to shadowtext(), except there is no shadow argument. See the plot below the code.

shadowtext(2011, .75,"Local", col="red1", bg="white", r=.3, font=1.5)
shadowtext(2009, 1.7,"State", col="red4", bg="white", r=.3, font=1.5)
text(2007.5, 2.9,"Minus Capital Outlays", col="darkblue", font=1.5)

[Plot 5: the hospital expenditure plot with the text labels added]

The fourth plot is of health expenditures by the government in Iowa.  The code does not use anything new, so no comments are made on the code. The data is from the same source as the last two plots and is given below.

> deflated.health.expend
Year StateLocal State Local
[1,] 2004 416871.7 104597.30 312274.4
[2,] 2005 418761.0 98568.17 320192.9
[3,] 2006 466162.7 105526.68 360636.0
[4,] 2007 488350.4 130628.21 357722.2
[5,] 2008 509933.5 135971.63 373961.9
[6,] 2009 538308.7 145292.52 393016.1
[7,] 2010 530603.7 135314.36 395289.3
[8,] 2011 525645.7 129333.71 396312.0
[9,] 2012 568904.7 132340.68 436564.1
[10,] 2013 410256.4 131329.61 278926.8
[11,] 2014 391885.7 124200.98 267684.7
>

The code follows.  Below the code is the final figure, with the four plots.


plot(2004:2014, deflated.health.expend[,2]/1000000, ylim=c(0,.8),
ylab="Billion Dollars", xlab="Year",
col="red4", font.main=3, fg="grey",
col.main="darkred",
col.lab="darkred", col.axis="darkred", type="l",
main="IA Government Health Expenditures\n2004 to 2014",
lwd="2")
lines(2004:2014, deflated.health.expend[,4]/1000000, lwd="2", col="red1")
polygon(c(2004,2004:2014, 2014:2004),
c(deflated.health.expend[1,4]/1000000,
deflated.health.expend[,2]/1000000,
deflated.health.expend[11:1,4]/1000000),
density=8, col="red4", angle=135, lwd="1")
polygon(c(2004, 2004:2014, 2014),
c(0, deflated.health.expend[,4]/1000000, 0),
density=8, col="red1", lwd=1)
shadowtext(2011, .2,"Local", col="red1", bg="white", r=.3, font=1.5)
shadowtext(2008.5, .45,"State", col="red4", bg="white", r=.3, font=1.5)

[Plot 6: the final figure, with all four plots on one page]

That’s it!!

Pulling Data Out of Census Spreadsheets Using R

In this post, I show a method for extracting small amounts of data from somewhat large Census Bureau Excel spreadsheets, using R.  The objects of interest are expenditures of state and local governments on hospital capital in Iowa for the years 2004 to 2014. The data can be found at http://www2.census.gov/govs/local/. The files at the site are yearly files.

The files to be used are those named 'yrslsstab1a.xls', where 'yr' is replaced by the two digits of the year, for example, '04' or '11'. The individual yearly files contain data for the whole country and for all of the states, over all classes of state and local government revenue and expenditure. The task is to extract three data points from each file – state and local expenditures, state expenditures, and local expenditures – for the state of Iowa.

The structure of the files varies from year to year, so first reviewing the files is important. I found two patterns for the expenditure data – data with and data without margins of error. The program locates the columns for Iowa and the row for hospital capital expenditures. Then, the data are extracted and put in a matrix for outputting.

First, character strings of the years are created, to be used in referencing the data sets, and a data frame is created to contain the final result.

years = c(paste("0", 4:9, sep=""), paste(10:14))
hospital.capital.expend <- data.frame(NA,NA,NA)

Second, the library 'gdata' is loaded. The library 'gdata' contains functions useful for manipulating data in R and provides for reading an Excel file into R directly from a URL.

library(gdata)

Third, a loop is run through the eleven years to fill in the ‘hospital.capital.expend’ data frame with the data from each year. The object ‘fn’ contains the URL of the Excel file for a given year. The function ‘paste’ concatenates the three parts of the URL. Note that ‘sep’ must be set to “” in the function.


for (i in 1:11)
{
fn = paste("http://www2.census.gov/govs/local/",years[i],
"slsstab1a.xls", sep="")

Next, the Excel file is read into the object 'ex'. The argument 'header' is set to 'F' so that all of the rows are input. Also, since all of the columns contain some character data, the strings are kept as character, rather than converted to factors, by setting 'stringsAsFactors' to 'F'.  The function used to read the spreadsheet is 'read.xls' in the package 'gdata'.


ex = read.xls(fn, sheet=1, header=F, stringsAsFactors=F)

Next, the row and column indices of the data are found using the functions 'grepl' and 'which'. The first argument of 'grepl' is a pattern to be matched. For a data frame, the 'grepl' function returns a logical vector of 'T's and 'F's of length equal to the number of columns in the data frame – giving 'T' if the column contains the pattern and 'F' if not. Note that 'grepl' uses regular expressions and matches the pattern anywhere in the string, so the asterisks in "*Hospital*" are not acting as wild cards (in a regular expression the wild card is '.*'); the pattern "Hospital" would work as well.  For a character vector, 'grepl' returns 'T' if an element of the vector matches the pattern and 'F' otherwise.

The 'which' function returns the indices of a logical vector which have the value 'T'. So, 'ssi1' contains the index of the column containing 'Hospital' and 'ssi2' contains the index of the column containing 'Iowa'. The object 'ssi4' contains the row index of the second occurrence of 'Hospital' in that column – here 'ex[,ssi1]' is a character vector rather than a data frame, so 'grepl' works element by element.   For all of the eleven years, the second occurrence of 'Hospital' in the 'Hospital' column is the row containing hospital expenditures.


ssi1 = which(grepl("*Hospital*", ex, ignore.case=T))
ssi2 = which(grepl("Iowa", ex, ignore.case=T))
ssi4 = which(grepl("Hospital",ex[,ssi1], ignore.case=T))[2]

Next, the data are extracted, and the temporary objects are removed. If the column index of 'Iowa' is less than 80, no margin of error was included and the data points are in the column of 'Iowa' and in the next two columns. If the column index of 'Iowa' is larger than 79, a margin of error was included and the data are in the column of 'Iowa' and the second and third columns to the right.

The capital expenditures are found one row below the ‘Hospital’ row, so one is added to ‘ssi4’ to get the correct row index. The data are put in the data frame ‘df.1’ which is row bound to the data frame ‘hospital.capital.expend’. The names of the columns in ‘df.1’ are set to ‘NA’ so that the row bind will work.  Then the temporary files are removed and the loop ends.


if (ssi2<80) ssi5=ssi2+0:2
else ssi5 = ssi2 + c(0,2,3)
df.1 = data.frame(ex[ssi4+1, ssi5], stringsAsFactors = F)
names(df.1)=c(NA,NA,NA)
hospital.capital.expend = rbind(hospital.capital.expend, df.1)
rm(fn, ex, df.1, ssi1, ssi2, ssi4, ssi5)
}

There are just a few steps left to clean things up. The first row of ‘hospital.capital.expend’, which just contains ‘NA’s, is removed. Then, the commas within the numbers, as extracted from the census file, are removed from the character strings using the function ‘gsub’ and the data frame is converted to a numeric matrix. Next, the eleven years are column bound to the matrix. Last, the columns are given names and the matrix is printed out.


hospital.capital.expend = as.matrix(hospital.capital.expend[-1,])
hospital.capital.expend = matrix(as.numeric(gsub(",","",hospital.capital.expend)),ncol=3)
hospital.capital.expend = cbind(2004:2014,hospital.capital.expend)
colnames(hospital.capital.expend) = c("Year", "State.Local", "State", "Local")
print(hospital.capital.expend)

That’s it!!!

Multiple Correspondence Analysis: A Political Example


Correspondence analysis and multiple correspondence analysis are techniques from multivariate analysis. You can use the techniques to find clusters in a data set. Correspondence and multiple correspondence analysis are similar to principal component analysis, in that the analysis attempts to reduce the dimensions (number of columns or rows) of a set of intercorrelated variables so that the reduced set of variables explains most of the variation in the original variables. However, correspondence and multiple correspondence analysis are for categorical variables rather than the numerical variables of principal component analysis. Correspondence analysis was developed by Jean-Paul Benzécri and measures similarities of patterns in contingency tables.

The Mathematics Behind Correspondence Analysis

Correspondence analysis is used in the analysis of just two categorical variables. In correspondence analysis, the reduced variables are found by applying singular value decomposition to a transformation of the contingency table created from the two original variables. The transformation replaces the value in each cell of the contingency table by the original value minus the product of the row total and the column total divided by the overall total, with the difference divided by the square root of the product of the row total and the column total. The resulting cells then contain the signed square roots of the terms used in the calculation of the chi square test for independence for a contingency table, divided by the square root of the overall total.
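A small numerical sketch of that transformation, using a made-up two-by-three contingency table (the svd() step is what correspondence analysis then works from):

# Made-up contingency table; the numbers are illustrative only.
N <- matrix(c(20, 10,  5,
              15, 25, 10), nrow = 2, byrow = TRUE)
n <- sum(N)                        # overall total
P <- N / n                         # cell proportions
E <- rowSums(P) %o% colSums(P)     # expected proportions under independence
S <- (P - E) / sqrt(E)             # signed square roots of the chi-square terms, divided by sqrt(n)
decomp <- svd(S)                   # singular value decomposition used by correspondence analysis
decomp$d                           # singular values; squared, they give the principal inertias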

The Mathematics Behind Multiple Correspondence Analysis

For multiple correspondence analysis, more than two categorical variables are reduced. The reduced set of variables is found by applying correspondence analysis to one of two matrices. The first matrix is made up of observations in the rows and indicator variables in the columns, where an indicator variable takes on the value one if the observation has the quality measured by the variable and zero if it does not. For example, say there are three variables: 'gender', 'hair color', and 'skin tone'. Say that the categories for gender are 'female', 'male', and 'prefer not to answer'; for hair color, 'red', 'blond', 'brown', and 'black'; and for skin tone, 'light', 'medium', and 'dark'. Then there would be three columns associated with gender and, for a given person, only one of the three would contain a one and the others zeros; four columns associated with hair color, of which only one would contain a one; and three columns associated with skin tone, of which only one would contain a one.

The second type of matrix is a Burt table. A Burt table is made up of all of the pairwise contingency tables between the variables, arranged in blocks. From the example above, the first block is the contingency table of gender by gender, which is a diagonal matrix with the counts of females, males, and those who preferred not to answer on the diagonal. The second block, going horizontally, is the contingency table of gender by hair color. The third block, going horizontally, is the contingency table of gender by skin tone. The second block, going vertically, is the contingency table of hair color by gender. The rest of the blocks are found similarly.
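A short sketch, using made-up data for the example above, of building the indicator matrix and the Burt table:

# Made-up data for the gender / hair color / skin tone example above.
people <- data.frame(
  gender = c("female", "male", "male", "prefer not to answer"),
  hair   = c("red", "blond", "brown", "black"),
  skin   = c("light", "medium", "dark", "light")
)

# Indicator matrix: one 0/1 column for each level of each variable.
Z <- do.call(cbind, lapply(names(people), function(nm) {
  v <- factor(people[[nm]])
  m <- outer(as.character(v), levels(v), `==`) * 1
  colnames(m) <- paste(nm, levels(v), sep = ".")
  m
}))

# Burt table: all of the pairwise contingency tables, in blocks.
B <- t(Z) %*% Z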

Plotting for Clustering

Once the singular value decomposition is done and the reduced variables are found, the variables are usually plotted to look for clustering of attributes. (For the above example, some of the attributes are brown hair, male, red hair, light skin tone, each of which would be one point on the plot.) Usually just the first two dimensions of the reduced matrix are plotted, though more dimensions can be plotted. The dimensions are ordered with respect to how much of the variation in the input matrix the dimension explains, so the first two dimensions are the dimensions that explain the most variation. With correspondence analysis, the reduced dimensions are with respect to the contingency table. Both the attributes of the rows and the attributes of the columns are plotted on the same plot. For multiple correspondence analysis, the reduced dimensions are with respect to the matrix used in the calculation. If one uses the indicator variable matrix, one can plot just the attributes for the columns, or one can also plot labels for the rows on the same plot (or plot just the labels for the rows). If one uses the Burt table, one can only plot attributes for the columns. For multiple correspondence analysis, the plots for the columns are the same by either method, though the scaling may be different.

Interpreting the Plot

The interpretation of the row and column variables in correspondence analysis is done separately. Row variables are compared as to level down columns and column variables are compared as to level across rows. However, if a row variable is near a column variable in the plot, then both are represented at similar relative levels in the contingency table.

The interpretation of the relationship between the variables in the indicator or Burt table is a bit subtle. Within an attribute, the levels of the attribute are compared to each other on the plot. Between attributes, points that are close together are seen at similar levels.

An Example Using the Deficit by Political Party Data Set

Below is a multiple correspondence plot for which only the column reduction is plotted. We look at the size of the deficit (-) / surplus (+) using the political affiliation of the President and the controlling parties of the Senate and the House of Representatives. The size of the on budget deficit (-) / surplus (+) as a percentage of gross domestic product for the years 1947 to 2008 was classified into four classes. The largest deficit over the years was -6.04 percent and the largest surplus was 4.11 percent. The class breaks were -6.2, -4, -2, 0, and 4.2, which gives four classes. (There was only one observation with a surplus greater than 2, so years of surplus were all classed together.) Each of the President, Senate, and House variables was classified into two classes, either Democrat or Republican. Since the budget is created in the year before the budget is in effect, the deficit (-) / surplus (+) percentages are associated with the political parties in power the year before the end of the budget year.

[Plot: multiple correspondence analysis of the deficit / surplus as a percentage of GDP by party of President, Senate, and House]

The first principal axis appears to measure the distance between the relative counts for the parties of the Senate and House. On the plot, the greatest distances on the first principal axis are between the Republican and Democratic Senates and the Republican and Democratic Houses. Looking at the tables below, the lowest counts were for the Republican Senates and Houses, which means the highest counts were for the Democratic Senates and Houses. For the deficit classes, class (0,4.2] is the smallest class in the table and, like the Republican Senates and Houses, is to the left on the plot. Still, there is not much difference between the deficit classes on the first principal axis (or between the parties of the presidents).

The second principal axis appears to show how the differing levels of deficit (or surplus) are associated with the political parties of the presidents, senates, and houses. Looking at the senates and houses, there is not a lot of difference between the Democrats and the Republicans on the second principal axis, both cluster around the deficit class (-4,-2]. For the presidents, however, there is a great difference. Democratic presidents are near the deficit classes (-2,0] and (0,4.2] on the second principal axis while Republican presidents are between the deficit classes (-4,-2] and (-6.2,-4].

Below are two classifications of the data.

Deficit Classes
  (-6.2,-4]  (-4,-2]  (-2,0]  (0,4.2]
         11       22      21        8

Political Parties
  Party        President  Senate  House
  Democrat            27      42     48
  Republican          35      20     14
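For reference, a hedged sketch of how such an analysis might be run in R, using the mca() function in the 'MASS' package – the data below are made-up stand-ins, not the actual deficit series:

# Sketch only: made-up stand-in data, not the actual 1947 to 2008 series.
library(MASS)
set.seed(1)
n <- 62
party <- c("Democrat", "Republican")
parties <- data.frame(
  deficit   = cut(runif(n, -6.0, 4.1), breaks = c(-6.2, -4, -2, 0, 4.2)),
  president = factor(sample(party, n, replace = TRUE)),
  senate    = factor(sample(party, n, replace = TRUE)),
  house     = factor(sample(party, n, replace = TRUE))
)

fit <- mca(parties, nf = 2)   # multiple correspondence analysis, first two dimensions
plot(fit)                     # plots the row and category points on the first two axes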

In this example, we have treated the deficit, which is a numeric variable, as a categorical variable. In the post before this post, an example of principal component analysis, the same data is analyzed, with all of the variables treated as numeric rather than categorical.

Principal Component Analysis: An Example

Principal component analysis is a technique from multivariate analysis for data sets with numeric (as opposed to categorical) variables. If you are interested in how the variables in a data set relate to one another, or you are dealing with multicollinearity, you would probably find principal component analysis useful. According to the Wikipedia entry on principal component analysis, principal component analysis was invented by Karl Pearson in 1901. The purpose of principal component analysis is to reduce the number of dimensions of a set of measurements.

What Principal Component Analysis Involves

Say we have a data set of four measurements on 62 observations. The data set then has four dimensions in the columns. Using principal component analysis, we can find four vectors composed of weighted linear combinations of the original four measurements which contain the same information as the original four measurement vectors, but which are mutually orthogonal. (If a set of vectors is orthogonal, then the cross-product matrix of the vectors is diagonal.) Usually, the first few linear combinations explain most of the variation in the original data set. Say, in our example, that the first two linear combinations explain 85% of the variation in the data set. Then, we could approximate the original data using the first two vectors. The vectors are called either the first two principal components or the scores of the first two principal components. The weights used to find the scores are called the loadings of the principal components.

The principal components can be used in regression analysis. In the example above, the first two principal components could be used in place of the four original measurements to get rid of problems from multicollinearity. The principal components can also be used in cluster analysis. By plotting the first two scores against each other and labeling the points with the row labels, one can see if different observations cluster together. By plotting the first two loadings against each other, we can see if there is clustering within the measurement variables. The principal components can be used to see underlying relationships within the data. The loadings can have meaning in terms of understanding the data.

The Mathematics Behind Principal Component Analysis

Given a data set, the loadings are the eigenvectors of either the variance-covariance matrix or the correlation matrix of the data set. The two methods do not give the same result. Usually, the correlation matrix is used because in the correlation matrix the variables are normalized for size and spread. The scores are found by multiplying the original data – the normalized data if the correlation matrix is used – on the right by the eigenvectors. In the matrix of eigenvectors, the eigenvectors are in the columns.

Eigenvalues have an important role in principal component analysis. Eigenvectors are sorted by the sizes of the corresponding eigenvalues, from largest to smallest. For a correlation matrix, an eigenvalue is proportional to the proportion of the variation in the data set explained by the principal component associated with the eigenvalue. For a full rank correlation matrix, in most programs, the sum of the eigenvalues equals the dimension of the correlation matrix, which equals the number of columns in the data set. The magnitudes of the elements of the eigenvectors are relative with respect to the magnitudes of the eigenvalues; the magnitudes of either, separately, are not determinate. In the above explanation, we assume the eigenvectors are normalized such that the inner product of each eigenvector with itself equals one, which is usual.
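A short sketch of that calculation on a small made-up data set, using eigen() on the correlation matrix (in practice prcomp() or princomp() would usually be used):

# Sketch: principal components via the correlation matrix of made-up data.
set.seed(1)
X <- matrix(rnorm(62 * 4), ncol = 4)     # 62 observations on 4 made-up measurements
X[, 2] <- X[, 1] + rnorm(62, 0, 0.5)     # make two of the columns correlated

R <- cor(X)                              # correlation matrix
e <- eigen(R)                            # eigenvalues and eigenvectors (the loadings)
round(e$values / sum(e$values), 3)       # proportion of variation explained by each component
scores <- scale(X) %*% e$vectors         # scores: normalized data times the eigenvectors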

An Example Using the Deficit by Political Party Data Set

We will use the data in The Deficit by Political Party Data Set* to demonstrate the power of principal component analysis as a clustering tool. We look at the size of the on budget deficit (-) / surplus (+) as a percentage of the gross domestic product using the political affiliation of the President and the controlling parties of the Senate and the House of Representatives. Since the budget is created in the year before the budget is in effect, the deficit (-) / surplus (+) percentages are associated with the political parties in power the year before the end of the budget year. We will only look at the years between the extreme deficits of the World War II years and the current extreme deficits.

In order to code the categorical values for political party numerically, we let Democrats take on the value '1' and Republicans take on the value '-1'. Our data set has four variables and 62 observations. The variables are the deficit (-) / surplus (+) as a percentage of the gross domestic product for the years 1947 to 2008, the political party of the President for the years 1946 to 2007, the political party controlling the Senate for the years 1946 to 2007, and the political party controlling the House for the years 1946 to 2007. Below are two plots of the four variable dimensions projected onto the two dimensional plane of the first two eigenvectors, each eigenvector multiplied by the square root of its eigenvalue, one plot for the covariance method of calculation and the other for the correlation method.

[Plot: first two eigenvectors plotted against each other, with and without normalization]

Normally, in principal component analysis, a biplot is created containing plots of both the loadings and the scores. Because three of our variables are dichotomous, the plots of the scores are not very interesting and I have not included the plots. For the covariance method, the first two principal components account for 86% of the variation, and for the correlation method, the first two principal components account for 79% of the variation.

We can see that the parties of the Presidents and the deficits are strongly correlated with each other in either plot, while the Senates and Houses are more weakly correlated with each other in the correlation plot but strongly correlated in the covariance plot. The deficits and the parties of the Presidents do not seem to be strongly correlated with the parties of the Senates or Houses in either plot. In the covariance plot, the lengths of the vectors differ much more than in the correlation plot, since the measurements are not standardized. Basically, the two plots give much the same information.

Below is a table of the eigenvectors and eigenvalues for the correlation method.

Eigenvectors and Eigenvalues for the Correlation Method
variable      eigenvector 1  eigenvector 2  eigenvector 3  eigenvector 4
deficit               0.380          0.583          0.712          0.090
president             0.378          0.587         -0.701          0.145
senate               -0.539          0.487         -0.024         -0.687
house                -0.650          0.279          0.029          0.706
eigenvalue            1.683          1.462          0.506          0.350

We can see from the table that the first eigenvector is a contrast between the deficit / President and the Senate / House.  Democrats have a positive value and Republicans a negative value in the normalized matrix of measurements. So, for the deficit and the party of the President, deficits that are positive in the normalized data are positively associated with Democratic presidents and deficits that are negative in the normalized data are positively associated with Republican presidents. For the House and Senate, Republican Senates are associated with Republican Houses and Democratic Senates are associated with Democratic Houses. Together, positive normalized budget deficits and Democratic presidents align with Republican Senates and Houses and contrast with Democratic Senates and Houses.

The second eigenvector lumps together, weakly, the deficit, President, Senate, and House. Here, Democrats at all levels are aligned with positive values of the normalized deficit, and Republicans at all levels with negative values.  Also, Democrats are aligned with Democrats and Republicans with Republicans.

The third eigenvector is mainly a contrast between the deficit and the party of the President. Positive normalized deficits are contrasted with Democratic presidents and aligned with Republican presidents.

The fourth eigenvector is mainly a contrast between the parties of the Senate and House. Democratic Senates are contrasted with Democratic Houses and aligned with Republican Houses and vice versa. The last two eigenvectors only account for a small portion of the variation.

*The Deficit by Political Party Data Set

The deficit by political party data set contains data on the total and on budget deficits, the gross domestic product, and the political parties controlling the presidency, the Senate, and the House for the years 1940 to 2011. The data for the deficits and the gross domestic product are from the Office of Management and Budget and can be found at http://www.whitehouse.gov/omb/budget/Historicals. Before 1978, the budget year ran from July 1st to June 30th. From 1978 onward, the budget year has run from October 1st to September 30th. The tables contain a one quarter correction between the years 1977 and 1978, which I have ignored. The data for 2011 are estimated. The year with which the deficit and gross domestic product are associated is the year at the end of the budget year for which the deficit and gross domestic product are calculated.

The data for the controlling political parties of the president, senate, and house were taken from the website of Dave Manuel, http://www.davemanuel.com/history-of-deficits-and-surpluses-in-the-united-states.php, a good resource on the history of the deficit.