Principal Component Analysis: An Example

Principal component analysis is a technique from multivariate analysis for data sets with numeric (as opposed to categorical) variables. If you are interested in how the variables in a data set relate to one another, or you are dealing with multicollinearity, you would probably find principal component analysis useful. According to the Wikipedia entry on principal component analysis, the technique was invented by Karl Pearson in 1901. The purpose of principal component analysis is to reduce the number of dimensions of a set of measurements.

What Principal Component Analysis Involves

Say we have a data set of four measurements on 62 observations. The data set then has four dimensions in the columns. Using principal component analysis, we can find four vectors, each a weighted linear combination of the original four measurements, which contain the same information as the original four measurement vectors but which are mutually orthogonal. (If a set of vectors is mutually orthogonal, then the cross-product matrix of the vectors is diagonal.) Usually, the first few linear combinations explain most of the variation in the original data set. Say, in our example, that the first two linear combinations explain 85% of the variation in the data set. Then, we could approximate the original data using the first two vectors. The vectors are called either the first two principal components or the scores of the first two principal components. The weights used to find the scores are called the loadings of the principal components.
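
As a minimal sketch of these ideas in R, assuming simulated data in place of a real data set (the column names m1 to m4 and the random seed are illustrative and are not from the data used later in this post), prcomp() returns the scores and the loadings, and the share of variation explained can be read off the component variances:

```r
# Simulated stand-in data: 62 observations on four measurements,
# two of which (m1, m2) are strongly related. All values are made up.
set.seed(1)
n <- 62
z <- rnorm(n)
x <- cbind(m1 = z + rnorm(n, sd = 0.5),
           m2 = z + rnorm(n, sd = 0.5),
           m3 = rnorm(n),
           m4 = rnorm(n))

pca <- prcomp(x, center = TRUE, scale. = TRUE)

scores   <- pca$x         # 62 x 4 matrix of principal component scores
loadings <- pca$rotation  # 4 x 4 matrix of weights (loadings)

# The cross-product matrix of the scores is diagonal
# (the off-diagonal entries are zero up to rounding error).
cp <- crossprod(scores)
print(round(cp, 8))

# Cumulative proportion of the variation explained by the components.
cum.prop <- cumsum(pca$sdev^2) / sum(pca$sdev^2)
print(cum.prop)

# Approximate the standardized data using only the first two components.
approx2 <- scores[, 1:2] %*% t(loadings[, 1:2])
```

With real data, the entries of cum.prop show how many components are needed to reach a given share of the variation, such as the 85% used in the example above.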

The principal components can be used in regression analysis. In the example above, the first two principal components could be used in place of the four original measurements to get rid of problems from multicollinearity. The principal components can also be used in cluster analysis. By plotting the first two scores against each other and labeling the points with the row labels, one can see if different observations cluster together. By plotting the first two loadings against each other, we can see if there is clustering within the measurement variables. The principal components can be used to see underlying relationships within the data. The loadings can have meaning in terms of understanding the data.

The Mathematics Behind Principal Component Analysis

Given a data set, the loadings are the eigenvectors of either the variance-covariance matrix or the correlation matrix of the data set. The two methods do not give the same result. Usually, the correlation matrix is used because in the correlation matrix the variables are normalized for size and spread. The scores are found by multiplying the original data – the normalized data if the correlation matrix is used – on the right by the eigenvectors. In the matrix of eigenvectors, the eigenvectors are in the columns.
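
The calculation described above can be sketched directly with eigen() in base R. The data here are simulated and the object names are illustrative; the comparison with prcomp() is only a sanity check under those assumptions, since the sign of each eigenvector is arbitrary.

```r
# Simulated data (made up): 62 rows, four columns, with the first two
# columns correlated.
set.seed(2)
x <- matrix(rnorm(62 * 4), ncol = 4,
            dimnames = list(NULL, c("a", "b", "c", "d")))
x[, 2] <- x[, 1] + 0.5 * x[, 2]

# Correlation method: eigenvectors of the correlation matrix,
# applied to the normalized (centered and scaled) data.
e        <- eigen(cor(x))
loadings <- e$vectors          # eigenvectors are in the columns
z        <- scale(x)           # normalized data
scores   <- z %*% loadings     # multiply on the right by the eigenvectors

# Covariance method: eigenvectors of the variance-covariance matrix,
# applied to the centered (but not scaled) data.
e.cov      <- eigen(cov(x))
scores.cov <- scale(x, scale = FALSE) %*% e.cov$vectors

# prcomp() with scale. = TRUE should agree with the correlation method
# up to the sign of each column.
p <- prcomp(x, scale. = TRUE)
max.diff <- max(abs(abs(unclass(p$x)) - abs(scores)))
print(max.diff)
```

The variance of each column of scores equals the corresponding eigenvalue in e$values, which is one way to confirm that the eigenvectors were applied to the right version of the data.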

Eigenvalues have an important role in principal component analysis. Eigenvectors are sorted by the sizes of the corresponding eigenvalues, from largest to smallest. Each eigenvalue of a correlation matrix is proportional to the share of the variation in the data set explained by the principal component associated with that eigenvalue. The sum of the eigenvalues of a correlation matrix equals the dimension of the correlation matrix (its trace), which equals the number of columns in the data set. An eigenvector is determined only up to a scalar multiple, so the magnitudes of its elements are meaningful only relative to one another until a normalization is fixed. In the above explanation, we assume the eigenvectors are normalized such that the inner product of each eigenvector with itself equals one, which is usual.
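
These properties can be checked on a small example. The 3 x 3 correlation matrix below is hypothetical and is used only to illustrate the ordering, the sum of the eigenvalues, and the normalization that eigen() applies.

```r
# A small hypothetical correlation matrix (values made up for illustration).
R <- matrix(c(1.0, 0.6, 0.2,
              0.6, 1.0, 0.4,
              0.2, 0.4, 1.0), nrow = 3, byrow = TRUE)

e <- eigen(R)

print(e$values)                    # sorted from largest to smallest
print(sum(e$values))               # equals the dimension of R, here 3
prop <- e$values / sum(e$values)   # share of variation for each component
print(prop)

# eigen() returns eigenvectors normalized to unit length.
print(colSums(e$vectors^2))

# A rescaled eigenvector is still an eigenvector for the same eigenvalue,
# which is why a normalization has to be chosen.
v <- 2 * e$vectors[, 1]
resid <- max(abs(R %*% v - e$values[1] * v))
print(resid)
```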

An Example Using the Deficit by Political Party Data Set

We will use the data in The Deficit by Political Party Data Set* to demonstrate the power of principal component analysis as a clustering tool. We look at the size of the on-budget deficit (-) (surplus (+)) as a percentage of the gross domestic product using the political affiliation of the President and the controlling parties of the Senate and the House of Representatives. Since the budget is created in the year before the budget is in effect, the deficit (-) (surplus (+)) percentages are associated with the political parties in power the year before the end of the budget year. We will only look at years between the extreme deficits of the World War II years and the current extreme deficits.

In order to code the categorical values for political party numerically, we let Democrats take on the value of ‘1’ and Republicans take on the value of ‘-1’. Our data set has four variables and 62 observations. The variables are the deficit (-) (surplus (+)) as a percentage of the gross domestic product for the years 1947 to 2008, the political party of the President for the years 1946 to 2007, the political party controlling the Senate for the years 1946 to 2007, and the political party controlling the House for the years 1946 to 2007. Below are two plots of the rotation of the four variable dimensions projected onto the two dimensional plane using the first two eigenvectors, each multiplied by the square root of the eigenvector’s eigenvalue, one for the covariance method of calculation and the other for the correlation method.
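
The coding step might look something like the sketch below. The post does not show the code used for it, so the helper code.party() is hypothetical, and it assumes the parties are stored as three-letter strings in the order President, Senate, House. Only four years are shown, with the deficit values rounded; each deficit year is paired with the parties of the year before.

```r
# Parties in power (President, Senate, House) for 1946 to 1949 and the
# deficit (-) / surplus (+) percentages for 1947 to 1950 (rounded).
party   <- c("1946" = "DDD", "1947" = "DRR", "1948" = "DRR", "1949" = "DDD")
deficit <- c("1947" = 1.22684, "1948" = 4.00152,
             "1949" = -0.25289, "1950" = -1.68459)

# Hypothetical helper: code one position of the party string,
# Democrat -> 1 and Republican -> -1.
code.party <- function(s, pos) {
  unname(ifelse(substr(s, pos, pos) == "D", 1, -1))
}

# One row per budget year; the party columns are lagged by one year.
dat <- data.frame(deficit   = unname(deficit),
                  president = code.party(party, 1),
                  senate    = code.party(party, 2),
                  house     = code.party(party, 3),
                  row.names = names(deficit))
print(dat)
```

A data frame built this way for all 62 years could then be passed to prcomp() or eigen(cor()) to carry out the analysis.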

Normally, in principal component analysis, a biplot is created containing plots of both the loadings and the scores. Because three of our variables are dichotomous, the plots of the scores are not very interesting and I have not included the plots. For the covariance method, the first two principal components account for 86% of the variation, and for the correlation method, the first two principal components account for 79% of the variation.

We can see that the parties of the Presidents and the deficits are strongly correlated with each other in either plot, while the Senates and Houses are more weakly correlated with each other in the correlation plot but strongly correlated in the covariance plot. The deficits and the parties of the Presidents do not seem to be strongly correlated with the parties of the Senates or Houses in either plot. In the covariance plot, the lengths of the vectors differ much more than in the correlation plot, since the measurements are not standardized. Basically, the two plots give much the same information.

Below is a table of the eigenvectors and eigenvalues for the correlation method.

variable     eigenvector 1  eigenvector 2  eigenvector 3  eigenvector 4
deficit              0.380          0.583          0.712          0.090
president            0.378          0.587         -0.701          0.145
senate              -0.539          0.487         -0.024         -0.687
house               -0.650          0.279          0.029          0.706
eigenvalue           1.683          1.462          0.506          0.350
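
The numbers in the table can be used to check the statements made earlier. The sketch below types in the rounded table values by hand (the object names evec, evals, and scaled are illustrative) and then computes the share of variation explained and the scaled loadings of the kind used in the plots, that is, each eigenvector multiplied by the square root of its eigenvalue. Because the table values are rounded, the checks hold only approximately.

```r
vars <- c("deficit", "president", "senate", "house")

# Eigenvectors from the table, one column per eigenvector.
evec <- matrix(c( 0.380, 0.583,  0.712,  0.090,
                  0.378, 0.587, -0.701,  0.145,
                 -0.539, 0.487, -0.024, -0.687,
                 -0.650, 0.279,  0.029,  0.706),
               nrow = 4, byrow = TRUE,
               dimnames = list(vars, paste0("PC", 1:4)))

# Eigenvalues from the last row of the table.
evals <- c(1.683, 1.462, 0.506, 0.350)

print(sum(evals))              # close to 4, the number of variables
cum.prop <- cumsum(evals) / sum(evals)
print(round(cum.prop, 3))      # the second entry is about 0.79

print(colSums(evec^2))         # each eigenvector has length close to 1

# Scale each eigenvector by the square root of its eigenvalue.
scaled <- sweep(evec, 2, sqrt(evals), "*")

# Plot the first two scaled loadings as arrows from the origin.
plot(scaled[, 1:2], type = "n", xlim = c(-1, 1), ylim = c(-1, 1),
     xlab = "first component", ylab = "second component")
arrows(0, 0, scaled[, 1], scaled[, 2], length = 0.1)
text(scaled[, 1:2], labels = vars, pos = 3)
```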

We can see from the table that the first eigenvector is a contrast between the deficit / President and the Senate / House.  Democrats have a positive value and Republicans a negative value in the normalized matrix of measurements. So, for the deficit and the party of the President, deficits that are positive in the normalized data are positively associated with Democratic presidents and deficits that are negative in the normalized data are positively associated with Republican presidents. For the House and Senate, Republican Senates are associated with Republican Houses and Democratic Senates are associated with Democratic Houses. Together, positive normalized budget deficits and Democratic presidents align with Republican Senates and Houses and contrast with Democratic Senates and Houses.

The second eigenvector lumps together, weakly, the deficit, President, Senate, and House. Here, Democrats at all levels are aligned with positive values of the normalized deficit, and Republicans at all levels are aligned with negative values. Also, Democrats are aligned with Democrats and Republicans are aligned with Republicans.

The third eigenvector is mainly a contrast between the deficit and the party of the President. Positive normalized deficits are contrasted with Democratic presidents and aligned with Republican presidents.

The fourth eigenvector is mainly a contrast between the parties of the Senate and House. Democratic Senates are contrasted with Democratic Houses and aligned with Republican Houses and vice versa. The last two eigenvectors only account for a small portion of the variation.

*The Deficit by Political Party Data Set

The deficit by political party data set contains data on the total and on-budget deficits, the gross domestic product, and the political parties controlling the presidency, the senate, and the house for the years 1940 to 2011. The data for the deficits and the gross domestic product are from the Office of Management and Budget and can be found at http://www.whitehouse.gov/omb/budget/Historicals. Before 1978, the budget year ran from July 1st to June 30th. From 1978 onward, the budget year ran from October 1st to September 30th. The tables contain a one quarter correction between the years 1977 and 1978, which I have ignored. The data for 2011 is estimated. The year with which the deficit and gross domestic product are associated is the year at the end of the budget year for which the deficit and gross domestic product are calculated.

The data for the controlling political parties of the president, senate, and house were taken from the website of Dave Manuel, http://www.davemanuel.com/history-of-deficits-and-surpluses-in-the-united-states.php, a good resource on the history of the deficit.

Box Plots: A Political Example

BOX PLOTS

Box plots are used to summarize the distribution of the values of a variable in a data set. In the example below, the values of the variable – on budget deficit as a percentage of Gross Domestic Product – are compared for differing values of a second variable, a categorical variable describing the controlling political parties of the President, Senate, and House of Representatives. The heavy horizontal line in each box is plotted at the median value for the data group. The top of the box is plotted at the 75th percentile while the bottom of the box is plotted at the 25th percentile. The top and bottom whiskers end at the most extreme values of the points not calculated to be outliers. The circles represent points calculated to be outliers.
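
The quantities drawn in a box plot can be inspected with boxplot.stats() from base R. The sketch below uses a small made-up vector. Note that R places the edges of the box at the hinges, which are close to but not always identical to the 25th and 75th percentiles, and by default treats points more than 1.5 times the box height beyond the box as outliers.

```r
# A small made-up sample with one unusually large value.
x <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 30)

s <- boxplot.stats(x)

# Lower whisker end, lower hinge, median, upper hinge, upper whisker end.
print(s$stats)

# Points plotted as circles (calculated to be outliers).
print(s$out)

# The same numbers are what boxplot() draws.
boxplot(x)
```

For this sample the whiskers end at 1 and 9, the box runs from 3 to 8 with the median line at 5.5, and the value 30 is drawn as an outlier.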

The data is from a deficit by political party data set. The deficit by political party data set contains data on the on-budget deficits as a percent of the gross domestic product for the years 1940 to 2015, and the political parties controlling the presidency, the senate, and the house for the years 1939 to 2014. The data for the deficit and the gross domestic product are from the Office of Management and Budget and can be found at http://www.whitehouse.gov/omb/budget/Historicals (except the first 8 values for the deficit data, which were from a series published earlier). Before 1978, the budget year ran from July 1st to June 30th. From 1978 onward, the budget year ran from October 1st to September 30th. The tables contain a one quarter correction between the years 1977 and 1978, which I have ignored. The year with which the deficit and gross domestic product are associated is the year at the end of the budget year for which the deficit and gross domestic product are calculated.

For 1940 to 2011, the data for the controlling political parties of the president, senate, and house were taken from the website of Dave Manuel, http://www.davemanuel.com/history-of-deficits-and-surpluses-in-the-united-states.php, a good resource on the history of the deficit. The last three years were taken from the website https://en.wikipedia.org/wiki/United_States_Presidents_and_control_of_Congress.

We look at the size of the deficit (-) / surplus (+) using the political affiliations of the President and the controlling parties of the Senate and the House of Representatives. Below are two sets of box plots, one set for the full number of years in the data set and one set for a reduced number of years. Since the budget is created in the year before the budget is in effect, the deficit (-) / surplus (+) percentages are associated with the political parties in power the year before the end of budget year.

The first set of box plots is for the years 1940 to 2015 with regard to the budgets and for the years 1939 to 2014 with regard to the party affiliations. We can see in the first set of box plots that in the years of a Democratic President, House, and Senate, there were some very large deficits. These large deficits occurred during World War II. The large deficit under a Republican President and Democratic Senate and House was in 2009, a budget associated with the last year of the G. W. Bush presidency. The large surplus in the years of a Democratic President and a Republican Senate and House was in the year 1948, when Truman was president.

The second set of box plots is for the years 1947 to 2015 with regard to the budgets and for the years 1946 to 2014 with regard to the party affiliations. With the years of extreme deficits removed, there is a clearer picture of how the differing party combinations perform in more normal years. The rather large deficits under the Democrats controlling all bodies were under Obama in his first term, as were the four deficits from 2012 to 2015, when Obama was president, the Democrats controlled the Senate, and the Republicans controlled the House. Obama inherited a dangerous economic situation and got us through the economic turmoil, but in the process, Obama was running quite large deficits.

The data is here:

comb.party.39to14        defsur.40to15
"1939" "DDD"             "1940" "-3.59917"
"1940" "DDD"             "1941" "-4.90272"
"1941" "DDD"             "1942" "-14.78378"
"1942" "DDD"             "1943" "-30.83472"
"1943" "DDD"             "1944" "-23.29589"
"1944" "DDD"             "1945" "-22.00542"
"1945" "DDD"             "1946" "-7.62084"
"1946" "DDD"             "1947" "1.22684"
"1947" "DRR"             "1948" "4.0015243902439"
"1948" "DRR"             "1949" "-0.252890173410405"
"1949" "DDD"             "1950" "-1.68458781362007"
"1950" "DDD"             "1951" "1.31337813072694"
"1951" "DDD"             "1952" "-0.951048951048951"
"1952" "DDD"             "1953" "-2.16993464052288"
"1953" "RRD"             "1954" "-0.722207892700542"
"1954" "RRD"             "1955" "-1.00737100737101"
"1955" "RDD"             "1956" "0.569476082004556"
"1956" "RDD"             "1957" "0.560103403705299"
"1957" "RDD"             "1958" "-0.695762175838077"
"1958" "RDD"             "1959" "-2.39319620253165"
"1959" "RDD"             "1960" "0.0934404784152495"
"1960" "RDD"             "1961" "-0.693937180423667"
"1961" "DDD"             "1962" "-1.00528199011757"
"1962" "DDD"             "1963" "-0.645890521556596"
"1963" "DDD"             "1964" "-0.980540051289787"
"1964" "DDD"             "1965" "-0.225130153369917"
"1965" "DDD"             "1966" "-0.396470136846144"
"1966" "DDD"             "1967" "-1.50322118826056"
"1967" "DDD"             "1968" "-3.08017346825309"
"1968" "DDD"             "1969" "-0.0509009467576097"
"1969" "RDD"             "1970" "-0.829282241921647"
"1970" "RDD"             "1971" "-2.33181452693648"
"1971" "RDD"             "1972" "-2.14022140221402"
"1972" "RDD"             "1973" "-1.12094395280236"
"1973" "RDD"             "1974" "-0.484457004440856"
"1974" "RDD"             "1975" "-3.35899664721222"
"1975" "RDD"             "1976" "-3.87644528849913"
"1976" "RDD"             "1977" "-2.46006704791954"
"1977" "DDD"             "1978" "-2.43174435958213"
"1978" "DDD"             "1979" "-1.5408560311284"
"1979" "DDD"             "1980" "-2.61370137299771"
"1980" "DDD"             "1981" "-2.35470303339281"
"1981" "RRD"             "1982" "-3.63921663297022"
"1982" "RRD"             "1983" "-5.86540905368388"
"1983" "RRD"             "1984" "-4.68781623153208"
"1984" "RRD"             "1985" "-5.18686774072686"
"1985" "RRD"             "1986" "-5.24459337316197"
"1986" "RRD"             "1987" "-3.52161274807085"
"1987" "RDD"             "1988" "-3.73028651238579"
"1988" "RDD"             "1989" "-3.68761220825853"
"1989" "RDD"             "1990" "-4.693470395293"
"1990" "RDD"             "1991" "-5.26014304184874"
"1991" "RDD"             "1992" "-5.29006791303402"
"1992" "RDD"             "1993" "-4.42096278090921"
"1993" "DDD"             "1994" "-3.59554308260858"
"1994" "DDD"             "1995" "-2.9854682596197"
"1995" "DRR"             "1996" "-2.18091573392828"
"1996" "DRR"             "1997" "-1.21652206714447"
"1997" "DRR"             "1998" "-0.333899137892527"
"1998" "DRR"             "1999" "0.0199779191420009"
"1999" "DRR"             "2000" "0.851382511184249"
"2000" "DRR"             "2001" "-0.306684588152888"
"2001" "RDR"             "2002" "-2.91811085879249"
"2002" "RDR"             "2003" "-4.75097949242879"
"2003" "RRR"             "2004" "-4.69864169548169"
"2004" "RRR"             "2005" "-3.82965187098977"
"2005" "RRR"             "2006" "-3.17507873756823"
"2006" "RRR"             "2007" "-2.38918096195603"
"2007" "RDD"             "2008" "-4.3504785661994"
"2008" "RDD"             "2009" "-10.7509053320938"
"2009" "DDD"             "2010" "-9.26715545494476"
"2010" "DDD"             "2011" "-8.88732833957553"
"2011" "DDR"             "2012" "-7.16843865428771"
"2012" "DDR"             "2013" "-4.35807759681418"
"2013" "DDR"             "2014" "-2.99182355166293"
"2014" "DDR"             "2015" "-2.61579248907512"

The code for the plot is here:

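# Draws two side-by-side sets of box plots of the on-budget deficit (-) /
# surplus (+) as a percentage of GDP, grouped by party control.
# Assumes defsur.40to15 (numeric, 76 values for 1940 to 2015) and
# comb.party.39to14 (party codes such as "DDD" for 1939 to 2014) exist
# in the workspace. The function name is added here for convenience.
deficit.boxplots <- function () {
    # One row, two columns; the outer margin leaves room for the main title.
    par(mfrow = c(1, 2), oma = c(2, 2, 4, 2) + 0.1)
    # Full period. Empty xlab/ylab suppress the default axis labels that
    # newer versions of R may add; title() supplies them instead.
    boxplot(defsur.40to15 ~ comb.party.39to14, cex.axis = 0.8,
            xlab = "", ylab = "")
    title(main = "Percentage of GDP - 1940 to 2015\nPolitical Parties - 1939 to 2014",
          xlab = "President, Senate, House", ylab = "Percentage",
          cex.main = 1, font.main = 1)
    # Elements 8 to 76 drop the World War II years.
    boxplot(defsur.40to15[8:76] ~ comb.party.39to14[8:76], cex.axis = 0.8,
            xlab = "", ylab = "")
    title(main = "Percentage of GDP - 1947 to 2015\nPolitical Parties - 1946 to 2014",
          xlab = "President, Senate, House", ylab = "Percentage",
          cex.main = 1, font.main = 1)
    mtext("On Budget Deficit as a Percentage of GDP\nGrouped by Political Party Controlling the Presidency, Senate, and House\nParty Lagged One Year",
          side = 3, cex = 1, font = 1, outer = TRUE)
}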
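
The subsetting by 8:76 in the function relies on the order of the two vectors. As a quick check of that index arithmetic, assuming the vectors are stored in year order as in the listing above (the names budget.years, party.years, and keep are illustrative):

```r
# Budget years in the data set and the party years paired with them
# (the party year is the year before the end of the budget year).
budget.years <- 1940:2015
party.years  <- budget.years - 1

print(length(budget.years))        # number of observations

# Elements 8 to 76 are the ones used in the second set of box plots.
keep <- 8:76
print(range(budget.years[keep]))   # first and last budget year kept
print(range(party.years[keep]))    # first and last party year kept
```

The kept elements run from 1947 to 2015 for the budgets and from 1946 to 2014 for the parties, which matches the titles of the second set of box plots.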