Multiple Correspondence Analysis: A Political Example

MULTIPLE CORRESPONDENCE ANALYSIS

Correspondence analysis and multiple correspondence analysis are techniques from multivariate analysis. You can use the techniques to find clusters in a data set. Correspondence and multiple correspondence analysis are similar to principal component analysis, in that the analysis attempts to reduce the dimensions (number of columns or rows) of a set of intercorrelated variables so that the smaller dimensioned (number of columns or rows) variables explain most of the variation in the original variables. However, correspondence and multiple correspondence analysis are for categorical variables rather than the numerical variables of principal component analysis. Correspondence analysis was developed by Jean Paul Benzecri and measures similarities of patterns in contingency tables.

The Mathematics Behind Correspondence Analysis

Correspondence analysis is used in the analysis of just two categorical variables. In correspondence analysis, the reduced variables are found by applying singular value decomposition to a transformation of the contingency table created from the two original variables. The transformation replaces the value in each cell of the contingency table by the original value minus the product of the row total and the column total divided by the overall total, with the difference divided by the square root of the product of the row total and the column total. The resulting cells then contain the signed square roots of the terms used in the calculation of the chi square test for independence for a contingency table, divided by the square root of the overall total.

The Mathematics Behind Multiple Correspondence Analysis

For multiple correspondence analysis, more than two categorical variables are reduced. The reduced set of variables is found by applying correspondence analysis to one of two matrices. The first matrix is a matrix made up of observations in the rows and indicator variables in the columns, where the indicator variables take on the value one if the observation has a quality measured by the variable and zero if the individual does not. For example, say there are three variables, ‘gender’, ‘hair color’, and ‘skin tone’. Say that the categories for gender are ‘female’, ‘male’, ‘prefer not to answer’; for hair color, ‘red’, ‘blond’, ‘brown’, ‘black’; and for skin tone, ‘light’, ‘medium’, and ‘dark’, then, there would be three columns associated with gender and, for a given person, only one would contain a one, the others would contain zeros; there would be four columns associated with hair color and, for a given person, only one would contain a one, others would contain zeros; and there would be three columns associated with skin tone and, for a given person, only one would contain a one, the others would contain zeros.

The second type of matrix is a Burt table. A Burt table is multiple sort of contingency table. The contingency tables between the variables make up blocks of the matrix. From the example above, the first block is the contingency table of gender by gender, and is made of up of a diagonal matrix with the counts of males, females, and those who did not want to answer on the diagonal. The second block, going horizontally, is the contingency table of gender by hair color. The third block, going horizontally, is the contingency table of gender by skin tone. The second block, going vertically, is the contingency table of hair color by gender. The rest of the blocks are found similarly.

Plotting for Clustering

Once the singular value decomposition is done and the reduced variables are found, the variables are usually plotted to look for clustering of attributes. (For the above example, some of the attributes are brown hair, male, red hair, light skin tone, each of which would be one point on the plot.) Usually just the first two dimensions of the reduced matrix are plotted, though more dimensions can be plotted. The dimensions are ordered with respect to how much of the variation in the input matrix the dimension explains, so the first two dimensions are the dimensions that explain the most variation. With correspondence analysis, the reduced dimensions are with respect to the contingency table. Both the attributes of the rows and the attributes of the columns are plotted on the same plot. For multiple correspondence analysis, the reduced dimensions are with respect to the matrix used in the calculation. If one uses the indicator variable matrix, one can plot just the attributes for the columns or one can also plot labels for the rows on the same plot (or plot just the labels for the rows). If one uses the Burt table, one can only plot attributes for the columns. For multiple correspondence analysis, the plots for the columns are the same by either method, thought the scaling may be different.

Interpreting the Plot

The interpretation of the row and column variables in correspondence anaylsis is done separately. Row variables are compared as to level down columns and column variables are compared as to level across rows. However if a row variable is near a column variable in the plot, then both are represented at similar relative levels in the contingency table.

The interpretation of the relationship between the variables in the indicator or Burt table is a bit subtle. Within an attribute, the levels of the attribute are compared to each other on the plot. Between attributes, points that are close together are seen at similar levels.

An Example Using the Deficit by Political Party Data Set

Below is a multiple correspondence plot for which only the column reduction is plotted. We look at the size of the deficit (-) / surplus (+) using the political affiliation of the President and the controlling parties of the Senate and the House of Representatives. The size of the onbudget budget deficit (-) / surplus(+) as a percentage of gross domestic product for the years 1947 to 2008 was classified into four classes. The largest deficit over the years was -6.04 percent and the largest surplus was 4.11 percent. The class breaks were -6.2, -4, -2, 0, and 4.2, which gives four classes. (There was only one observation with a surplus greater than 2, so years of surplus were all classed together.) Each of the President, Senate, and House variables were classified into two classes, either Democrat or Republican. Since the budget is created in the year before the budget is in effect, the deficit (-) / surplus (+) percentages are associated with the political parties in power the year before the end of the budget year.

The first principal axis appears to measure distance between the relative counts for the parties of the senate, and house. On the plot, the greatest distances on the first principal axis are between the Republican and Democratic senates and the Republican and Democratic houses. Looking at the tables below, the lowest counts were for the Republican senates and houses, which means the highest counts were for the Democratic senates and houses. For the deficit classes, classes (0,4.2] is the smallest class in the table and, like the Republican senates and houses, is to the left on the plot. Still there is not much difference between the deficit classes on the first principal axis (or the parties of the presidents).

The second principal axis appears to show how the differing levels of deficit (or surplus) are associated with the political parties of the presidents, senates, and houses. Looking at the senates and houses, there is not a lot of difference between the Democrats and the Republicans on the second principal axis, both cluster around the deficit class (-4,-2]. For the presidents, however, there is a great difference. Democratic presidents are near the deficit classes (-2,0] and (0,4.2] on the second principal axis while Republican presidents are between the deficit classes (-4,-2] and (-6.2,-4].

Below are two classifications of the data.

 (-6.2,-4] (-4,-2] (-2,0] (0,4.2] 11 22 21 8
 Party President Senate House Democrat 27 42 48 Republican 35 20 14

In this example, we have treated the deficit, which is a numeric variable, as a categorical variable. In the post before this post, an example of principal component analysis, the same data is analyzed, with all of the variables treated as numeric rather than categorical.

Some Tools for Politicians – Researchers – Pundits

Word clouds tell you what people are saying about candidates:

You can see that tweets containing Hillary or Clinton and Bernie or Sanders have more words than tweets containing just Cruz or just Trump.  Tweets are centered on Indiana, which was just having their primary.  Each word cloud was based on 100 tweets.

You can use Box plots to provide visual information about a Data Set:

For example, You might find it is Useful to know How the different Mixes of Political Parties performed with respect to the Budget Deficit, using the Majority Party of the House and Senate and the Political Party of the President.

Bar graphs compare categories:

Comparing the different Tools governments use to Raise Money might be of interest to you for setting or evaluating policy on taxes.

Principal component plots show how numeric data clusters:

You can see from the plot that the parties of the Senate and House cluster together – Republican  Houses and Senates are similar and Democratic Houses and Senates are similar – and the  party of the President clusters with the Budget Deficit. You could find this information very useful.  These plots do not include the Obama years or the years of WWII.

Multiple correspondence plots show how categorical data clusters:

You can see that the House and Senate cluster together for both Republicans and Democrats while Democratic presidents cluster with small deficits than Republican presidents. This might be of interest for your campaign or analysis.

Log linear modeling allows comparisons between classes:

Expected Cell Counts Normalized to Percentages by Row
For the On Budget Deficit (-) (Surplus (+)) per GDP
by the Controlling Political Parties
of the Presidency, Senate, and House
Budget Deficit from 1947 to 2008Political Parties Lagged One Year
Rows Sum to 100 Except for Rounding
President Senate House (-6.2,4] (-4,-2] (-2,0] (0,4.2]
D D D 0% 30% 55% 15%
R
R D
R 0% 29% 42% 29%
R D D 33% 40% 21% 7%
R 26% 42% 17% 15%
R D 33% 40% 21% 7%
R 26% 42% 17% 15%

In the table, each row is the combination of the  Presidential, Senate, and House parties. You find in the rows the log linear modeled percentage of years in which the President, Senate, and House combination falls in each Budget Deficit class. The Party of the Senate does not affect the outcome and the Parties of the President and House do – Republican Houses do a bit better –  Democratic Presidents do much better – which might be of interest to you.

Odds ratios can be used to compare parties:

Iowa 2013 Legislature
Odds Legislator is Male
House Senate
Democrats 3 : 2 10 : 3
Republicans 16 : 2 10 : 2

You could be interested in the odds a legislator is a male, by party.
For both the House and Senate in Iowa, Democrats have lower odds of being male than Republicans. Depending on your political philosophy, this could be bad or good.

Box Plots: A Political Example

BOX PLOTS

Box plots are used to provide information to you about the sizes of a variable in a data set. In the example below, the sizes of the variable – on budget deficit as a percentage of Gross Domestic Product – are compared for differing values of a second variable, a categorical variable describing the controlling political parties of the President, Senate, and House of Representatives. The heavy horizontal line in each box is plotted at the median value for the data group. The top of the box is plotted at the 75th percentile while the bottom of the box is plotted at the 25th percentile. The top and bottom whiskers end at the most extreme values of the points not calculated to be outliers. The circles represent points calculated to be outliers.

The data is from a deficit by political party data set. The deficit by political party data set contains data on the on balance deficits as a percent of the gross domestic product for the years 1940 to 2015, and the political parties controlling the presidency, the senate, and the house for the years 1939 to 2014. The data for the deficit and the gross domestic product are from the Office of Management and Budget and can be found at http://www.whitehouse.gov/omb/budget/Historicals (except the first 8 values for the deficit data, which were from a series published earlier). Before 1978, the budget year ran from July 1st to June 30th. From 1978 onward, the budget year ran from October 1st to September 31st. The tables contain a one quarter correction between the years 1977 and 1978, which I have ignored. The year with which the deficit and gross domestic product are associated is the year at the end of the budget year for which the deficit and gross domestic product are calculated.

For 1940 to 2011, the data for the controlling political parties of the president, senate, and house were taken from the website of Dave Manuel,
http://www.davemanuel.com/history-of-deficits-and-surpluses-in-the-united-states.php
, a good resource on the history of the deficit.  The last three years were taken from the website https://en.wikipedia.org/wiki/United_States_Presidents_and_control_of_Congress

We look at the size of the deficit (-) / surplus (+) using the political affiliations of the President and the controlling parties of the Senate and the House of Representatives. Below are two sets of box plots, one set for the full number of years in the data set and one set for a reduced number of years. Since the budget is created in the year before the budget is in effect, the deficit (-) / surplus (+) percentages are associated with the political parties in power the year before the end of budget year.

The first set of box plots are for the years 1940 to 2015 with regard to the budgets and for the years 1939 to 2014 with regard to the party affiliations. We can see in the first set of box plots that in the years of a Democratic President, House, and Senate, there were some very large deficits. These large deficits occurred during World War II. The large deficit under a Republican President and Democratic Senate and House was in 2009, which budget was associated with the last year of the G. W. Bush presidency. The large surplus in the years of a Democratic President and a Republican Senate and House was in the year 1948, when Truman was president.

The second set of box plots are for the years 1947 to 2015 with regard to the budgets and for the years 1946 to 2014 with regard to the party affiliations. With the years of extreme deficits removed, there is a clearer picture of how the differing party combinations perform in more normal years.  The rather large deficits under the Democrats controlling all bodies were under Obama in his first term, as were the four deficits from 2012 to 2015, when Obama was president, the Democrats controlled the Senate, and the Republicans controlled the House.  Obama inherited a dangerous economic situation and got us through the economic turmoil, but in the process, Obama was running quite large deficits.

The data is here:

comb.party.39to14        defsur.40to15
“1939” “DDD”             “1940” “-3.59917”
“1940” “DDD”             “1941” “-4.90272”
“1941” “DDD”             “1942” “-14.78378”
“1942” “DDD”             “1943” “-30.83472”
“1943” “DDD”             “1944” “-23.29589”
“1944” “DDD”             “1945” “-22.00542”
“1945” “DDD”             “1946” “-7.62084”
“1946” “DDD”             “1947” “1.22684”
“1947” “DRR”             “1948” “4.0015243902439”
“1948” “DRR”             “1949” “-0.252890173410405”
“1949” “DDD”             “1950” “-1.68458781362007”
“1950” “DDD”             “1951” “1.31337813072694”
“1951” “DDD”             “1952” “-0.951048951048951”
“1952” “DDD”             “1953” “-2.16993464052288”
“1953” “RRD”             “1954” “-0.722207892700542”
“1954” “RRD”             “1955” “-1.00737100737101”
“1955” “RDD”             “1956” “0.569476082004556”
“1956” “RDD”             “1957” “0.560103403705299”
“1957” “RDD”             “1958” “-0.695762175838077”
“1958” “RDD”             “1959” “-2.39319620253165”
“1959” “RDD”             “1960” “0.0934404784152495”
“1960” “RDD”             “1961” “-0.693937180423667”
“1961” “DDD”             “1962” “-1.00528199011757”
“1962” “DDD”             “1963” “-0.645890521556596”
“1963” “DDD”             “1964” “-0.980540051289787”
“1964” “DDD”             “1965” “-0.225130153369917”
“1965” “DDD”             “1966” “-0.396470136846144”
“1966” “DDD”             “1967” “-1.50322118826056”
“1967” “DDD”             “1968” “-3.08017346825309”
“1968” “DDD”             “1969” “-0.0509009467576097”
“1969” “RDD”             “1970” “-0.829282241921647”
“1970” “RDD”             “1971” “-2.33181452693648”
“1971” “RDD”             “1972” “-2.14022140221402”
“1972” “RDD”             “1973” “-1.12094395280236”
“1973” “RDD”             “1974” “-0.484457004440856”
“1974” “RDD”             “1975” “-3.35899664721222”
“1975” “RDD”             “1976” “-3.87644528849913”
“1976” “RDD”             “1977” “-2.46006704791954”
“1977” “DDD”             “1978” “-2.43174435958213”
“1978” “DDD”             “1979” “-1.5408560311284”
“1979” “DDD”             “1980” “-2.61370137299771”
“1980” “DDD”             “1981” “-2.35470303339281”
“1981” “RRD”             “1982” “-3.63921663297022”
“1982” “RRD”             “1983” “-5.86540905368388”
“1983” “RRD”             “1984” “-4.68781623153208”
“1984” “RRD”             “1985” “-5.18686774072686”
“1985” “RRD”             “1986” “-5.24459337316197”
“1986” “RRD”             “1987” “-3.52161274807085”
“1987” “RDD”             “1988” “-3.73028651238579”
“1988” “RDD”             “1989” “-3.68761220825853”
“1989” “RDD”             “1990” “-4.693470395293”
“1990” “RDD”             “1991” “-5.26014304184874”
“1991” “RDD”             “1992” “-5.29006791303402”
“1992” “RDD”             “1993” “-4.42096278090921”
“1993” “DDD”             “1994” “-3.59554308260858”
“1994” “DDD”             “1995” “-2.9854682596197”
“1995” “DRR”             “1996” “-2.18091573392828”
“1996” “DRR”             “1997” “-1.21652206714447”
“1997” “DRR”             “1998” “-0.333899137892527”
“1998” “DRR”             “1999” “0.0199779191420009”
“1999” “DRR”             “2000” “0.851382511184249”
“2000” “DRR”             “2001” “-0.306684588152888”
“2001” “RDR”             “2002” “-2.91811085879249”
“2002” “RDR”             “2003” “-4.75097949242879”
“2003” “RRR”             “2004” “-4.69864169548169”
“2004” “RRR”             “2005” “-3.82965187098977”
“2005” “RRR”             “2006” “-3.17507873756823”
“2006” “RRR”             “2007” “-2.38918096195603”
“2007” “RDD”             “2008” “-4.3504785661994”
“2008” “RDD”             “2009” “-10.7509053320938”
“2009” “DDD”             “2010” “-9.26715545494476”
“2010” “DDD”             “2011” “-8.88732833957553”
“2011” “DDR”             “2012” “-7.16843865428771”
“2012” “DDR”             “2013” “-4.35807759681418”
“2013” “DDR”             “2014” “-2.99182355166293”
“2014” “DDR”             “2015” “-2.61579248907512”

The code for the plot is here:

function () {
par(mfrow = c(1, 2), oma = c(2, 2, 4, 2) + 0.1)
boxplot(defsur.40to15 ~ comb.party.39to14, cex.axis = 0.8)
title(main = “Percentage of GDP – 1940 to 2015\nPolitical Parties – 1939 to 2014”,
xlab = “President, Senate, House”, ylab = “Percentage”,
cex.main = 1, font.main = 1)
boxplot(defsur.40to15[8:76] ~ comb.party.39to14[8:76], cex.axis = 0.8)
title(main = “Percentage of GDP – 1947 to 2015\nPolitical Parties – 1946 to 2014”,
xlab = “President, Senate, House”, ylab = “Percentage”,
cex.main = 1, font.main = 1)
mtext(“On Budget Deficit as a Percetage of GDP\nGrouped by Political Party Controling the Presidency, Senate, and House\nParty Lagged One Year”,
side = 3, cex = 1, font = 1, outer = T)
}