You can use log-linear models to fit cross-classified categorical data. In fitting log-linear models, the logarithms of the expected counts in the cells of the contingency table are modeled as linear combinations of parameters, where there can be parameters for interactions between dimensions in the table. I follow the approach of Bishop, Fienberg, and Holland in Discrete Multivariate Analysis: Theory and Practice, MIT Press, 1975.
Some terminology is in order. A hierarchical model is a model in which, if a parameter is not in the model, then neither are the higher-level interactions associated with the parameter. A complete model is a model in which none of the expected cell counts are equal to zero. An incomplete model has at least one cell where the expected cell count is zero. A saturated model is a model in which all possible parameters are fit. An unsaturated model is a model in which some of the parameters are set to zero. In Bishop et al., each dimension of the table is called a factor and only hierarchical models are fit, which can be either complete (chapter 3) or incomplete (chapter 5). The parameters are constrained so that the sum over the indices for a given parameter and factor equals zero (except, of course, the first parameter, ‘u’, which has no index).
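The definition of a hierarchical model can be checked mechanically. The sketch below is only an illustration: the function name `is_hierarchical` and the convention of writing a term as a string of factor labels (such as "13") are my own assumptions, not notation from Bishop et al.

```python
from itertools import combinations

def is_hierarchical(terms):
    """Return True if every lower-order relative of each term is also present.

    `terms` is an iterable of strings such as "1", "13" or "123", where each
    character names one factor (a hypothetical encoding). The constant term
    u is taken to be always present.
    """
    present = {frozenset(t) for t in terms}
    for term in present:
        # Every proper, non-empty subset of the factors in a term must be a term.
        for size in range(1, len(term)):
            for sub in combinations(sorted(term), size):
                if frozenset(sub) not in present:
                    return False
    return True

print(is_hierarchical(["1", "2", "3", "13", "23"]))         # True
print(is_hierarchical(["1", "2", "3", "13", "23", "123"]))  # False: "12" is missing
```

The second call fails because the three-factor term is present while one of its two-factor relatives is not.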
The Saturated Model
Following the notation of Bishop et al., for a three-dimensional contingency table, the saturated model is
$$\log m_{ijk} = u + u_{1(i)} + u_{2(j)} + u_{3(k)} + u_{12(ij)} + u_{13(ik)} + u_{23(jk)} + u_{123(ijk)},$$
where $m_{ijk}$ is the expected cell count in cell (i, j, k).
Say that we have two levels for each of the three factors. Then the indices associated with the factors, ‘i’, ‘j’, and ‘k’, can each take on either the value 1 or the value 2. For this example, since there are only two levels for each factor and the parameters sum to zero over each index, specifying one parameter (say $u_{12(11)}$) for a given set of factors (here, the interaction between factors 1 and 2) means that the other parameters for that set of factors (in this case, $u_{12(12)}$, $u_{12(21)}$, and $u_{12(22)}$) are specified too. It follows that, for this example, there are eight unique parameters in the saturated model: one for each factor combination (‘1’, ‘2’, ‘3’, ‘12’, ‘13’, ‘23’, and ‘123’) and one for the unindexed parameter ‘u’.
There are eight possible unique sets of values for the three indices ‘i’, ‘j’, and ‘k’ ((1,1,1), (1,1,2), … , (2,2,2)), representing the eight cells in the table. If ‘i’, ‘j’ and ‘k’ are given, so are ‘ij’, ‘ik’, ‘jk’, and ‘ijk’. This means that the eight estimated cell counts are uniquely defined. In the model, there are eight unique parameters, and there are eight cell counts. It follows that the saturated model fits the data exactly – the estimates of the expected cell counts are the cell counts themselves.
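The counting argument above can be illustrated numerically. The sketch below uses a made-up 2 x 2 x 2 table (the counts are hypothetical, not data from this post) and recovers each u-term from the log counts under the sum-to-zero constraints, using averages of the log counts combined by inclusion-exclusion. The helper names `mean_log` and `u_term` are my own. Adding the eight terms back together should return the observed counts.

```python
import math
from itertools import combinations, product

# Hypothetical 2 x 2 x 2 table of counts, made up for illustration.
counts = {
    (1, 1, 1): 10, (1, 1, 2): 20, (1, 2, 1): 15, (1, 2, 2): 5,
    (2, 1, 1): 30, (2, 1, 2): 12, (2, 2, 1): 8, (2, 2, 2): 25,
}
log_m = {cell: math.log(x) for cell, x in counts.items()}

def mean_log(fixed):
    """Average log count over the cells that agree with `fixed`,
    a dict mapping an axis position (0, 1, 2) to a level."""
    vals = [v for cell, v in log_m.items()
            if all(cell[a] == lev for a, lev in fixed.items())]
    return sum(vals) / len(vals)

def u_term(axes, cell):
    """u-term for the factor subset `axes`, at the levels taken from `cell`,
    under sum-to-zero constraints (inclusion-exclusion over sub-subsets)."""
    total = 0.0
    for r in range(len(axes) + 1):
        for sub in combinations(axes, r):
            sign = (-1) ** (len(axes) - len(sub))
            total += sign * mean_log({a: cell[a] for a in sub})
    return total

# The eight factor subsets: (), (0,), (1,), (2,), (0,1), (0,2), (1,2), (0,1,2).
subsets = [s for r in range(4) for s in combinations(range(3), r)]

for cell in product((1, 2), repeat=3):
    fitted = math.exp(sum(u_term(s, cell) for s in subsets))
    print(cell, counts[cell], round(fitted, 6))
```

Because the two levels of each index sum to zero, `u_term((0, 1), (1, 1, 1))` and `u_term((0, 1), (1, 2, 1))` differ only in sign, which is the sense in which one value per factor combination fixes the rest.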
Unsaturated Models
The saturated model is not usually of interest. For unsaturated models, in order to estimate the expected cell counts, Bishop et al. model the data in the cells either as independent Poisson variables or as variables from a single multinomial distribution, and find minimal sufficient statistics for the expected cell counts under each of the two probability distributions. Under the assumptions, the minimal sufficient statistics are the same for either distribution.
An example of an unsaturated model based on the saturated model given above is
$$\log m_{ijk} = u + u_{1(i)} + u_{2(j)} + u_{3(k)} + u_{13(ik)} + u_{23(jk)}.$$
Here, the main effects of each of the factors are included. (In Bishop et al., the main effects are always included; they call such models comprehensive models.) The only higher-order effects still present are the interactions between the first and third factors and between the second and third factors. Note that, since the $u_{12(ij)}$ term is not in the model, the term $u_{123(ijk)}$ cannot be either if the model is to be hierarchical. For an unsaturated model, Bishop et al. give an algorithm to find maximum likelihood estimators for the expected cell counts using the minimal sufficient statistics of the given unsaturated model. The maximum likelihood estimators do not depend on which probability distribution is used. (The results that I give below are based on an R function that I wrote to apply the algorithm.)
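To give a feel for how such an algorithm can work, here is a minimal sketch of iterative proportional fitting, which I am assuming is the kind of algorithm meant; it is not the R function mentioned above. The counts are hypothetical, the function name `ipf` and its arguments are my own, and the sufficient statistics are taken to be the observed marginal tables for factors (1, 3) and (2, 3), matching the example model.

```python
from itertools import product

def ipf(counts, margins, shape, n_iter=50, tol=1e-10):
    """Fit expected counts whose listed margins match the observed margins.

    counts  : dict mapping 0-based cell tuples to observed counts
    margins : tuple of axis tuples, e.g. ((0, 2), (1, 2)) for terms 13 and 23
    shape   : number of levels of each factor
    Assumes every fitted margin is positive (no division-by-zero handling).
    """
    cells = list(product(*(range(n) for n in shape)))
    fitted = {c: 1.0 for c in cells}

    def marginal(table, axes):
        out = {}
        for c, v in table.items():
            key = tuple(c[a] for a in axes)
            out[key] = out.get(key, 0.0) + v
        return out

    observed = {axes: marginal(counts, axes) for axes in margins}
    for _ in range(n_iter):
        max_change = 0.0
        for axes in margins:
            current = marginal(fitted, axes)
            for c in cells:
                key = tuple(c[a] for a in axes)
                # Rescale the cell so this fitted margin equals the observed one.
                new = fitted[c] * observed[axes][key] / current[key]
                max_change = max(max_change, abs(new - fitted[c]))
                fitted[c] = new
        if max_change < tol:
            break
    return fitted

# Hypothetical 2 x 2 x 2 counts, made up for illustration.
counts = {
    (0, 0, 0): 10, (0, 0, 1): 20, (0, 1, 0): 15, (0, 1, 1): 5,
    (1, 0, 0): 30, (1, 0, 1): 12, (1, 1, 0): 8, (1, 1, 1): 25,
}
fitted = ipf(counts, margins=((0, 2), (1, 2)), shape=(2, 2, 2))
for cell in sorted(fitted):
    print(cell, round(fitted[cell], 4))
```

For this particular model the fitted values also have a closed form, the product of the two observed margins divided by the margin for factor 3 alone, which gives a convenient check on the iteration.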
Evaluating an Unsaturated Model
Bishop et al. evaluate the fit of an unsaturated model by referring either of the following two statistics to a chi-squared distribution:
$$X^2 = \sum_i \frac{(x_i - \hat{m}_i)^2}{\hat{m}_i}$$
$$G^2 = -2 \sum_i x_i \log\left(\frac{\hat{m}_i}{x_i}\right),$$
where $x_i$ is the cell count in cell i, $\hat{m}_i$ is the estimated expected cell count in cell i, and i goes over all of the cells in the array for which $\hat{m}_i$ does not equal zero. $X^2$ is the Pearson chi-squared statistic and $G^2$ is minus twice the log likelihood ratio. $X^2$ and $G^2$ are goodness-of-fit statistics. A model fits if the statistic falls in the ‘normal’ range for a chi-squared distribution with the appropriate degrees of freedom. Bishop et al. look only at the right tail of the appropriate chi-squared distribution, so models that fit the data too well are still accepted. Among the models considered, the most parsimonious model that still fits the data is accepted.
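As a small worked example, the two statistics can be computed as follows. The observed and fitted values are made up, the function names are my own, and I am assuming the usual convention that a cell with an observed count of zero contributes nothing to the likelihood ratio statistic.

```python
import math

def pearson_x2(observed, fitted):
    """Pearson chi-squared statistic over the cells with nonzero fitted count."""
    return sum((x - m) ** 2 / m for x, m in zip(observed, fitted) if m != 0)

def likelihood_ratio_g2(observed, fitted):
    """Minus twice the log likelihood ratio over the cells with nonzero fitted
    count. Cells with x = 0 are skipped (assumed convention: 0 * log = 0)."""
    return sum(-2.0 * x * math.log(m / x)
               for x, m in zip(observed, fitted) if m != 0 and x != 0)

# Hypothetical observed counts and fitted values, for illustration only.
observed = [10, 20, 30, 40]
fitted = [15.0, 15.0, 35.0, 35.0]

x2 = pearson_x2(observed, fitted)
g2 = likelihood_ratio_g2(observed, fitted)
print(round(x2, 4), round(g2, 4))
```

When the fitted values equal the observed counts, as in the saturated model, both statistics are zero.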
Fitting a Loglinear Model to the Deficit by Political Party Data Set
To demonstrate fitting a log-linear model, I will use the data in the Deficit by Political Party Data Set. The deficits (surpluses) as a percentage of the gross domestic product are grouped into four classes, and the political parties in control are associated with the deficit (surplus) of the following year. The party data run from 1946 to 2007 and the deficits (surpluses) as a percentage of gross domestic product run from 1947 to 2008. The data array is given below.
Since the interactions between the deficit (surplus) and the parties in control are of interest, and the senate and house seem to be interrelated, the following models were fit.
1: $\log m_{ijkl} = u + u_1 + u_2 + u_3 + u_4 + u_{12} + u_{13} + u_{14} + u_{23} + u_{24} + u_{34}$
2: $\log m_{ijkl} = u + u_1 + u_2 + u_3 + u_4 + u_{12} + u_{13} + u_{14} + u_{34}$
3: $\log m_{ijkl} = u + u_1 + u_2 + u_3 + u_4 + u_{12} + u_{13} + u_{14}$
4: $\log m_{ijkl} = u + u_1 + u_2 + u_3 + u_4 + u_{12} + u_{14} + u_{34}$
5: $\log m_{ijkl} = u + u_1 + u_2 + u_3 + u_4 + u_{12} + u_{14}$
6: $\log m_{ijkl} = u + u_1 + u_2 + u_3 + u_4$
The subscript 1 refers to the deficit (surplus) classes and takes on 4 levels, the subscript 2 refers to the party of the president and takes on 2 levels, the subscript 3 refers to the party controlling the senate and takes on 2 levels, and the subscript 4 refers to the party controlling the house and takes on 2 levels. The $G^2$ values found by fitting the six models, as well as the degrees of freedom for and the p-values from the chi-squared tests, are given in the table below.
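The degrees of freedom for each model can be counted as the number of cells minus the number of independent parameters. The sketch below does this count under the assumption that the table is complete, with no zero fitted cells; if some fitted cells are zero, the degrees of freedom have to be adjusted, so the values here are not guaranteed to match the ones I report. The term encoding and the names `levels`, `models`, and `n_params` are my own.

```python
from math import prod

# Number of levels of each factor: 1 = deficit class, 2 = president,
# 3 = senate, 4 = house.
levels = {"1": 4, "2": 2, "3": 2, "4": 2}

main = ["1", "2", "3", "4"]
models = {
    1: main + ["12", "13", "14", "23", "24", "34"],
    2: main + ["12", "13", "14", "34"],
    3: main + ["12", "13", "14"],
    4: main + ["12", "14", "34"],
    5: main + ["12", "14"],
    6: main,
}

def n_params(terms):
    """Independent parameters under sum-to-zero constraints: 1 for u, plus the
    product of (levels - 1) over the factors of each term."""
    return 1 + sum(prod(levels[f] - 1 for f in term) for term in terms)

n_cells = prod(levels.values())  # 4 * 2 * 2 * 2 = 32 cells
df = {k: n_cells - n_params(terms) for k, terms in models.items()}
for k in sorted(df):
    print(k, df[k])
```

Under this assumption, the differences in degrees of freedom between models 1 and 2, 2 and 4, and 4 and 5 come out as 2, 3, and 1.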
From the table, models 1, 2, 4, and 5 all fit the data. Chi-squared tests to compare the models are given in the table below.
| Models | Difference in $G^2$ | Difference in df | p-value |
| 1 & 2 | 1.69 | 2 | 0.4296 |
| 2 & 4 | 1.36 | 3 | 0.7149 |
| 4 & 5 | 5.28 | 1 | 0.0216 |
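The p-values in the table are upper-tail probabilities of the chi-squared distribution, evaluated at the difference in the likelihood ratio statistic with the difference in degrees of freedom. As a rough check, the sketch below recomputes them with the Python standard library only. The helper `chi2_sf` is my own, handles positive integer degrees of freedom only, and is not the routine used to produce the table.

```python
import math

def chi2_sf(x, df):
    """Upper-tail probability of a central chi-squared variable with a
    positive integer number of degrees of freedom `df`."""
    if x <= 0:
        return 1.0
    half = x / 2.0
    # Closed forms for df = 1 and df = 2, then step up two df at a time.
    if df % 2 == 1:
        q, k = math.erfc(math.sqrt(half)), 1
    else:
        q, k = math.exp(-half), 2
    while k < df:
        q += half ** (k / 2.0) * math.exp(-half) / math.gamma(k / 2.0 + 1.0)
        k += 2
    return q

# (models compared, difference in G^2, difference in df), from the table above.
comparisons = [("1 & 2", 1.69, 2), ("2 & 4", 1.36, 3), ("4 & 5", 5.28, 1)]
p_values = {name: chi2_sf(diff, df) for name, diff, df in comparisons}
for name, p in p_values.items():
    print(name, round(p, 4))
```

The printed values should agree with the p-value column up to rounding.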
There is little difference between models 1 and 2, or between models 2 and 4, so out of models 1, 2, and 4, the most parsimonious model is model 4. Model 5 is more parsimonious than model 4; however, the test of the difference between the two models gives a value that is significant when referred to the central chi-squared distribution with 1 degree of freedom (p = 0.0216). It appears that something is lost when going from model 4 to model 5. We conclude that model 4 is the best model and that the interaction between the house and senate, the interaction between the deficit level and the party of the president, and the interaction between the deficit level and the party controlling the house are important, and that the interaction between the party controlling the senate and the deficit level is not.
Below is a table of the fitted cells, normalized to sum to 100 across rows, for model 4.
Democratic presidents perform much better than Republican presidents. With regard to Republican presidents, there is no difference between Democratic and Republican senates. Republican houses perform somewhat better than Democratic houses under both Democratic and Republican presidencies.