FITTING A SIMPLE REGRESSION MODEL WITH AUTOCORRELATED ERRORS: AN ASTROLOGICAL EXAMPLE

The assumptions of simple linear regression include the assumption that the errors are independent with constant variance. Fitting a simple regression when the errors are autocorrelated requires techniques from the field of time series. If you are interested in fitting a model to an evenly spaced series where the terms are autocorrelated, I have given below an example of fitting such a model.

The Astronomical/Astrological Hypotheses to be Tested

For some years now, I have been interested in the possible astrological influence of a composite astronomical point, the mean over the shortest arc of the celestial longitudes of the celestial bodies Juno, Ceres, Vesta, and Pallas. I wanted to see if the point affected divorce rates. In astrology, planets rule zodiacal signs. In older systems of astrology, from before the discoveries of the outer planets, each of the five known planets (Mercury, Venus, Mars, Jupiter, and Saturn – we exclude the earth) ruled two signs. As Uranus, Neptune, and Pluto were discovered, the planets were given rulership over one of Saturn’s, Jupiter’s, and Mars’s signs, respectively. In 1977, the celestial body Chiron was discovered. Some astrologers give Chiron the rulership of one of Mercury’s signs, which would leave Venus as the only planet with dual rulership.

The celestial bodies Juno, Ceres, Vesta, and Pallas were discovered in the early 1800’s. Referencing Roman mythology, some astrologers feel that the discoveries of Juno – the wife, Ceres – the mother, Vesta – the sister, and Pallas – the daughter synchronized with the emergence of the struggle for the full participation of women within society. Venus (which, from the above, is for some astrologers the only planet that still rules two signs) is one of the traditional rulers of women in astrology. The two signs that Venus rules are those associated with money and marriage. I had thought that the mean of the celestial longitudes over of the shortest arc connecting Juno, Ceres, Vesta, and Pallas might be the ruler of the sign of marriage, since Juno, Ceres, Vesta, and Pallas refer to female archetypes. Then Venus would just rule the sign of money. The mean over the shortest arc jumps 90 degrees or 180 degrees from time to time. To test the hypothesis, I had thought that the jump point of the mean over the shortest arc might be related to divorce. An ephemeris for the average, which I call Vulcan – as well as a point I call Cent, from 1900 to 2050, can be found here – https://1drv.ms/b/s!AgjNmjy3BpFDmBBTiEPHc3qPr9ls

The Data Sets: Divorce Counts for Iowa and the Jump Points

I created a data set using monthly divorce counts in Iowa from August of 2006 to November of 2009 and information about the jump points of the mean over the shortest arc of the celestial longitudes of Juno, Vesta, Ceres, and Pallas. The divorce counts were adjusted for the number of business days in a month. The adjusted divorce counts are the dependent variable. The independent variables are the usual column of ones to fit a mean and a column which linearly increases as time increases after a jump point and which is set back to zero at each jump point.

The marriage and divorce data set contains data from the Center for Disease Control and Prevention website, from the National Vital Statistics Report page. I took marriage and divorce data for Iowa from the reports from August of 2006 to November of 2009. The divorce data for April of 2008 was 161, which was unusually small. In my analyses, I replaced the number with 703, the average of the numbers on either side.

Also contained in the data set is a variable which measures an astronomical phenomena. When one takes the average of the longitudes of the four major asteroids, Juno, Ceres, Pallas, and Vesta (yes, I know Ceres is now a dwarf planet) along the shortest arc, the average longitude will jump by 90 or 180 degrees from time to time. The third variable in the data set is the time in days from the last day in which a jump occurred. Since the marriage and divorce data is monthly, each data point is assumed to be at the 15th of the month for the calculation of the third variable.

The question of interest is, are the jump points associated the divorce counts in a linear way? Since the months are not of equal length, I have adjusted the divorce counts by dividing the divorce count for a month by the number of state business days in the month and multiplying the result by the average number of state business days in a month over the observed months. The holidays I excluded are New Year’s Day, Martin Luther King Day, Memorial Day, Independence Day, Labor Day, Veteran’s Day, the two days for Thanksgiving, and Christmas Day, as these days were state holidays in Iowa in 2012. The number of state business days that I found for each month are given in the table below.

Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
2006 23 20 22 19 20
2007 21 20 22 21 22 21 21 23 19 23 19 20
2008 21 21 21 22 21 21 22 21 21 23 17 22
2009 20 20 22 22 20 22 22 21 21 22 18

Plotting the Data Sets

Since the dependent variable is a time series, one could expect autocorrelation in the error terms which may need to be taken into account. Below, I have plotted the adjusted divorce counts and the jump point variable.

time series plots of divorce counts and the jump point variable

A relationship between the two plots is not immediately evident.

Fitting a Simple Regression Model

I started by fitting a simple linear regression model. The estimated model parameters are given below.

Coefficient Standard Error
Intercept 621.2 20.2
Slope 0.06049 0.05440

Clearly, there is little reason to think that the jump points are related to the adjusted divorce counts given the simple regression model. However, since the data is time series data, there is a possibility that the errors are autocorrelated.

The Box-Jenkins Approach

Following the Box-Jenkins approach to fitting time series, I decided to start by looking at the residuals, from the adjusted divorce count model, as a stationary time series.

I generated two plots; a plot of the correlations between present and lagged variables using the R function acf() and a plot of the partial correlations between present and lagged variables using the R function pacf(). For the autocorrelation and the partial autocorrelation plots the blue lines represent the 95% level.

correlation and partial correlation plots for the divorce count residuals

Following the Box-Jenkins approach, the correlation plot will indicate if the time series is a moving average. If there are spikes in the correlation plot, the spikes indicate the orders of the moving average. An autoregressive time series will have an exponentially decaying correlation plot.

The partial correlation plot indicates the order of the process if the process is autoregressive. The order of the autoregressive process is given by the value of the lag just before the partial correlation goes to essentially zero.

A process with both moving average and autoregressive terms is hard to identify using correlation and partial correlation plots, but the process is indicated by decay that is not exponential, but starts after a few lags (see the Wikipedia page Box-Jenkins).

Looking at the above plots, the autocorrelation plot shows negative spikes at lag 1 and lag 13 and is large in absolute value at lags 3, 12, and 14, but otherwise is consistent with random noise. The partial correlation plot has negative spikes at lags 1, 2, and 13 and is large in absolute value at lag 15, indicating that the process may not be stationary. Neither plot decays.

The Periodogram and Normalized Cumulative Periodogram

Using periodograms is another way you can evaluate autocorrelation. A periodogram can find periodicities in a time series. Below are plots of the periodogram and the normalized cumulative periodogram created using spec.pgram() and cpgram() in R.  In the periodogram, there is a spikes at about the frequencies .35 and .40 that are quite large. The frequencies correspond to lags of 2.9 and 2.5, somewhat close to the lags of 2 and 3 months found in the correlation plot. The normalized cumulative periodogram shows a departure from random noise at a frequency of around .29, or a lag of around 3.4. The blue lines in the normalized cumulative periodogram plot are the 95% confidence limits for random noise.

periodogram and normalized cumulative periodogram for the divorce count residuals

Note also that at the frequency of twenty, corresponding to a lag of two, the periodogram is also high.

Using arima() in R to Compare Models

I used the function arima() in R to fit some autoregressive moving average models with orders up to thirteen, both with and without the astronomical regressor. The best model, using the Aktaike information criteria (see the Wikipedia page Aktaike information criterion), was

Yt = β0 + β1 Xt + θ1 εt-1 + θ3 εt-3 + θ13 εt-13 + εt,

where Yt is the number of divorces in month t, Xt is the number of days since the last shift in the mean of the celestial longitudes of Ceres, Pallas, Vesta, and Juno over the shortest arc, and εt is the difference between the expected number of divorces and the actual number of divorces for month t. The estimated coefficients, with the respective standard errors, are given in the table below.

Model for Adjusted Divorces in Iowa by Month
Coefficient Standard Error
βhat0 624 10.5
βhat1 0.054 0.031
θhat1 -0.682 0.274
θhat3 0.386 0.180
θhat13 -0.598 0.227

The coefficient for the slope of the regression is not at least 1.96 times the size of the standard errors of the coefficient, however, the Aktaike information criteria is the smallest for the models at which I looked. The model without the coefficient is a close contender, with the Aktaike information for the two models being 453.07 and 453.85. Below is a plot of the residuals for the two competing models. To me the residuals look more even for the model given above.

residual plot for two models

The estimates indicate that the intercept is about 624 divorces; that the number of divorces increases by about 5.4 for every one hundred day increase in the time after the midpoint over the shortest arc of the longitudes of Juno, Ceres, Vesta, and Pallas shifts by 90 or 180 degrees; that errors tend to reverse in direction over one lag and thirteen lags, but tend to be in the same direction over three lags.

Some Astrological Observations

The results are interesting from an astrological point of view. Astrological signs are divided into groups by order. The first grouping is by alternating order, with the first sign (Aries) positive and the next sign negative, the third positive, and so on through the circle. The second grouping is in groups of three’s, called the elements – fire, earth, air, and water. The first, fifth, and ninth signs are fire signs, and so on through the circle. The third grouping is in groups of four, called the qualities – cardinal, fixed, and mutable. The first, fourth, seventh, and tenth signs are cardinal signs, and so on through the circle. The months, though I am quite certain that the months were originally concurrent with the zodiac signs, overlap signs. About 2/3 of one sign and 1/3 of the next sign are in each month. The adjusted counts are modeled and used in what follows

Averaging the counts for the February’s, April’s, June’s, August’s, October’s, and December’s, months that are 2/3 positive and 1/3 negative, the result is 616.7 divorces with an estimated standard error of 15.8. Averaging the adjusted counts for the January’s, March’s, May’s, July’s, September’s, and November’s, the months that are 2/3 negative and 1/3 positive, the result is 661.5 divorces with an estimated standard error of 15.8. The difference between the two groupings is -44.9 with an estimated standard error of 25.8. The difference between the counts is significant at the 0.10 level for a two-sided test. Positive months have fewer divorces than negative months in this data set – not what I would expect since the positive signs are associated with action and the negative signs with reaction.

Averaging the adjusted counts for the January’s, April’s, July’s, and October’s, months that are 2/3 cardinal and 1/3 fixed, the result is 627.3 divorces with an estimated standard error of 22.3. Averaging the counts for the February’s, May’s, August’s, and November’s, months that are 2/3 fixed and 1/3 mutable, the result is 611.0 divorces with an estimated standard error of 21.8. Averaging the counts for the March’s, June’s, September’s, and November’s, months that are 2/3 mutable and 1/3 cardinal, the result is 681.2 divorces with an estimated standard error of 22.8.

Looking at the three differences, between the first group and the second group – mainly cardinal versus mainly fixed – the difference is 16.3 with an estimated standard error of 34.4, between the first group and the third group – mainly cardinal versus mainly mutable – there is a difference of -53.9 with an estimated standard error of 35.8, and between the second group and the third group – mainly fixed versus mainly mutable – there is a difference of -70.2 with an estimated standard error of 35.3. The second and third tests are significant at the 0.15 level for two sided tests. The first test is not significant. The test results are in agreement with what one would expect from astrology. Cardinal and mutable signs are associated with change and adjustment, while fixed signs are associated with steadfastness.

The negative direction of the lag of 13, could be explained by the cycles of the Moon. There are approximately 13 Moon cycles per year, so for a given month, the Moon will start in a positive sign one year and in a negative sign the next, over reasonably short time periods. The estimated standard error for the means over the 13 lags is 37.0 except for the mean starting with the first observation, which has an estimated standard error of 30.0. There are 13 means from the 40 observations.

Some Concluding Diagnostic Plots

Below are some diagnostic plots, the plot of the residuals versus the fitted values, the Q-Q plot of the residuals versus normal variates, and a time series plot of the actual values and the fitted values.

residual vs fitted, QQ, and data with fitted plots

The residual versus fitted value plot looks like there is some correlation between the residuals and the fitted values. Testing the correlation gives a p-value of 0.0112, so the model is not ideal. While the model does not pick up most of the extreme data points, the residual and QQ normal plots do not show any great deviations from a model of normal errors with constant variance. The fitted values were found by subtracting the residuals from the actual data points.

The R Code

# First enter monthly divorce counts for Iowa from 8/06 to 11/09

div.ts = ts(c(619, 704, 660, 533, 683, 778, 541, 587, 720, 609, 622, 602, 
        639, 599, 657, 687, 646, 608, 571, 768, 703, 638, 743, 693, 
        624, 694, 660, 522, 639, 613, 446, 863, 524, 564, 716, 635, 
        649, 565, 567, 626), start=c(2006,8), freq=12)

# next enter the break point information: the number of days from the 
# last shift

vulcan.ts = ts(c(63, 94, 124, 155, 185, 216, 247, 275, 306, 336, 367, 397, 
                 428, 459, 489, 520, 550, 581, 612, 641, 672, 702, 733, 
                 763, 19, 50, 80, 4, 34, 65, 96, 124, 155, 185, 216, 246, 
                 277, 308, 3,  34), start=c(2006,8), freq=12)


# there are 1218 days in the 40 months.  the divorce counts are adjusted to
# reflect the number of days tha the county offices are open.

# first create a vector of days of the week, with 1 for Sunday, 
# 2 for Monday, etc.  8/06 started on a Tuesday.

wks = c(3:7, rep(1:7, 173), 1:2)

# next create a daily vector for the months, with 1 for for the first month, 
# 2 for the second month, etc.

mns = c(8:12, rep(1:12,2), 1:11)
mns.d = 1:40
mn =  c(31,28,31,30,31,30,31,31,30,31,30,31)
mns.dys = rep(mns.d,mn[mns])
mns.dys = c(mns.dys[1:553], 19, mns.dys[554:1217]) # add 2008 leap day

# next remove Sundays and Saturdays

mns.dys.r = mns.dys[wks>1 & wks<7]

# next use table() to count the weekdays in each month

mns.dys.s = c(table(mns.dys.r))

# next enter the number of government holidays in each of the 40 months

hol = c(0, 1, 0, 3, 1, 2, 0, 0, 0, 1, 0, 1, 0, 1, 0, 3, 1, 2, 0, 0, 0, 
        1, 0, 1, 0, 1, 0, 3, 1, 2, 0, 0, 0, 1, 0, 1, 0, 1, 0, 3)

# next subtract out the holidays

mns.dys.s = mns.dys.s - hol

# next adjust the divorce counts

div.a.ts = div.ts * mean(mns.dys.s)/mns.dys.s

# next plot the two data vectors

data.d.v = cbind("jump point variable"=vulcan.ts, 
                      "adjusted divorce count"=div.a.ts)
plot(data.d.v, main="Adjusted Divorce Counts in Iowa with the Jump Point Variable
August 2006 to November 2009")

# next fit the simple linear regression

mod.lm = lm(div.a.ts~vulcan.ts)

summary(mod.lm)

# plot the autocorrelations and the partial autocorrelations 
# of the residuals from the model

acf(mod.lm$residuals, main="autocorrelation of the residuals
fitting divorce counts on Ceres, Juno, Vesta, and Pallas breaks")

pacf(mod.lm$residuals, main="partial autocorrelation of the residuals
fitting divorce counts on Ceres, Juno, Vesta, and Pallas breaks")

# plot the periodogram and cumulative periodogram of the residuals from 
# the model

spec.pgram(mod.lm$residuals, main="Residuals from Simple Regression
Raw Periodogram")

cpgram(mod.lm$residuals, main="Residuals from Simple Regression
Cumulative Periodogram")

# here is a function I created to compare models.  the function is set up 
# to only do moving averages.  It takes awhile  to run with or=13.

arima.comp=function(ts=div.a.ts, phi=vulcan.ts, or=3) {
  
  phi <- as.matrix(phi)
  if (nrow(phi)==1) phi <- t(phi)
  ind <- 2*or+ncol(phi)
  fx.lst <- list(0:1)
  #for (i in 2:(2*or)) fx.lst <- c(fx.lst,list(0:1))
  for (i in 2:or) fx.lst <- c(fx.lst,list(0:1))
  
  fx.mat <- as.matrix(expand.grid(fx.lst))
  #fx.phi1 <- matrix(NA, 2^(2*or), 1+ncol(phi))
  fx.phi1 <- matrix(NA, 2^or, 1+ncol(phi))
  
  fx.mat[fx.mat==1] <- NA
  fx.mat1 <- cbind(fx.mat,fx.phi1)
  fx.mat2 <- cbind(fx.mat,fx.phi1[,1])
  
  #res1 <- numeric(2^(2*or))
  #res2 <- numeric(2^(2*or))
  
  res1 <- numeric(2^or)
  res2 <- numeric(2^or)
  
  #for (i in 1:2^(2*or)) {
  for (i in 1:2^or) {
    
    #res1[i] <- arima(ts, order=c(or,0,or), xreg=phi, fixed=fx.mat1[i,])$aic
    #res2[i] <- arima(ts, order=c(or,0,or), fixed=fx.mat2[i,])$aic
    
    res1[i] <- arima(ts, order=c(0,0,or), xreg=phi, fixed=fx.mat1[i,])$aic
    res2[i] <- arima(ts, order=c(0,0,or), fixed=fx.mat2[i,])$aic
    
  }
  
  st1 <- order(res1)
  st1 <- st1[1:min(length(st1),10)]
  print(res1[st1])
  print(fx.mat1[st1,])
  
  st2 <- order(res2)
  st2 <- st2[1:min(length(st2),10)]
  print(res2[st2])
  print(fx.mat2[st2,])
  
}



# the final model was fit using arima().  in fixed, an NA says the 
# parameter is fit and a 0 says no parameter is fit.  in fixed there
# is a value for every value indicated by order and xreg. plus the 
# intercept.  so there are 13 moving average terms, three of which 
# are fit, and there is an intercept and one regressior term, 
# for 15 values in fixed.

mod.arima.r.1 = arima(div.a.ts, order=c(0,0,13), 
                    fixed=c(NA,0,NA,rep(0,9),NA,NA,NA), xreg=vulcan.ts) 

mod.arima.r.1

mod.arima.r.2 = arima(div.a.ts, order=c(0,0,13), 
                      fixed=c(NA,0,NA,rep(0,9),NA,NA))
mod.arima.r.2

# next the residuals from the two models are plottted together

ts.plot(cbind(mod.arima.r.1$residuals, mod.arima.r.2$residuals), col=4:3,
        xlab="Time", ylab="Residuals",
        main="Residuals for Two Moving Average Models")
legend("topright", c("With Regression Coefficient", 
                     "Without Regression Coefficient"), fill=4:3, cex=.7)

# next the estimated covariance matrix for the 40 observations is found,
# using the adjusted counts

cov.1 = matrix(0,40,40)
diag(cov.ua)= sum(c(1,mod.arima.r.1$coef[1:13]^2))

for (i in 1:12) {
  cov.1[cbind(c(1,(i+1)),c(i+1,1))]=
    sum(c(1,mod.arima.r.1$coef[1:(13-i)])*mod.arima.r.1$coef[i:13])
}
cov.1[cbind(c(1,14),c(14,1))] = mod.arima.r.1$coef[13]

for (i in 2:14) {
  cov.1[(i-1)+2:14, i] = cov.1[2:14, 1]
  cov.1[i, (i-1)+2:14] = cov.1[2:14, 1]
}

cov.1 = cov.1*mod.arima.r.1$sigma2

# next the average counts for sets of months along with the standard errors
# t.m is the matrix of multipliers for finding means and standard errors
# the rows of t are positive signs, negative signs, cardinal signs, fixed
# signs, and mutable signs

t.m = matrix(0,5,40)
t.m[1,] = rep(1:0, 20)
t.m[2,] = rep(0:1, 20)
t.m[3,] = c(rep(c(0,0,1),13),0)
t.m[4,] = c(rep(c(1,0,0),13),1)
t.m[5,] = c(rep(c(0,1,0),13),0)

m.m = t.m%*%div.a.ts/apply(t.m,1,sum)
s.m = sqrt(diag(t.m%*%cov.1%*%t(t.m)))/apply(t.m,1,sum)
           
m.m # the five means
s.m # the five standard errors

# next the four tests are done, with the same structure as in the last part
# the tests are positive vs negative, cardinal vs fixed, cardinal vs 
# mutable, and fixed vs mutable

t.t = matrix(0,4,40)
t.t[1,] = rep(c(1/20,-1/20),20)
t.t[2,] = c(rep(c(0,-1/13,1/13), 13), 0)
t.t[3,] = c(rep(c(-1/14,0,1/13), 13),-1/14)
t.t[4,] = c(rep(c(1/14,-1/13,0), 13), 1/14)

m.t = t.t%*%div.a.ts
s.t = sqrt(diag(t.t%*%cov.1%*%t(t.t)))

z.sc = m.t/s.t
p.z.sc = pnorm(z.sc)
m.t # the values of the four test statistics
s.t # the estimated standard errors for the four test statistics

z.sc # the z scores
p.z.sc # the p-values of the z scores

# next, the means and standard errors for the 13 lag means are found

t.14 = c(1,1+seq(13,40,13))
t.13 = c(2,2+seq(13,27,13))

m.13 = matrix(0,13,40)
m.13[1,t.14] = 1/4
for (i in 2:13) m.13[i,t.13+(i-2)] = 1/3

mn.13 = m.13 %*% div.a.ts
sd.13 = sqrt(diag(m.13 %*% cov.1 %*% t(m.13)))

mn.13
sd.13

# next, the diagnostic plots are plotted

par(mfrow=c(1,3))

plot(as.numeric(div.a.ts-mod.arima.r.1$residuals), 
     as.numeric(mod.arima.r.1$residuals),
     main="Plot of Residuals versus Fitted Values
for the Adjusted Divorce Data", xlab="Fitted Value", ylab="Residual", 
     font.main=1)

qqnorm(mod.arima.r.1$residuals, main="QQ Normal Plot for
Residuals from the Divorce Model", font.main=1)

ts.plot(div.a.ts,div.a.ts-mod.arima.r.1$residuals,
        ylab="Counts", xlab="Year",
        main="Adjusted Divorce Counts and Fitted Counts", 
        col=4:3)
legend("bottomleft", c("Adjusted Divorce Counts",
       "Fitted Adjusted Divorce Counts"), col=4:3, lwd=2, cex=.7)

# last, the correlation between the fitted values and the 
# residuals are found

cor.test(div.a.ts-mod.arima.r.1$residuals, mod.arima.r.1$residuals)


Advertisement

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s