I Am Back

Sorry for my long absence. I had some health problems. I will get back to the economic problem soon.

Storms

Here is a plot of the costs of weather disasters in the USA since 1980 – data from the NOAA website, https://www.ncdc.noaa.gov/billions/time-series . These are the disasters costing more than $1,000,000,000. I fit the exponential trend by fitting a linear trend to the log of the cost data plus 1 (since there was a year with no billion dollar storms and the log of zero is minus infinity, while the log of 1 is zero), then taking the exponential of the fitted line and subtracting 1 from the result. I find the result scary, particularly with the costs of 2005 and 2017. These costs are inflation adjusted.

The Greenhouse Effect

Here is a plot that I did last year of a time series model, where I fit temperature anomaly to sunspot numbers, carbon dioxide levels, and the ENSO index.

Here is the R output.

Call:

arima(x = temp.anomaly.7.71.to.11.13, 
          order = c(3, 0, 0), 
          seasonal = c(2, 0, 0), 
          xreg = xmat)
Coefficients:
       ar1   ar2    ar3    sar1   sar2  intercept enso co2   sunspot
     0.4738 0.1751 0.0954 0.1187 0.1391 -2.9108 0.0563 0.0089 5e-04
s.e. 0.0241 0.0265 0.0242 0.0246 0.0243  0.1573 0.0074 0.0005 2e-04
sigma^2 estimated as 0.01281
log likelihood = 1297.64 
aic = -2575.27

The temperature anomaly is from Berkeley Earth of monthly global average temperature data for land and ocean, http://berkeleyearth.org/data. The temperature series goes from July 1871 to November 2013.

The data for the sunspots come from the WDC-SILSO at the Royal Observatory of Belgium, Brussels, http://sidc.oma.be/silso/datafiles, where the data is the previous data from the site (the data changed in July of 2015). The sunspot series goes from April 1871 to August 2013.

The data for the CO2 level is from the NOAA website, The CO2 level has been smoothed by a twelve month moving average and some work went into combining the two series. See https://vanwardstat.wordpress.com/2015/08/14/fitting-sunspot-numbers-and-co2-concentrations-to-temperature-anomalies/ The data go from July 1871 to November 2013.

The index for the ENSO, at NOAA, has been averaged for the years for which the two data sets overlap and the resulting series has been shifted to the right by one half month using the average of two consecutive values. The data are at https://www.esrl.noaa.gov/psd/enso/mei.ext/table.ext.html and https://www.esrl.noaa.gov/psd/enso/mei/table.html. The seasonal component of the ENSO index was removed with a twelve month moving average. The data run from July 1871 to November 2013.

This is a quick and dirty regression. That the model fits so well is remarkable. Maybe someone who is getting paid for the work will do a more careful job.

Advertisements

Wages and Inflation 1964 to 2017

Many years ago, in the mid 1970’s, I was living during a time of high inflation and energy price shocks.  At that time, I thought that the energy shocks and inflation were causing real wage income to decrease, by way of inflation, and that the distribution of wealth was shifting in such a way that a person’s control over their income picked the winners and losers.

I did some research on the topic for my creative component at Iowa State in 1985 – for my master’s degree – but the subject has lain fallow in my mind since then, with my having made occasional forays into the data.  This the the beginning of looking at the problem again. Below are CPI-All Urban Consumers (Current Series),  https://data.bls.gov/timeseries/CUUR0000SA0 , and three sets of data from https://www.bls.gov/webapps/legacy/cesbtab8.htm  – bullet list 8, giving average hourly wages for all non-agricultural private workers, goods producers, and service providers.  All the datasets contain monthly figures from January 1964 to October 2017 and are on the Bureau of Labor Statistics website.

inflation and average adjusted wages 1964 to 2017

From the plots, real income was increasing strongly up until around the 1973 Oil Embargo, dropped, rose, fell until the mid to late 1990’s and have been on an upward trend since then, except for the last few years.  Also, producers of goods make more money than service providers, on average.

More on this next month.

Principal Component Analysis: An Example

Principal component analysis is a technique from multivariate analysis for data sets with numeric (as opposed to categorical) variables. If you are interest in how variables in a data set relate or you are dealing with multicollinearity, you would probably find principle component analysis useful. According to the Wikipedia entry on principal component analysis, principal component analysis was invented by Karl Pearson in 1901. The purpose of principal component analysis is to reduce the number of dimensions of a set of measurements.

What Principal Component Analysis Involves

Say we have a data set of four measurements on 62 observations. The data set then has four dimensions in the columns. Using principal component analysis, we can find four vectors composed of weighted linear combinations of the original four measurements which contain the same information as in the original four measurement vectors, however, which combinations are linearly independent. (If a set of vectors are linearly independent, then the cross-product matrix of the vectors is diagonal.) Usually, the first few linear combinations explain most of the variation in the original data set. Say, in our example, that the first two linear combinations explain 85% of the variation in the data set. Then, we could approximate the original data using the first two vectors. The vectors are called either the first two principal components or the scores of the first two principal components. The weights used to find the scores are called the loadings of the principal components.

The principal components can be used in regression analysis. In the example above, the first two principal components could be used in place of the four original measurements to get rid of problems from multicollinearity. The principal components can also be used in cluster analysis. By plotting the first two scores against each other and labeling the points with the row labels, one can see if different observations cluster together. By plotting the first two loadings against each other, we can see if there is clustering within the measurement variables. The principal components can be used to see underlying relationships within the data. The loadings can have meaning in terms of understanding the data.

The Mathematics Behind Principal Component Analysis

Given a data set, the loadings are the eigenvectors of either the variance-covariance matrix or the correlation matrix of the data set. The two methods do not give the same result. Usually, the correlation matrix is used because in the correlation matrix the variables are normalized for size and spread. The scores are found by multiplying the original data – the normalized data if the correlation matrix is used – on the right by the eigenvectors. In the matrix of eigenvectors, the eigenvectors are in the columns.

Eigenvalues have an important role in principal component analysis. Eigenvectors are sorted by the sizes of the corresponding eigenvalues, from largest to smallest. An eigenvalue using a correlation matrix is proportional to the proportion of the variation in the data set explained by the principal component associated with the eigenvalue. For a full rank correlation matrix, in most programs, the sum of the eigenvalues equals the dimension of the correlation matrix, which equals the number of columns in the data set. The magnitudes of the elements of the eigenvectors are relative with respect to the magnitudes of the eigenvalues. The magnitudes of either, separately, are not determinate. In the above explanation, we assume the eigenvectors are normalized such that the inner product of each eigenvector equals one, which is usual.

An Example Using the Deficit by Political Party Data Set

We will use the data in The Deficit by Political Party Data Set* to demonstrate the power of principal component analysis as a clustering tool. We look at the size of the on budget deficit (-) (surplus (+)) as a percentage of the gross domestic product using the political affiliation of the President and the controlling parties of the Senate and the House of Representatives. Since the budget is created in the year before the budget is in effect, the deficit (-) (surplus (+)) percentages are associated with the political parties in power the year before the end of budget year. We will only look at years between the extreme deficits of the World War II years and the current extreme deficits.

In order to code the categorical values for political party numerically, we let Democrats take in the value of ‘1’ and Republicans take on the value of ‘-1’. Our data set has four variables and 62 observations. The variables are the deficit (-) (surplus (+)) as a percentage of the gross domestic product for the years 1947 to 2008, the political party of the President for the years 1946 to 2007, the political party controlling the Senate for the years 1946 to 2007, and the political party controlling the House for the years 1946 to 2007. Below are two plots of the rotation of the four variable dimensions projected onto the two dimensional plane using the first two eigenvectors, each multiplied by the square root of the eigenvector’s eigenvalue, one for the covariance method of calculation and the other for the correlation method.

First two eigenvectors plotted against each other - with and without normalization

Normally, in principal component analysis, a biplot is created containing plots of both the loadings and the scores. Because three of our variables are dichotomous, the plots of the scores are not very interesting and I have not included the plots. For the covariance method, the first two principal components account for 86% of the variation, and for the correlation method, the first two principal components account for 79% of the variation.

We can see that the parties of the Presidents and the deficits are strongly correlated with each other in either plot, while the Senates and Houses are more weakly correlated with each other in the correlation plot but strongly correlated in the covariance plot. The deficits and the parties of the Presidents do not seem to be strongly correlated with the parties of the Senates or Houses in either plot. In the covariance plot, the lengths of the vectors differ much more than in the correlation plot, since the measurements are not standardized. Basically, the two plots give the much same information.

Below is a table of the eigenvectors and eigenvalues for the correlation method.

Eigenvectors and Eigenvalues for the Correlation Method
variable eigenvector 1 eigenvector 2 eigenvector 3 eigenvector 4
deficit 0.380 0.583 0.712 0.090
president 0.378 0.587 -0.701 0.145
senate -0.539 0.487 -0.024 -0.687
house -0.650 0.279 0.029 0.706
eigenvalue 1.683 1.462 0.506 0.350

We can see from the table that the first eigenvector is a contrast between the deficit / President and the Senate / House.  Democrats have a positive value and Republicans a negative value in the normalized matrix of measurements. So, for the deficit and the party of the President, deficits that are positive in the normalized data are positively associated with Democratic presidents and deficits that are negative in the normalized data are positively associated with Republican presidents. For the House and Senate, Republican Senates are associated with Republican Houses and Democratic Senates are associated with Democratic Houses. Together, positive normalized budget deficits and Democratic presidents align with Republican Senates and Houses and contrast with Democratic Senates and Houses.

The second eigenvector lumps together, weakly, the deficit, President, Senate, and House. Here, Democrats, at all levels, are aligned with positive values of the normalized deficit and for Republicans, negative deficits are associated with Republicans at all levels.  Also, Democrats are aligned with Democrats and Republicans are aligned with Republicans.

The third eigenvector is mainly a contrast between the deficit and the party of the President. Positive normalized deficits are contrasted with Democratic presidents and aligned with Republican presidents.

The fourth eigenvector is mainly a contrast between the parties of the Senate and House. Democratic Senates are contrasted with Democratic Houses and aligned with Republican Houses and vice versa. The last two eigenvectors only account for a small portion of the variation.

*The Deficit by Political Party Data Set

The deficit by political party data set contains data on the total and on balance deficits, the gross domestic product, and the political parties controlling the presidency, the senate, and the house for the years 1940 to 2011. The data for the deficits and the gross domestic product are from the Office of Management and Budget and can be found at http://www.whitehouse.gov/omb/budget/Historicals. Before 1978, the budget year ran from July 1st to June 30th. From 1978 onward, the budget year ran from October 1st to September 31st. The tables contain a one quarter correction between the years 1977 and 1978, which I have ignored. The data for 2011 is estimated. The year with which the deficit and gross domestic product are associated is the year at the end of the budget year for which the deficit and gross domestic product are calculated.

The data for the controlling political parties of the president, senate, and house were taken from the website of Dave Manuel, http://www.davemanuel.com/history-of-deficits-and-surpluses-in-the-united-states.php, a good resource on the history of the deficit.

FITTING A LOG LINEAR MODEL

You can use log linear models to fit cross classified catagorical data. In fitting log linear models, the logarithm of the expected counts in the cells of the contingency table are modeled as linear combinations of parameters, where there can be parameters for interactions between dimensions in the table. I follow the approach of Bishop, Fienberg, and Holland in Discrete Multivariate Analysis: Theory and Practice, MIT Press, 1975.

Some Terminology

Some terminology is in order. An hierarchical model is a model in which, if a parameter is not in the model, then neither are the higher level interactions associated with the parameter. A complete model is a model in which none of the expected cell counts are equal to zero. An incomplete model has at least one cell where the expected cell count is zero. A saturated model is a model in which all possible parameters are fit. An unsaturated model is a model in which some of the parameters are set to zero. In Bishop, et.al., each dimension of the table is called a factor and only hierarchical models are fit, which can be either complete (chapter 3) or incomplete (chapter 5). The parameters are modeled such that the sum over the indices for a given parameter and factor equals zero (except of course the first parameter, ‘u’).

The Saturated Model

Following the notation of Bishop, et.al., for a three dimensional contingency table, the saturated model would be

log(mijk) = u + u1(i) + u2(j) + u3(k) + u12(ij) + u13(ik) + u23(jk) + u123(ijk),

where ‘mijk‘ is the expected cell count in cell (i,j,k).

Say that we have two levels for each of the three factors. Then, the indices associated with the factors, ‘i’, ‘j’, and ‘k’, can each take on either the value ‘1’ or the value ‘2’. For this example, since there are only two levels for each factor and the parameters sum to zero over each index, specifying a parameter (say u12(11)) for a given set of factors (here, the interaction between factors 1 and 2) means that the other parameters for the given set of factors (in this case, u12(12), u12(21), and u12(22)) are specified too. It follows that, for this example, there are eight unique parameters in the saturated model, one for each factor combination (‘1′,’2′,’3′,’12’,’13’,’23’, and ‘123’) and one for the unindexed parameter ‘u’.

There are eight possible unique sets of values for the three indices ‘i’, ‘j’, and ‘k’ ((1,1,1), (1,1,2), … , (2,2,2)), representing the eight cells in the table. If ‘i’, ‘j’ and ‘k’ are given, so are ‘ij’, ‘ik’, ‘jk’, and ‘ijk’. This means that the eight estimated cell counts are uniquely defined. In the model, there are eight unique parameters, and there are eight cell counts. It follows that the saturated model fits the data exactly – the estimates of the expected cell counts are the cell counts themselves.

An Unsaturated Models

The saturated model is not usually of interest. For unsaturated models, in order to estimate the expected cell counts, Bishop, et.al., model the data in the cells as either independent Poisson variables or as variables from a single multinomial distribution, and find minimal sufficient statistics for the expected cell counts based on either of the two probability distributions. Under the assumptions, the minimal sufficient statistics are the same for either distribution.

An example of an unsaturated model based on the saturated model given above is

log(mijk) = u + u1(i) + u2(j) + u3(k) + u13(ik) + u23(jk) .

Here, the main effects of each of the factors are included (In Bishop, et.al., the main effects are always included. According to Bishop, et.al., such models are called comprehensive models.) The only higher order effects which are still present are the interactions between the first and third factors and between the second and third factors. Note that, since the u12(ij) term is not in the model, the term u123(ijk) cannot be either if the model is to be heirarchical. For an unsaturated model, Bishop, et.al., give an algorithm to find maximum likelihood estimators for the expected cell counts using the minimal sufficient statistics of the given unsaturated model. The maximum likelihood estimators do not depend on which probability distribution is used. (The results that I give below are based on an R function that I wrote to apply the algorithm.)

Evaluating an Unsaturated Model

Bishop, et.al., follow the method of evaluating the fit of an unsaturated model by using the chi squared distribution of either

Χ2 = Σ ( xi – ̂mi )2 / ̂mi

or

G2 = Σ -2 xi log( ̂mi / xi ) ,

where xi is the cell count in cell i and ̂mi is the estimated expected cell count in cell i, and i goes over all of the cells in the array for which ̂mi does not equal zero. Χ2 then is the Pearson chi square statistic and G 2 is minus twice the log likelihood ratio. Χ2 and G2 are goodness of fit statistics. A model fits if the chi square statistic falls in the ‘normal’ range for a chi square of the appropriate degrees of freedom. Bishop, et.al., only look at the right tail of the appropriate chi square distribution, accepting models that fit the data too well. Among the models considered, the most parsimonious model that still fits the data is accepted.

Fitting a Loglinear Model to the Deficit by Political Party Data Set

To demonstrate fitting a log linear model, I will use the data in the Deficit by Political Party Data Set. The deficits (surpluses) as a percentage of the gross domestic product are classed into four classes and the political parties in control are associated with the deficit (surplus) of the following year. The political parties go from 1946 to 2007 and the deficits (surpluses) as a percentage of gross domestic product go from 1947 to 2008. The data array is given below.

On Budget Deficit (-) (Surplus (+)) per GDP
by the Controlling Political Parties
of the Presidency, Senate, and House

Political Parties Lagged One Year

President Senate House (-6.2,-4] (-4,-2] (-2,0] (0,4.2]
D D D 0 7 10 2
R
R D
R 0 1 4 3
R D D 5 8 5 3
R 1 1 0 0
R D 4 2 2 0
R 1 3 0 0

Since the interaction between the deficit (surplus) is of interest and the senate and house seem to be interrelated, the following models were fit.

1: log(mijk) = u + u1 + u2 + u3 + u4 + u12 + u13 + u14+ u23 + u24 + u34

2: log(mijk) = u + u1 + u2 + u3 + u4 + u12 + u13 + u14 + u34

3: log(mijk) = u + u1 + u2 + u3 + u4 + u12 + u13 + u14

4: log(mijk) = u + u1 + u2 + u3 + u4 + u12 + u14 + u34

5: log(mijk) = u + u1 + u2 + u3 + u4 + u12 + u14

6: log(mijk) = u + u1 + u2 + u3 + u4

The subscript 1 refers to the deficit (surplus) classes and takes on 4 levels, the subscript 2 refers to the party of the president and takes on 2 levels, the subscipt 3 refers to the party controlling the senate and takes on 2 levels, and the subscript 4 refers to the party controlling the house and takes on 2 levels. The G2 values found by fitting the six models, as well as the degrees of freedom for and the p-values from the chi squared tests, are given in the table below.

Model G2 df p-value
1: 9.14 5 0.1036
2: 10.83 7 0.1460
3: 16.79 8 0.0324
4: 12.19 10 0.2725
5: 17.47 11 0.0947
6: 37.66 17 0.0027

From the table, models 1, 2, 4, and 5 all fit the data. Chi squared tests to compare the models
are given in the table below.

Models Difference
in G2
Difference
in df
p-value
1: & 2: 1.69 2 0.4296
2: & 4: 1.36 3 0.7149
4: & 5: 5.28 1 0.0216

There is little difference between models 1 and 2, and 2 and 4, so out of models 1, 2, and 4, the most parsimonius model is model 4. Model 5 is more parsimonious than model 4, however the test of the difference between the models shows a significant difference from the central chi squared distribution with 1 degree of freedom. It appears that something is lost when going from model 4 to model 5. We conclude that model 4 is the best model and that the interaction between the house and senate, the interaction between the deficit level and the party of the president, and the interaction between the deficit level and the party controlling the house are important and that the interaction between the party controlling the senate and the deficit level is not.

Below is a table of the fitted cells, normalized to sum to 100 across rows, for model 4.

Estimated Expected Cell Counts Normalized by Row
Using Model 4 for the On Budget Deficit (-) (Surplus (+)) per GDP
by the Controlling Political Parties
of the Presidency, Senate, and House

Political Parties Lagged One Year
Rows Sum to 100 Except for Rounding

President Senate House (-6.2,-4] (-4,-2] (-2,0] (0,4.2]
D D D 00 30 55 15
R
R D
R 00 29 42 29
R D D 33 40 21 07
R 26 42 17 15
R D 33 40 21 07
R 26 42 17 15

Democratic presidents perform much better than Republican presidents. With regard to Republican presidents, there is no difference between Democratic and Republican senates. Republican houses perform somewhat better than Democratic houses under both Democratic and Republican presidencies.

FOR ENERGY RESEARCHERS:

Sampling Plans – Analysis of Samples – Modeling – Graphing

Below are graphs of altitudes and azimuths of the sun, for differing times of the year, for Stratford, Iowa, which give you an example of the capability of R for use in planning solar installations:

altitude-azimuth

You can download the R function – which plots individual altitude / azimuth graphs, for any place in the world, for any time of year. You can download the file here: angle.plot.

If you call the function ‘angle.plot’, to run the program with default values, you would type ‘angle.plot()’ at the R prompt. To change the defaults, you can manually change the defaults when the function is run; for example, angle.plot(loc=”Hull, MA”,  mo=8,  da=28,  ye=2013,  tz=4,  la=42.3019,  lo=70.9083).

function(loc="Stratford, Iowa", mo=1, da=1, ye=2010, tz=6, 
la=42.26782, lo=93.92097) {

#  The algorithms for this function come from Astrological Algorithms, 2nd ed.,
#  by Jean Meeus, published in 1998, by Willmann-Bell, Inc.  The page 
#  references are to Meeus. 
   
#  check the inputs

      cat("the location entered is ", loc, "\n")
      cat("the date entered is ", mo, "/", da, "/", ye, "\n")
      cat("the time zone is ", tz, " hours from UT", "\n")
      cat("the latitude is ", la, "degrees and the longitude is ", lo, 
          " degrees\n", "\n")
      cat("  if not correct then enter n ")
      tes <- readline()
      if(tes == "n") stop("change the value of the arguments")
	
#
#  define tosid - the solar to sidereal time correction factor (p. 408)
#
      tosid <- 366.2422 / 365.2422	
#
#  create local - the local dynamical times for the plot - in degrees 
#  (see chapter 10 in Meeus)
#
      local <- seq(-150, 150, by = .02)	
#
#  find jd - the julian days for the local dynamical times - in days (p. 61)
#
  if(mo == 1 | mo == 2) { 
    mo <- mo + 12
    ye <- ye - 1
   }
  ga <- floor(ye/100 + 1e-008)
  gb <- 2 - ga + floor(ga/4 + 1e-008)
  jd <- (floor(365.25 * (ye + 4716)) + floor(30.6001 * (mo + 1)) +
       (da +.5) + gb - 1524.5) + local/(15 * 24) + lo/(15 * 24)
     
#
#   find T - the increment from January 1, 2000 at noon TD
 
#            for the local dynamical times - in centuries (p. 163)
#
     T <- (jd - 2451545)/36525
     Tsq <- T^2
#
#   find L0 - the mean longitudes of the sun - in radians (p. 163)
#
           L0 <- (280.46646 + 36000.76983 * T + 0.0003032 * Tsq)/180 * pi
#
#   find M - the mean anomolies of the sun - in radians (p. 163)
#
      M <- (357.52911 + 35999.05029 * T - 0.0001537 * Tsq)/180 * pi
#
#   find C - the sun's equations of the center - in radians (p. 164)
#
      C <- (1.914602 - 0.004817 * T - 1.4e-005 * Tsq) * sin(M)
      C <- C + (0.019993 - 0.000101 * T) * sin(2 * M)
      C <- (C + 0.000289 * sin(3 * M))/180 * pi
#
#   find omega - the correction factors for finding the apparent longitudes
#                of the sun - in radians (p. 164)
#
      omega <- (125.04 - 1934.136 * T)/180 * pi 

#   find lambda - the apparent longitudes of the sun - in radians (p. 164)
#
      lambda <- L0 + C - (0.005689999999999997 + 
                          0.004779999999999999 * sin(omega))/180 * pi
#
#   find epsilon - the apparent obliquities of the ecliptic - in radians 
#   (pp. 144-147)
#
      Lprime <- (218.3165 + 481267.8813*T) / 180 * pi
      deltapsi <- (-17.2*sin(omega) - 1.32*sin(2*L0) - 
                   0.23*sin(2*Lprime) + 0.21*sin(2*omega))/3600
      deltaep <- (9.2*cos(omega) + 0.57*cos(2*L0) + 0.10*cos(2*Lprime) - 
                  0.09*cos(2*omega))/3600
      epsilon <- (23.439291111 - 0.013004167 * T - 1.638889e-07 * Tsq + 
                  5.036111111e-07 * Tsq * T + deltaep)/180 * pi
#   find alpha - the apparent right ascensions of the sun - in radians (p. 165)
#
      alpha <- atan2(cos(epsilon) * sin(lambda), cos(lambda)) 
#
#   find delta - the apparent declinations of the sun - in radians (p. 165)
#
      delta <- asin(sin(epsilon) * sin(lambda))	
#
#   find e - the eccentricities of the earth's orbit - unitless (p. 163)
#
      e <- 0.016708634 - 0.000042037 * T - 0.0000001267 * Tsq 
#
#   find deltaT - the factors to convert to mean times from dynamical times
#                 and vice versa (referenced to solar time) - 
#                 in degrees (p. 78)
 
#
      deltaT <- (65 + 88*T + 25.3*T^2)/3600*15 
#
#   find jd_UT - the julian day for 0h UT on the day of the plot - in days
#   and T_UT - the increment from 1/1/2000 noon UT to jd_UT - in centuries  
#   (p. 87)
# 
      jd_UT <- jd[local==0] - lo/(15 * 24) - .5
      T_UT <- (jd_UT -2451545)/36525
#
#   find THETA0 - the mean sidereal time of the point due south at Greenwich
#                 at 0h UT on the day of the plot - in radians (p. 87)
#
      THETA0 <- 100.46061837 + 36000.770053608*T_UT 
      THETA0 <- (THETA0 + 0.000387933*T_UT^2 - (T_UT^3)/38710000)/180*pi
#
#   find theta - the apparent sidereal times of the points due south at 
#                the location for the local dynamical times - 
#                in radians (p. 88)
#
      theta0 <- THETA0 + (lo + local + 180 + deltaT)*tosid/180*pi
      theta <- (theta0 - lo/180*pi) + (deltapsi*cos(epsilon))/180*pi
#
#   find H - the hour angles of the sun for the local dynamical times - 
#            in radians (referenced to sidereal time) (p. 92)
#
       H <- (theta - alpha)%%(2*pi)
       H[H>pi] <- H[H>pi]-2*pi	
#
#   convert the geographical latitude to radians
#
      phi 
#
#   find the altitudes and azimuths of the sun - in degrees (p. 93)
# 
     azimuth <- atan2(sin(H), (cos(H) * sin(phi) - tan(delta) * 
                      cos(phi)))* 180 / pi
     altitude <- asin(sin(phi) * sin(delta) + 
                      cos(phi) * cos(delta) * cos(H)) * 180 / pi
#
#   convert local - the local dynamical times - to Time - the civil times - 
#   in solar hours (p. 77)
#
      Time <- (local - deltaT + 180 + lo)/15 - tz
#
#   make the plot
#
      par(mar = c(6, 6, 7, 7) + 0.1, font.main=1)
      plot(Time[altitude>=0], altitude[altitude>=0], axes = F, type = "l",
 
      lty = 1, xlab="", ylab = "", 
      xaxs = "r", xaxt = "n", yaxt = "n", yaxs = "r", ylim = c(-10, 90),
      xlim = c(4, 22), col=2)
      axis(1, at = c(4, 6, 8, 10, 12, 14, 16, 18, 20, 22), labels = c("4AM",
           "6AM", "8AM", "10AM", "12PM", "2PM", "4PM", "6PM", "8PM",
           "10PM"), tck = 1)
      axis(2, at = c(-10, 0, 10, 20, 30, 40, 50, 60, 70, 80, 90), labels = c(
           "-10", "0", "10", "20", "30", "40", "50", "60", "70", "80",
           "90"), tck = 1, adj = 1)
      legend(3.4,90, c("Altitude", "Azimuth"), lty=1,
           col = 2:3, bg="white")
      mtext(side = 2, line = 3, "Altitude of the Sun (degrees)")
      par(new = "T")
      plot(Time[altitude>=0], azimuth[altitude>=0], axes = F, type = "l", 
          lty = 1,
          xaxt = "n", ylim = c(-140, 140), xlim = c(4, 22), ylab = "", 
          xlab="", col=3)
      axis(side = 4, at = c(-120, -90, -60, -30, 0, 30, 60, 90, 120), labels
           = c("120 E", "90 E", "60 E", "30 E", "0", "30 W", "60 W", "90 W",
           "120 W"), adj = 0, cex.axis=.7)
      mtext(side = 4, line = 3, "Azimuth of the Sun (degrees)", srt=270)
      mtext(side = 3, line = 2, paste(loc, "\n \n", "Latitude =", 
            round(la, 5), "  Longitude =", round(lo, dig = 5), " "))
      mtext(side = 1, line = 3, "Time")
      box()
#
#   print out some variables
#
      for( i in 1:(length(local)-1) ) {
            if (altitude[i+1] >= 0 & altitude[i] < 0) {kr              
            if (altitude[i+1] >= altitude[i]) {kn 
            if (altitude[i+1] < 0 & altitude[i] >= 0) {ks 
            }
      cat(
      "L0 ",(L0[c(kr,kn,ks)]*180/pi)%%360,"\n",
      "M ",(M[c(kr,kn,ks)]*180/pi)%%360,"\n",
      "C ",(C[c(kr,kn,ks)]*180/pi)%%360,"\n",
      "omega ",(omega[c(kr,kn,ks)]*180/pi)%%360,"\n",
      "lambda ",(lambda[c(kr,kn,ks)]*180/pi)%%360,"\n",
      "epsilon ",(epsilon[c(kr,kn,ks)]*180/pi)%%360,"\n",
      "alpha ",(alpha[c(kr,kn,ks)]*180/pi/15)%%24,"\n",
      "delta ",delta[c(kr,kn,ks)]*180/pi,"\n",
      "e ",e[c(kr,kn,ks)],"\n",
      "deltaT ",(deltaT[c(kr,kn,ks)]*180/pi/15)%%24,"\n",
      "THETA0 ",(THETA0*180/pi/15)%%24,"\n",
      "theta0 ",(theta0[c(kr,kn,ks)]*180/pi/15)%%24,"\n",
      "theta ",(theta[c(kr,kn,ks)]*180/pi/15)%%24,"\n",
      "H ",(H[c(kr,kn,ks)]*180/pi/15),"\n",
      "azimuth ",azimuth[c(kr,kn,ks)],"\n",
      "altitude ",altitude[c(kr,kn,ks)],"\n",
      "Time ", Time[c(kr,kn,ks)],"\n",
      "rise ",kr,"noon ",kn,"set ",ks,"\n")
}

Fitting Sunspot Numbers and CO2 Concentrations to Temperature Anomalies

In this post, I give a simple model for temperature anomaly for data from 1850 to 2013.  The temperature anomaly data is the land and ocean series from Berkeley Earth; the sunspot data is the old monthly averages of daily sunspot numbers from the Royal Observatory in Belgium; and the CO2 data come from two series of the NOAA ESRL, one series – using ice cores – is of yearly CO2 levels from 1850 to 1958 and the other of monthly levels from the Mauna Loa Observatory in Hawaii from March of 1958 to 2013.  I created a time series out of the two CO2 series by adding the seasonal components to each yearly point in the earlier series, where the seasonal components are found using the later series.  I then filtered the combined series with a linear filter 129 months wide.  I also tried a filter width of 12 months, but the results were essentially the same.  I smoothed the CO2 data because the seasonal variation for the CO2 data is local to Hawaii and the other data is global..  The first sunspot observation is associated with the fourth temperature anomaly observation and the fourth CO2 observation, since I have observed that temperature anomaly is lagged three months behind sunspot number, in terms of correlation.

Fit of Temperature Anomaly to SDunspot numbers and CO2 Concentrations 1850 to 2013

Together, the two variables, sunspots and CO2 account for about 70% of the variation in temperature anomaly, where variation is measured by sums of squares. The regression indicates an increase of 0.045 degrees C for every 100 more sunspots and an increase of 0.92 degree C for every 100 ppm increase in the level of CO2.  The fit does agree with the last ten years of the hiatus in temperature growth (Actually it does not.  I thought it did.)  .

While the fit is not perfect to the linear regression assumptions, the fit is not too bad.  The diagnostic plots from R are given below.

Diagnostic Plots

This is a very simple model, and does not account for other greenhouse gases or other component of the atmosphere, or the composition of the different places on the earth. That said, temperature is a measure of energy, they are linearly related.  So using sunspots which, for the data we have, are a good proxy for energy density from the sun, makes sense.  I do not know enough physics  to say if the energy returned to the earth by greenhouse gases is linearly related to greenhouse gas concentration, but the plots indicate probably yes.

Corrected Plots for Filtered Data and Phases for the 1850 to 2015 Data and the 1750 to 2015 Data

When I filtered the sunspot and temperature data, I used a bandwidth of 257 months rather than 129 months.  The corrected plots are below.

cc.ra.ss.ta.plot129 filtered correctly

cc.corrected.phase.centered.ss.ta

Once again, the sunspot data is the mean of the daily sunspot numbers over each month from the Royal Observatory in Belgium and the temperature anomaly data is from Berkeley Earth and is the land and ocean series.

Here are the plots for the new Belgian series and the Berkeley Earth temperature anomaly series for land data, which goes back to 1750.

cc.filt.new.corrected


cc.phase.centered.new.corrected.ss.ta

The contrast between the two sets of series may be because the temperature anomaly data in the first set is land and ocean data, while the temperature data in the second – longer – set is just land data.  The radiation from the ocean is much more uniform than the radiation from  land.  Also, I do not think we are certain that the observed fluctuation in the solar radiation which has been seen to follow the sunspot cycle during the time that we have had satellite measurements – since 1978 I believe – actually followed sunspot numbers closely over the past.