Scipy Stats Project: 2011

Monday, August 22, 2011

SVAR Estimation

For the next phase of my GSoC project, I integrated SVAR estimation with restrictions on the within period effects and shock identification. This follows the method outlined in section 11.6 of Hamilton (1994) [1]. In order to show how the system works, I will work through an example.

Structural VARs of the type I will be positing here are best suited to situations where impulse response function shocks need to be identified, but there is little to no theoretical justification for the within period dynamics required for orthogonalization. While this allows greater flexibility to the researcher in specifying the model, it also leaves greater room for error, especially when specifying what parameters can be assumed zero and what should be estimated, and whether or not this specification meets the criteria for shock identification.

Given time series data, SVAR class, initiation requires (at a minimum):

1) svar_type
2) A, B matrices marking with an 'E' or 'e' where parameters are unknown

This must be A,B, or AB. An 'A' system assumed that the matrix premulitplying the error matrix is the identity matrix and vice-versa for the 'B' system. In an 'AB' system, elements in both the A and B matrix need to be estimated. The system can be summarized as:

$Ay_t = c + A_1y_{t-1} + A_2y_{t-2} + \cdots + A_py_{t-p} + B\varepsilon_t$

We estimate A, B in a second stage after A_1-A_n, /Sigma_u are estimated via OLS, by maximizing the log-likelihood:

$\dpi{80} \log{L} = - \frac{Tn}{2}\log(2π) + \frac{T}{2}\log|A|^2 - \frac{T}{2}\ln|B|^2 - \frac{T}{2}trace(A'B'^{-1}B^{-1}A\Sigma_u)$

We could set up an example system as:

In [1]: import numpy as np

In [2]: A = np.array([[1, 'E', 0], [1, 'E', 'E'], [0, 0, 1]])

In [3]: A
Out[3]: 
array([['1', 'E', '0'],
       ['1', 'E', 'E'],
       ['0', '0', '1']], 
      dtype='|S1')

In [4]: B = np.array([['E', 0, 0], [0, 'E', 0], [0, 0, 'E']])

In [5]: B
Out[5]: 
array([['E', '0', '0'],
       ['0', 'E', '0'],
       ['0', '0', 'E']], 
      dtype='|S1')

In order to aid numerical maximum likelihood estimation, the SVAR class fit can also be passed guess matrices for both A and B parameters.

Building upon the previously used three variable macro system, a simple example of an AB system estimation is as follows:

import numpy as np
import scikits.statsmodels.api as sm
from scikits.statsmodels.tsa.api import VAR, SVAR
import matplotlib.pyplot as plt

mdata = sm.datasets.macrodata.load().data
mdata = mdata[['realgdp','realcons','realinv']]
names = mdata.dtype.names
data = mdata.view((float,3))
data = np.diff(np.log(data), axis=0)

#define structural inputs
A = np.asarray([[1, 0, 0],['E', 1, 0],['E', 'E', 1]])
B = np.asarray([['E', 0, 0], [0, 'E', 0], [0, 0, 'E']])
A_guess = np.asarray([0.5, 0.25, -0.38])
B_guess = np.asarray([0.5, 0.1, 0.05])
mymodel = SVAR(data, svar_type='AB', A=A, B=B, names=names)
res = mymodel.fit(maxlags=3, maxiter=10000, maxfun=10000, solver='bfgs')
res.irf(periods=30).plot(impulse='realgdp', plot_stderr=False)

From here, we can access the estimates of both A and B:

In [2]: res.A
Out[2]: 
array([[ 1.        ,  0.        ,  0.        ],
       [-0.50680204,  1.        ,  0.        ],
       [-5.53605672,  3.04117688,  1.        ]])

In [3]: res.B
Out[3]: 
array([[ 0.00757566,  0.        ,  0.        ],
       [ 0.        ,  0.00512051,  0.        ],
       [ 0.        ,  0.        ,  0.02070894]])

This also produces a SVAR irf, resulting from a impulse shock to realgdp:

We can check our work by comparing our work to comparable packages (see below) and also performing some simple calculations. If we calculated A and B correctly, then:

$\Sigma_u = A^{-1}BB(A^{-1})'$

We find:

In [10]: P = np.dot(npl.inv(res.A), res.B)

In [11]: np.dot(P,P.T)
Out[11]: 
array([[  5.73906806e-05,   2.90857141e-05,   2.29263262e-04],
       [  2.90857141e-05,   4.09603703e-05,   3.64524316e-05],
       [  2.29263262e-04,   3.64524316e-05,   1.58721639e-03]])

In [12]: res.sigma_u
Out[12]: 
array([[  5.73907395e-05,   2.90857557e-05,   2.29263451e-04],
       [  2.90857557e-05,   4.09604397e-05,   3.64524456e-05],
       [  2.29263451e-04,   3.64524456e-05,   1.58721761e-03]])

Through much trial, it seems like the 'bfgs' estimator is the best for solving the maximum likelihood problem. I would like to investigate in the future why this may be. Estimating the likelihood was aided greatly by bringing in the likelihood methods already present in the base component of the statsmodels package.

This equivalent system can be estimated and plotted in the R [2] script:

library("vars")

#data <- read.csv("/home/skipper/statsmodels/statsmodels-skipper/scikits/statsmodels/datasets/macrodata/macrodata.csv")
data <- read.csv("/home/bart/statsmodels/scikits/statsmodels/datasets/macrodata/macrodata.csv")
names <- colnames(data)
data <- log(data[c('realgdp','realcons','realinv')])
data <- sapply(data, diff)
data = ts(data, start=c(1959,2), frequency=4)

amat <- matrix(0,3,3)
amat[1,1] <- 1
amat[2,1] <- NA
amat[3,1] <- NA
amat[2,2] <- 1
amat[3,2] <- NA
amat[3,3] <- 1
bmat <- diag(3)
diag(bmat) <- NA
svar <- SVAR(var, estmethod = 'scoring', Bmat=bmat, Amat=amat)
plot(irf(svar, n.ahead=30, impulse = 'realgdp'))
#myirf <- plot(irf(myvar, impulse = "realgdp", response = c("realgdp", "realcons", "realinv"), boot=TRUE, n.ahead=30, ci=0.95))
#plot.irf()

This is the end of my Google Summer of Code project. In the future, I hope to continue work on SVAR, bring in long-run restrictions ala Blanchard and Quah, and further test solvers and their performance. I have benefited a lot from this project and would like to sincerely thank my mentors: Skipper Seabold, Josef Perktold, and Alan Isaac. They have been a great support and have answered my questions swiftly and completely. All blame for failure to complete the goals I set for myself at the beginning of the summer rests on me alone. Not only have I learned a lot about time series econometrics, but also quite a bit about how community software development works and especially realistic timelines. This has been an invaluable experience and I plan to further improve my contributions to the project in the coming year.

Bart Baker

[1] Hamilton, James. 1994. Time Series Analysis. Princeton University Press: Princeton.
[2] R Development Core Team. 2011. R: A Language and Environment for Statistical Computing. R FOundation for Statistical Copmuting.

Tuesday, July 19, 2011

Sims and Zha IRF Error Bands

After completing the Monte Carlo error bands, I moved onto integrating Sims and Zha error bands into the statsmodels package.These are based on pages 1127 to 1129 of Chris Sims and Tao Zha's 1999 Econometrica (Vol. 67 No. 5) article, "Error Bands for Impulse Responses."

This method took a long time just to get my head around and a lot of trial and error. While Sims and Zha focus on Bayesian sampling methods from the joint distribution of the coefficients and covariance matrix to generate the draws of the MA(n) representations, the method used here to generate these draws are simpler Monte Carlo simulations. In order for these error bands to truly follow the prescription of Sims and Zha (SZ), the Bayesian sampling methods would need to be employed.

Here's a quick overview of the theory.

Given a covariance matrix Sigma, we can perform eigenvalue decomposition as such:

$W{\Lambda}W'=\Sigma$

where Lambda is diagonal and each diagonal element of Lambda corresponds to an eigenvalue of Sigma. Column 'k' of W is the eigenvector corresponding to the 'k'th diagonal element of Lambda or the 'k'th eigenvector.

We will also define our moving average representation or impulse response function as:

$c_{ij}$

where c is the time series vector ('t' to 't+h') response of variable 'i' to a shock in period 't' to variable 'j'.

SZ make the case when the time series model (VAR) is fit to data that is not smooth (differenced, etc.), most of the variation will be contained in the principal components of W. With this information, SZ propose three methods for characterizing the uncertainty around the data.

The following three arrays of graphs can be produced by the following code:

import numpy as np
import scikits.statsmodels.api as sm
from scikits.statsmodels.tsa.api import VAR
import matplotlib.pyplot as plt

mdata = sm.datasets.macrodata.load().data
mdata = mdata[['realgdp','realcons','realinv']]
names = mdata.dtype.names
data = mdata.view((float,3))
data = np.diff(np.log(data), axis=0)
mymodel = VAR(data,names=names)
res = mymodel.fit(maxlags=3,ic=None)

res.irf(periods=20).plot(impulse='realgdp', stderr_type='sz1', repl=1000, seed=30)
res.irf(periods=20).plot(impulse='realgdp', stderr_type='sz2',repl=1000, seed=30)
res.irf(periods=30).plot(impulse='realgdp', stderr_type='sz3', repl=1000, seed=30)
plt.show()

At this point, I have only implemented this for the non-orthogalized impulse responses, but Sims and Zha explicitly address this case in their paper and it is analogous to the methods defined below.

1) Symmetric, assumes Gaussian uncertainty. These error bands add and subtract the estimated impulse response functions with error completely defined by the principal component¹:

$c_{ij}\pm W_{\cdot{k}}(t)\sqrt{\lambda_k}$

where W_{.k} is the column of W corresponding to the 'k'th eigenvalue of Sigma. The above equation would be the 68% probability bands. 95% would simply be mulitplied by a scalar of 1.96 on both sides.

In our three variable model, this method produces the following error bands a response to a gdp shock:

Looks nice enough. These are defined as implemented to be symmetric, so I wasn't too worried about these.

2) Non-symmetric error bands generated by Monte Carlo draws where only covariance between time but not across variables exists.

In this case, instead of assuming Gaussian uncertainty, we retain the draws used to estimate the initial covariance matrix, Sigma, and for each of the these draws we calculate the vector gamma_k:

$\gamma_k=W_{k\cdot }\times c_{ij}$

where W_{k.} is the 'k'th row of W, and k refers to the largest eigenvalue of Sigma. Using these quantiles of the individual elements of gamma_k across the MC draws, we can generate 68% probability bands as follows:

$\hat{c}_{ij}+\gamma_{k,.16}, \hat{c}_{ij}+\gamma_{k,.84}$

where the subscripts on gamma_k refer to the 16th and 84th percentile of the gamma draws. In our three variable case, this produces the following graphical representation:

We can see that in this case the uncertainty clearly drops precipitously once we hit a certain t in the time series representation.

Just for completeness, if we look at the response of the same variables to a shock to real consupmtion, we notice that this method does not even guarantee that the probability bands contain the estimated impulse response function:

While this was extremely worrisome at first, Sims and Zha do give some examples where the probability bands do not contain the estimated impulse response function for certain 't'. These error bands are supposed to give the researcher an idea as to the symmetry (or lackthereof) of the posterior distribution of the impulse response functions.

3) Non-symmetric error bands generated by Monte Carlo draws where covariance between time and between variables is identified.

Here, instead of treating each vector of responses individually, we consider each set of impulse response functions as a single system. Sims and Zha note that while in most cases the majority of the covariance will be between intertemporal observations of a single variable, considering inter-variable time series covariance may be valuable in certain situations.

In order to investigate how different c_ij related to each other, we stack each set of impulse response functions that respond to a single shock j and then compute the eigenvector decomposition of this stacked vector for each MC draw. This allows us to compute eigenvectors that contain the variation both across time periods and across variables. We calculate our gamma_k in an analagous manner to the second method and add the appropriate gamma_k quantiles to the estimated response.

It happens that in our system, the cross-variable covariance does not reveal much addition information about a shock to GDP:

It seems as though the variation in real GDP dominates the variation in the other variables.

This can also be seen by examining the response of real GDP to all three shocks:

Altogether these methods are meant to be used to examine the characteristics of the time series representation of the data. The first Sims and Zha method would most likely be the one to publish in a paper, but the other two methods give the researcher a more complete picture of the nature of the posterior distribution.

A few small things still need to be worked out for the Sims and Zha methods to be a complete part of the VAR package. As I mentioned before, they still need to be implemented for orthogonalized IRFs, but this will not be difficult (they are very clear in their paper in moving towards this implementation). Also, it will be important to bring in user choice when it comes to which principal component to use when characterizing the data. For Sims and Zha error bands 1 and 2, the user can pass in a matrix of integers that correspond to the chosen principal components of the variance-covariance matrix.

Scheduling here on out (let me know what you think, updated):

1) I'd like to completely integrate Sims and Zha error bands. Specifically, this means:

a) Component choice for SZ3

b) Orthogonalized error bands

c) Clean up code

I will aim to finish the above tasks by this Saturday (7/23).

2) After this, I would like to move on to structural VAR implementation. This in itself will feed back on the error band methods that I have been working on. SVAR will draw from the ML methods already present in statsmodels. I'd like to finish a SVAR estimation method by the August 3rd.

3) Once SVAR is completely integrated into the package, per Skipper's suggestion, I will be using pure Python to generalize the Kalman filter. I'll have more questions once I reach that point.

I'd really like to move much quicker than I did in the first half of GSOC and hit these goals.

Bart

^{1. SZ suggest using the largest eigenvalue(s) of Lambda, as they will most likely identify the majority of the variation in this type of data.↩}

Saturday, June 11, 2011

Monte Carlo Standard Errors for Impulse Response Functions

I've come to a major checkpoint in integrating Monte Carlo error bands
for impulse response functions (this is only non-orthogonal right now).

Here is some quick code to get VAR IRF MC standard errors:

import numpy as np
import scikits.statsmodels.api as sm
from scikits.statsmodels.tsa.api import VAR
import matplotlib.pyplot as plt

mdata = sm.datasets.macrodata.load().data
mdata = mdata[['realgdp','realcons','realinv']]
#mdata = mdata[['realgdp','realcons','realinv', 'pop', 'realdpi', 'tbilrate']]
names = mdata.dtype.names
data = mdata.view((float,3))
data = np.diff(np.log(data), axis=0)
mymodel = VAR(data,names=names)
res = mymodel.fit(maxlags=3,ic=None)

#first generate asymptotic standard errors (to compare)
res.irf(periods=30).plot(orth=False, stderr_type='asym')
#then generate monte carlo standard errors
res.irf(periods=30).plot(orth=False, stderr_type='mc', seed=29, repl=1000)
plt.show()

Produces the following plots of a shock to (delta) realgdp.

Asymptotic:

Monte Carlo:

I added functions to the VARResults, IRAnalysis, and also added to
some of the pre-existing functions in these classes and also to
plotting.py. Because Monte Carlo standard errors are in general not
symmetric, I had to alter the plot_with_error function from the
tsa.vector_ar.plotting.py file and number of other functions.

Functions added:

VARResults.stderr_MC_irf(self, orth=False, repl=1000, T=25,
signif=0.05, seed=None)
This function generates a tuple that holds the lower and upper error
bands generated by Monte Carlo simulations.

IRAnalysis.cov_mc(self, orth=False, repl=1000, signif=0.05, seed=None)
This just calls the stderr_MC_irf function from the original model
using the number of periods specified when the irf() class is called
from the VAR class.

Modified functions:

BaseIRAnalysis.plot
Added specification of error type stderr_type='asym' or 'mc' (see
example). Also repl (replications) and seed options added if 'mc'
errors are specified.

in tsa.vector_ar.plotting.py
plot_with_error()
irf_grid_plot()

These functions now take the error type specified in the plot()
function above and treats the errors accordingly. Because of how all
of the VAR work (esp. plotting) assumed asymptotic standard errors,
all of these irf plot functions assumed that errors were passed in as
a single matrix with each standard error depending on the MA lag
length and shock variable. Now, depending on which error is specified,
the function will take a tuple of arrays as the standard error if
(stderr_type='mc') rather than a single array.

A serious issue right now is speed. While the asymptotic standard
errors take about a half second on my home laptop to run, the Monte
Carlo standard errors with 3 variables and 1000 replications takes
about 13 seconds to run. Each replication discards the first 100
observations. The most taxing aspect of generating the errors is
re-simulating the data assuming normally distributed standard errors
1000 times (using the util.varsim() function).

Bart

Monday, May 23, 2011

5/6 - 5/20: Getting Acclimated and VARRresults.reorder() function

Just finished up the official first two weeks of GSoC. I spent most of the first week attempting to get a feel for how all of the time series methods for statsmodels are organized. Once I felt comfortable within the VAR package, I began work on adding a reorder() function to the VARResults class, which allows the user to specify the order of the endogenous variables in a vector auto-regression system. While the order of the variables plays no role in estimating the system, if the shocks are to be identified, variable order is used to specify the within period impact of shocks to individual variables in the system.

For example, let us say that we have a 3-variable VAR system, originally ordered realGDP, realcons, and realinv, contained within the VARResults class 'res'. Reordering the variables in the system is as simple as follows:

In [1]: import scikits.statsmodels.api as sm

In [2]: import numpy as np

In [3]: mdata = sm.datasets.macrodata.load().data

In [4]: mdata = mdata[['realgdp','realcons','realinv']]

In [5]: names = mdata.dtype.names

In [6]: data  = mdata.view((float,3))

In [7]: from scikits.statsmodels.tsa.api import VAR

In [8]: res = VAR(data, names=names).fit(maxlags=3,ic=None)

In [9]: res.names
Out[9]: ['realgdp', 'realcons', 'realinv']

In [10]: res_re = res.reorder(['realinv','realcons','realgdp'])

In [11]: res_re.names
Out[11]: ['realinv', 'realcons', 'realgdp']

The reorder function reuses all of the results from the original VAR class, but rearranges them to be in line with the new system. If working with large a # of observations, the computational advantage becomes useful pretty quickly. For example, with a 100,000 observation system with three variables, re-estimating the system after changing the variable order took 3.37 seconds, while using the reorder function took 0.57 seconds.

In the next few weeks I am planning on adding more impulse response function error band estimation methods. The current package only includes analytical error bands.

Saturday, April 23, 2011

Statsmodels: GSoC Prelim

This is my first entry in my Statsmodels Project Summer 2011 blog. I will update this blog weekly as the summer goes by with information regarding progress on my work for the scikits.statsmodels Python library.

In order to complete the preparation process for the statsmodels Google Summer of Code sponsorship, I wrote a quick patch that included a cointegration test. As of now, the test can only be run on a bivariate system with a simple Dickey-Fuller test on the residuals, using the Mackinnon [1] critical values. I would like to expand the functionality of this test to allow for Augmented Dickey-Fuller tests of the residuals and also tests of multivariate cointegrated systems. This patch served as a nice way to dive into what time series methods scikits.statsmodels currently includes in its toolbox and where to go from here.

Scipy Stats Project