Scipy Stats Project (jseabold)<br />
<br />
The statsmodels project started as part of the Google Summer of Code 2009. Now that GSoC is officially over, this blog is a place to learn about updates to the project. Comments and questions are welcome, and anyone who wishes to help with development is very welcome! Discussion of the project takes place on the scipy-dev mailing list.<br />
<br />
SVAR Estimation (2011-08-22)<br />
<br />
For the next phase of my GSoC project, I integrated SVAR estimation with restrictions on the within-period effects and shock identification. This follows the method outlined in section 11.6 of Hamilton (1994) [1]. To show how the system works, I will work through an example.<br />
<br />
Structural VARs of the type posited here are best suited to situations where impulse response shocks need to be identified but there is little theoretical justification for the within-period dynamics required for orthogonalization. While this gives the researcher greater flexibility in specifying the model, it also leaves greater room for error, especially in deciding which parameters can be assumed zero and which should be estimated, and whether that specification meets the criteria for shock identification.<br />
<br />
Given time series data, initializing the SVAR class requires (at a minimum):<br />
<br />
1) svar_type<br />
2) A and B matrices, with an 'E' or 'e' marking where parameters are unknown<br />
<br />
The svar_type must be 'A', 'B', or 'AB'. An 'A' system assumes that the matrix premultiplying the error vector is the identity matrix (and vice versa for the 'B' system). In an 'AB' system, elements in both the A and B matrices need to be estimated. The system can be summarized as:<br />
<br />
<a href="http://www.codecogs.com/eqnedit.php?latex=Ay_t = c @plus; A_1y_{t-1} @plus; A_2y_{t-2} @plus; \cdots @plus; A_py_{t-p} @plus; B\varepsilon_t" target="_blank"><img src="http://latex.codecogs.com/gif.latex?Ay_t = c + A_1y_{t-1} + A_2y_{t-2} + \cdots + A_py_{t-p} + B\varepsilon_t" title="Ay_t = c + A_1y_{t-1} + A_2y_{t-2} + \cdots + A_py_{t-p} + B\varepsilon_t" /></a><br />
<br />
We estimate A and B in a second stage, after A_1 through A_p and \Sigma_u have been estimated via OLS, by maximizing the log-likelihood:<br />
<br />
<a href="http://www.codecogs.com/eqnedit.php?latex=\dpi{80} \log{L} = - \frac{Tn}{2}\log(2π) @plus; \frac{T}{2}\log|A|^2 - \frac{T}{2}\ln|B|^2 - \frac{T}{2}trace(A'B'^{-1}B^{-1}A\Sigma_u)" target="_blank"><img src="http://latex.codecogs.com/gif.latex?\dpi{80} \log{L} = - \frac{Tn}{2}\log(2π) + \frac{T}{2}\log|A|^2 - \frac{T}{2}\ln|B|^2 - \frac{T}{2}trace(A'B'^{-1}B^{-1}A\Sigma_u)" title="\dpi{80} \log{L} = - \frac{Tn}{2}\log(2π) + \frac{T}{2}\log|A|^2 - \frac{T}{2}\ln|B|^2 - \frac{T}{2}trace(A'B'^{-1}B^{-1}A\Sigma_u)" /></a><br />
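For concreteness, the log-likelihood above translates almost term for term into numpy; <code>svar_loglike</code> below is an illustrative helper, a sketch of the formula rather than the statsmodels implementation:

```python
import numpy as np

def svar_loglike(A, B, sigma_u, T):
    """Concentrated SVAR log-likelihood from the formula above.

    A sketch, not the statsmodels code: A and B are the structural matrices,
    sigma_u the OLS residual covariance, T the number of observations.
    """
    n = A.shape[0]
    Binv = np.linalg.inv(B)
    # trace(A' B'^{-1} B^{-1} A Sigma_u)
    trace_term = np.trace(A.T @ Binv.T @ Binv @ A @ sigma_u)
    return (-T * n / 2 * np.log(2 * np.pi)
            + T / 2 * np.log(np.linalg.det(A) ** 2)
            - T / 2 * np.log(np.linalg.det(B) ** 2)
            - T / 2 * trace_term)
```

In the univariate case with A = [[1]], B = [[s]], and sigma_u = [[s^2]], the trace term equals 1 and the expression collapses to the familiar Gaussian log-likelihood, which is an easy sanity check.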
<br />
We could set up an example system as:<br />
<br />
<pre><div class="mycode">In [1]: import numpy as np
In [2]: A = np.array([[1, 'E', 0], [1, 'E', 'E'], [0, 0, 1]])
In [3]: A
Out[3]:
array([['1', 'E', '0'],
       ['1', 'E', 'E'],
       ['0', '0', '1']],
      dtype='|S1')
In [4]: B = np.array([['E', 0, 0], [0, 'E', 0], [0, 0, 'E']])
In [5]: B
Out[5]:
array([['E', '0', '0'],
       ['0', 'E', '0'],
       ['0', '0', 'E']],
      dtype='|S1')
</div></pre><br />
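One way the 'E' convention above could be turned into something numerical is a boolean mask of free parameters; this is just a sketch of the idea (<code>free_param_mask</code> is a hypothetical helper, not the SVAR internals):

```python
import numpy as np

def free_param_mask(M):
    """Split a matrix marked with 'E'/'e' into (mask, fixed).

    mask  : boolean array, True where a parameter must be estimated
    fixed : float array with the known entries kept and 0.0 at free spots
    (A hypothetical helper illustrating the marking convention.)
    """
    M = np.asarray(M, dtype=str)
    mask = np.char.lower(M) == 'e'          # free parameters marked 'E' or 'e'
    fixed = np.where(mask, '0', M).astype(float)
    return mask, fixed

A = np.array([[1, 'E', 0], [1, 'E', 'E'], [0, 0, 1]])
mask, fixed = free_param_mask(A)
# mask picks out the three unknown within-period coefficients of A
```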
In order to aid numerical maximum likelihood estimation, the SVAR class fit can also be passed guess matrices for both A and B parameters.<br />
<br />
Building upon the previously used three variable macro system, a simple example of an AB system estimation is as follows:<br />
<br />
<pre><div class="mycode">import numpy as np
import scikits.statsmodels.api as sm
from scikits.statsmodels.tsa.api import VAR, SVAR
import matplotlib.pyplot as plt
mdata = sm.datasets.macrodata.load().data
mdata = mdata[['realgdp','realcons','realinv']]
names = mdata.dtype.names
data = mdata.view((float,3))
data = np.diff(np.log(data), axis=0)
#define structural inputs
A = np.asarray([[1, 0, 0],['E', 1, 0],['E', 'E', 1]])
B = np.asarray([['E', 0, 0], [0, 'E', 0], [0, 0, 'E']])
A_guess = np.asarray([0.5, 0.25, -0.38])
B_guess = np.asarray([0.5, 0.1, 0.05])
mymodel = SVAR(data, svar_type='AB', A=A, B=B, names=names)
res = mymodel.fit(maxlags=3, maxiter=10000, maxfun=10000, solver='bfgs',
                  A_guess=A_guess, B_guess=B_guess)  # pass the starting values to the ML stage
res.irf(periods=30).plot(impulse='realgdp', plot_stderr=False)
plt.show()
</div></pre><br />
From here, we can access the estimates of both A and B:<br />
<br />
<pre><div class="mycode">In [2]: res.A
Out[2]:
array([[ 1.        ,  0.        ,  0.        ],
       [-0.50680204,  1.        ,  0.        ],
       [-5.53605672,  3.04117688,  1.        ]])
In [3]: res.B
Out[3]:
array([[ 0.00757566,  0.        ,  0.        ],
       [ 0.        ,  0.00512051,  0.        ],
       [ 0.        ,  0.        ,  0.02070894]])
</div></pre><br />
This also produces an SVAR IRF, resulting from an impulse shock to realgdp:<br />
<br />
<div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgcavkSB9xfUyL50GKKqnAEmY3xUCPM4RB9GVoZTBknRho2qIRVQOecwVdFpP8EFd0uForM8XlWHOLo-b4fW3OtsdR1IJBxp2U9NBZovBRUm4aiRx6O4wobe9k5D96up_aBFNSr5CdBM7U/s1600/svar1.png" imageanchor="1" style="margin-left:1em; margin-right:1em"><img border="0" height="320" width="320" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgcavkSB9xfUyL50GKKqnAEmY3xUCPM4RB9GVoZTBknRho2qIRVQOecwVdFpP8EFd0uForM8XlWHOLo-b4fW3OtsdR1IJBxp2U9NBZovBRUm4aiRx6O4wobe9k5D96up_aBFNSr5CdBM7U/s320/svar1.png" /></a></div><br />
We can check our work by comparing against comparable packages (see below) and by performing some simple calculations. If we estimated A and B correctly, then:<br />
<br />
<a href="http://www.codecogs.com/eqnedit.php?latex=\Sigma_u = A^{-1}BB(A^{-1})'" target="_blank"><img src="http://latex.codecogs.com/gif.latex?\Sigma_u = A^{-1}BB(A^{-1})'" title="\Sigma_u = A^{-1}BB(A^{-1})'" /></a><br />
<br />
We find:<br />
<br />
<pre><div class="mycode">In [9]: import numpy.linalg as npl
In [10]: P = np.dot(npl.inv(res.A), res.B)
In [11]: np.dot(P, P.T)
Out[11]:
array([[ 5.73906806e-05,  2.90857141e-05,  2.29263262e-04],
       [ 2.90857141e-05,  4.09603703e-05,  3.64524316e-05],
       [ 2.29263262e-04,  3.64524316e-05,  1.58721639e-03]])
In [12]: res.sigma_u
Out[12]:
array([[ 5.73907395e-05,  2.90857557e-05,  2.29263451e-04],
       [ 2.90857557e-05,  4.09604397e-05,  3.64524456e-05],
       [ 2.29263451e-04,  3.64524456e-05,  1.58721761e-03]])
</div></pre><br />
Through much trial and error, the 'bfgs' solver seems best suited to this maximum likelihood problem; I would like to investigate why in the future. Estimating the likelihood was aided greatly by the likelihood machinery already present in the base component of the statsmodels package.<br />
<br />
An equivalent system can be estimated and plotted with the following R [2] script:<br />
<br />
<pre><div class="mycode">library("vars")
data <- read.csv("/home/bart/statsmodels/scikits/statsmodels/datasets/macrodata/macrodata.csv")
names <- colnames(data)
data <- log(data[c('realgdp','realcons','realinv')])
data <- sapply(data, diff)
data = ts(data, start=c(1959,2), frequency=4)
amat <- matrix(0,3,3)
amat[1,1] <- 1
amat[2,1] <- NA
amat[3,1] <- NA
amat[2,2] <- 1
amat[3,2] <- NA
amat[3,3] <- 1
bmat <- diag(3)
diag(bmat) <- NA
var <- VAR(data, p = 3)  # the reduced-form VAR the SVAR call needs
svar <- SVAR(var, estmethod = 'scoring', Bmat=bmat, Amat=amat)
plot(irf(svar, n.ahead=30, impulse = 'realgdp'))
</div>
</pre><br />
This is the end of my Google Summer of Code project. In the future, I hope to continue work on SVAR, bring in long-run restrictions à la Blanchard and Quah, and further test solvers and their performance. I have benefited a lot from this project and would like to sincerely thank my mentors: Skipper Seabold, Josef Perktold, and Alan Isaac. They have been a great support and have answered my questions swiftly and completely. All blame for failure to complete the goals I set for myself at the beginning of the summer rests on me alone. Not only have I learned a lot about time series econometrics, but also quite a bit about how community software development works, and especially about realistic timelines. This has been an invaluable experience, and I plan to further improve my contributions to the project in the coming year.<br />
<br />
Bart Baker<br />
<br />
[1] Hamilton, James. 1994. Time Series Analysis. Princeton University Press: Princeton. <br />
[2] R Development Core Team. 2011. R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing. <http://www.R-project.org/><br />
Sims and Zha IRF Error Bands (2011-07-19)<br />
<br />
After completing the Monte Carlo error bands, I moved on to integrating Sims and Zha error bands into the statsmodels package. These are based on pages 1127 to 1129 of Chris Sims and Tao Zha's 1999 Econometrica article (Vol. 67, No. 5), "Error Bands for Impulse Responses."<br />
<br />
This method took a long time just to get my head around, and a lot of trial and error. While Sims and Zha focus on Bayesian sampling methods, drawing from the joint distribution of the coefficients and covariance matrix to generate the draws of the MA(n) representations, the method used here to generate these draws is a simpler Monte Carlo simulation. For these error bands to truly follow the prescription of Sims and Zha (SZ), the Bayesian sampling methods would need to be employed.<br />
<br />
Here's a quick overview of the theory.<br />
<br />
Given a covariance matrix Sigma, we can perform eigenvalue decomposition as such:<br />
<br />
<a href="http://www.codecogs.com/eqnedit.php?latex=W{\Lambda}W'=\Sigma" target="_blank"><img src="http://latex.codecogs.com/gif.latex?W{\Lambda}W'=\Sigma" title="W{\Lambda}W'=\Sigma" /></a><br />
<br />
<br />
where Lambda is diagonal and each diagonal element of Lambda is an eigenvalue of Sigma; column 'k' of W is the eigenvector corresponding to the 'k'th diagonal element of Lambda.<br />
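This decomposition is easy to check with numpy's eigh, which for a symmetric Sigma returns the eigenvalues (the diagonal of Lambda) and the eigenvector matrix W:

```python
import numpy as np

# Verify W Lambda W' = Sigma on a small sample covariance matrix.
rng = np.random.RandomState(0)
X = rng.randn(200, 3)
sigma = np.cov(X, rowvar=False)

lam, W = np.linalg.eigh(sigma)   # eigenvalues in ascending order; columns of W are eigenvectors
assert np.allclose(W @ np.diag(lam) @ W.T, sigma)
```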
<br />
We will also define our moving average representation or impulse response function as:<br />
<br />
<a href="http://www.codecogs.com/eqnedit.php?latex=c_{ij}" target="_blank"><img src="http://latex.codecogs.com/gif.latex?c_{ij}" title="c_{ij}" /></a><br />
<br />
where c is the time series vector ('t' to 't+h') response of variable 'i' to a shock in period 't' to variable 'j'.<br />
<br />
SZ make the case that when the time series model (VAR) is fit to data that is not smooth (differenced, etc.), most of the variation will be contained in the principal components of W. With this in mind, SZ propose three methods for characterizing the uncertainty around the estimated impulse responses.<br />
<br />
The three arrays of graphs that follow can be produced with this code:<br />
<br />
<pre><div class="mycode">import numpy as np
import scikits.statsmodels.api as sm
from scikits.statsmodels.tsa.api import VAR
import matplotlib.pyplot as plt
mdata = sm.datasets.macrodata.load().data
mdata = mdata[['realgdp','realcons','realinv']]
names = mdata.dtype.names
data = mdata.view((float,3))
data = np.diff(np.log(data), axis=0)
mymodel = VAR(data,names=names)
res = mymodel.fit(maxlags=3,ic=None)
res.irf(periods=20).plot(impulse='realgdp', stderr_type='sz1', repl=1000, seed=30)
res.irf(periods=20).plot(impulse='realgdp', stderr_type='sz2',repl=1000, seed=30)
res.irf(periods=30).plot(impulse='realgdp', stderr_type='sz3', repl=1000, seed=30)
plt.show()
</div></pre><br />
At this point, I have only implemented this for the non-orthogonalized impulse responses, but Sims and Zha explicitly address the orthogonalized case in their paper, and it is analogous to the methods described below.<br />
<br />
1) Symmetric, assumes Gaussian uncertainty. These error bands add and subtract the estimated impulse response functions with error completely defined by the principal component<sup><a href="#fn1" id="ref1">1</a></sup>:<br />
<br />
<a href="http://www.codecogs.com/eqnedit.php?latex=c_{ij}\pm W_{\cdot{k}}(t)\sqrt{\lambda_k}" target="_blank"><img src="http://latex.codecogs.com/gif.latex?c_{ij}\pm W_{\cdot{k}}(t)\sqrt{\lambda_k}" title="c_{ij}\pm W_{\cdot{k}}(t)\sqrt{\lambda_k}" /></a><br />
<br />
where W_{.k} is the column of W corresponding to the 'k'th eigenvalue of Sigma. The above equation gives the 68% probability bands; for 95% bands, the second term is simply multiplied by a scalar of 1.96 on both sides.<br />
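A sketch of this band construction (<code>sz1_bands</code> is a hypothetical helper; in the real code W and lam would come from the eigendecomposition of the covariance of the response's MC draws):

```python
import numpy as np

def sz1_bands(irf, W, lam, k, scale=1.0):
    """Symmetric Sims-Zha bands: irf +/- scale * W[:, k] * sqrt(lam[k]).

    irf : (h,) estimated response of one variable to one shock
    W, lam : eigenvectors / eigenvalues of the (h x h) covariance of the
             draws of that response
    scale : 1.0 for ~68% bands, 1.96 for ~95% bands
    (An illustrative helper, not the statsmodels implementation.)
    """
    half = scale * W[:, k] * np.sqrt(lam[k])
    return irf - half, irf + half
```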
<br />
In our three-variable model, this method produces the following error bands for the responses to a GDP shock:<br />
<br />
<div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhDqBHBYRrRbUtT8qsWDCtbCFYMVMCxaKaxCHjZO_glZ8EEBSsy0ad_lbEqHbWdL2dkIKuNtBqykZmArbzWUWEEZUQ44Vi2hnjjt62Eo5fiRHSQldptzZs0gSxgJ6EQ7V23_escK6M2ZEY/s1600/SZ1.png" imageanchor="1" style="margin-left:1em; margin-right:1em"><img border="0" height="320" width="320" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhDqBHBYRrRbUtT8qsWDCtbCFYMVMCxaKaxCHjZO_glZ8EEBSsy0ad_lbEqHbWdL2dkIKuNtBqykZmArbzWUWEEZUQ44Vi2hnjjt62Eo5fiRHSQldptzZs0gSxgJ6EQ7V23_escK6M2ZEY/s320/SZ1.png" /></a></div><br />
These look nice enough. As implemented, they are symmetric by construction, so I wasn't too worried about them.<br />
<br />
2) Non-symmetric error bands generated by Monte Carlo draws, where only covariance across time, but not across variables, is considered.<br />
<br />
In this case, instead of assuming Gaussian uncertainty, we retain the draws used to estimate the covariance matrix Sigma, and for each of these draws we calculate the vector gamma_k:<br />
<br />
<a href="http://www.codecogs.com/eqnedit.php?latex=\gamma_k=W_{k\cdot }\times c_{ij}" target="_blank"><img src="http://latex.codecogs.com/gif.latex?\gamma_k=W_{k\cdot }\times c_{ij}" title="\gamma_k=W_{k\cdot }\times c_{ij}" /></a><br />
<br />
where W_{k.} is the 'k'th row of W, and k refers to the largest eigenvalue of Sigma. Using the quantiles of the elements of gamma_k across the MC draws, we can generate 68% probability bands as follows:<br />
<br />
<a href="http://www.codecogs.com/eqnedit.php?latex=\hat{c}_{ij}@plus;\gamma_{k,.16}, \hat{c}_{ij}@plus;\gamma_{k,.84}" target="_blank"><img src="http://latex.codecogs.com/gif.latex?\hat{c}_{ij}+\gamma_{k,.16}, \hat{c}_{ij}+\gamma_{k,.84}" title="\hat{c}_{ij}+\gamma_{k,.16}, \hat{c}_{ij}+\gamma_{k,.84}" /></a><br />
<br />
where the subscripts on gamma_k refer to the 16th and 84th percentiles of the gamma draws. In our three-variable case, this produces the following graphical representation:<br />
<br />
<div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEj9vO060OqDEzY2Q2AvYIZJjXsXFpkyqnr9o6nsFXYnaTc-KO7gUbr5CelDdNbXxSAI7B23Ej4Jiodi5P7-KtU05D_9o8jFVhdobhBxknqZeN2s8bqWLrlrOcrvWSB68HXTIDc-pVhS6Zc/s1600/SZ2.png" imageanchor="1" style="margin-left:1em; margin-right:1em"><img border="0" height="320" width="320" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEj9vO060OqDEzY2Q2AvYIZJjXsXFpkyqnr9o6nsFXYnaTc-KO7gUbr5CelDdNbXxSAI7B23Ej4Jiodi5P7-KtU05D_9o8jFVhdobhBxknqZeN2s8bqWLrlrOcrvWSB68HXTIDc-pVhS6Zc/s320/SZ2.png" /></a></div><br />
We can see that in this case the uncertainty drops precipitously once we pass a certain t in the time series representation.<br />
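One plausible reading of this quantile construction in numpy, with the draws centered at the point estimate (<code>sz2_bands</code> is an illustrative helper and the centering is an assumption, not a transcription of the statsmodels code):

```python
import numpy as np

def sz2_bands(irf_hat, irf_draws, k=-1, q=(0.16, 0.84)):
    """Quantile-based Sims-Zha bands for one response vector (a sketch).

    irf_hat   : (h,) point estimate of the response
    irf_draws : (repl, h) Monte Carlo draws of the same response
    k=-1 picks the eigenvector of the largest eigenvalue, since
    np.linalg.eigh sorts eigenvalues in ascending order.
    """
    sigma = np.cov(irf_draws, rowvar=False)
    lam, W = np.linalg.eigh(sigma)                 # lam ascending; W columns are eigenvectors
    # project each draw's deviation from the point estimate onto eigenvector k
    gamma = (irf_draws - irf_hat) @ W[:, k]
    lo, hi = np.percentile(gamma, [100 * q[0], 100 * q[1]])
    return irf_hat + lo * W[:, k], irf_hat + hi * W[:, k]
```

Because the band is the point estimate plus a signed multiple of an eigenvector, nothing forces it to bracket the estimate at every t, which matches the behavior noted below for the consumption shock.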
<br />
Just for completeness, if we look at the response of the same variables to a shock to real consumption, we notice that this method does not even guarantee that the probability bands contain the estimated impulse response function:<br />
<br />
<div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiUJ5pMWJNCHljJkWhsaIaGKs_x1DaA5IT0gYk4yuUPoc65XOMQk92lAX0YHC4NYN70YeCOlWZYbeRUW7w_3xpqFBYsUOtmcH1OVMeuZEDnHH2mt1fEzLfTEI-2ExzrFvjwgH3nJgoIvSY/s1600/SZ2a.png" imageanchor="1" style="margin-left:1em; margin-right:1em"><img border="0" height="320" width="320" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiUJ5pMWJNCHljJkWhsaIaGKs_x1DaA5IT0gYk4yuUPoc65XOMQk92lAX0YHC4NYN70YeCOlWZYbeRUW7w_3xpqFBYsUOtmcH1OVMeuZEDnHH2mt1fEzLfTEI-2ExzrFvjwgH3nJgoIvSY/s320/SZ2a.png" /></a></div><br />
While this was extremely worrisome at first, Sims and Zha do give some examples where the probability bands do not contain the estimated impulse response function for certain 't'. These error bands are meant to give the researcher an idea of the symmetry (or lack thereof) of the posterior distribution of the impulse response functions.<br />
<br />
3) Non-symmetric error bands generated by Monte Carlo draws, where covariance both across time and across variables is considered.<br />
<br />
Here, instead of treating each vector of responses individually, we consider each set of impulse response functions as a single system. Sims and Zha note that while in most cases the majority of the covariance will be between intertemporal observations of a single variable, considering inter-variable time series covariance may be valuable in certain situations.<br />
<br />
In order to investigate how the different c_ij relate to each other, we stack the impulse response functions that respond to a single shock j into one long vector per MC draw, and then compute the eigendecomposition of the covariance of this stacked vector across draws. This yields eigenvectors that capture variation both across time periods and across variables. We calculate gamma_k in a manner analogous to the second method and add the appropriate gamma_k quantiles to the estimated response.<br />
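The stacking step can be sketched as follows (<code>sz3_stack</code> is an illustrative helper; the layout of the draws array is an assumption, not the statsmodels internals):

```python
import numpy as np

def sz3_stack(irf_draws):
    """Stack all responses to a single shock and decompose them jointly.

    irf_draws : (repl, h, n) draws of the h-period responses of n variables
                to one shock j
    Returns eigenvalues/eigenvectors of the (h*n x h*n) covariance of the
    stacked draws, which captures covariance across time *and* variables.
    (A sketch of the idea, not the statsmodels code.)
    """
    repl, h, n = irf_draws.shape
    stacked = irf_draws.reshape(repl, h * n)      # one long vector per draw
    lam, W = np.linalg.eigh(np.cov(stacked, rowvar=False))
    return lam, W, stacked
```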
<br />
It happens that in our system, the cross-variable covariance does not reveal much additional information about a shock to GDP:<br />
<br />
<div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjbnf4qm2XyLVYguwHtMuDQeiz2T33ql4qki756xwBIqbIAt3F4sYc2XtY7sLOaZfN_xIjbuWxzh5Ai-qxciqijZKWS_pIRTK9spvW7rHQ6ZGwGoQhncTnn6YvwQdaoSsIzQtudVf0hPlQ/s1600/SZ3.png" imageanchor="1" style="margin-left:1em; margin-right:1em"><img border="0" height="320" width="320" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjbnf4qm2XyLVYguwHtMuDQeiz2T33ql4qki756xwBIqbIAt3F4sYc2XtY7sLOaZfN_xIjbuWxzh5Ai-qxciqijZKWS_pIRTK9spvW7rHQ6ZGwGoQhncTnn6YvwQdaoSsIzQtudVf0hPlQ/s320/SZ3.png" /></a></div><br />
It seems as though the variation in real GDP dominates the variation in the other variables.<br />
<br />
This can also be seen by examining the response of real GDP to all three shocks:<br />
<br />
<div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhMVJhKY4QL17pwT2el3lzci_u2xWopedQlffwOdvmJ8qyEgJlnWhy6yRqmtK9IZFSQf6xi2CY8weffkcSwYeRqZpB53WSIZxwKMhnxLUFyIZOl4bIQC2-3bsEYm1q7WKENW_kiBjD7uxg/s1600/SZ3a.png" imageanchor="1" style="margin-left:1em; margin-right:1em"><img border="0" height="320" width="320" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhMVJhKY4QL17pwT2el3lzci_u2xWopedQlffwOdvmJ8qyEgJlnWhy6yRqmtK9IZFSQf6xi2CY8weffkcSwYeRqZpB53WSIZxwKMhnxLUFyIZOl4bIQC2-3bsEYm1q7WKENW_kiBjD7uxg/s320/SZ3a.png" /></a></div><br />
Altogether these methods are meant to be used to examine the characteristics of the time series representation of the data. The first Sims and Zha method would most likely be the one to publish in a paper, but the other two methods give the researcher a more complete picture of the nature of the posterior distribution.<br />
<br />
A few small things still need to be worked out for the Sims and Zha methods to be a complete part of the VAR package. As I mentioned before, they still need to be implemented for orthogonalized IRFs, but this will not be difficult (the paper is very clear in moving towards this implementation). It will also be important to give the user a choice of which principal component to use when characterizing the data. For Sims and Zha error bands 1 and 2, the user can pass in a matrix of integers that correspond to the chosen principal components of the variance-covariance matrix.<br />
<br />
Scheduling here on out (let me know what you think, updated):<br />
<br />
1) I'd like to completely integrate Sims and Zha error bands. Specifically, this means:<br />
<br />
a) Component choice for SZ3<br />
<br />
b) Orthogonalized error bands<br />
<br />
c) Clean up code<br />
<br />
I will aim to finish the above tasks by this Saturday (7/23).<br />
<br />
2) After this, I would like to move on to structural VAR implementation. This in itself will feed back on the error band methods that I have been working on. SVAR will draw from the ML methods already present in statsmodels. I'd like to finish a SVAR estimation method by the August 3rd.<br />
<br />
3) Once SVAR is completely integrated into the package, per Skipper's suggestion, I will be using pure Python to generalize the Kalman filter. I'll have more questions once I reach that point.<br />
<br />
I'd really like to move much quicker than I did in the first half of GSOC and hit these goals.<br />
<br />
Bart<br />
<br />
<br />
<br />
<br />
<br />
<hr></hr><br />
<sup id="fn1">1. SZ suggest using the largest eigenvalue(s) of Lambda, as they will most likely identify the majority of the variation in this type of data.<a href="#ref1" title="Jump back to footnote 1 in the text.">↩</a></sup><br />
<br />
Monte Carlo Standard Errors for Impulse Response Functions (2011-06-11)<br />
<br />
I've come to a major checkpoint in integrating Monte Carlo error bands for impulse response functions (non-orthogonalized only, for now).<br />
<br />
Here is some quick code to get VAR IRF MC standard errors:<br />
<br />
<pre><div class="mycode">import numpy as np
import scikits.statsmodels.api as sm
from scikits.statsmodels.tsa.api import VAR
import matplotlib.pyplot as plt
mdata = sm.datasets.macrodata.load().data
mdata = mdata[['realgdp','realcons','realinv']]
#mdata = mdata[['realgdp','realcons','realinv', 'pop', 'realdpi', 'tbilrate']]
names = mdata.dtype.names
data = mdata.view((float,3))
data = np.diff(np.log(data), axis=0)
mymodel = VAR(data,names=names)
res = mymodel.fit(maxlags=3,ic=None)
#first generate asymptotic standard errors (to compare)
res.irf(periods=30).plot(orth=False, stderr_type='asym')
#then generate monte carlo standard errors
res.irf(periods=30).plot(orth=False, stderr_type='mc', seed=29, repl=1000)
plt.show()
</div></pre><br />
This produces the following plots of a shock to (delta) realgdp.<br />
<br />
Asymptotic:<br />
<br />
<div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhi6Lnvv6JGBNHxTV7i-bz4mJEcaqGwMdaXh18D7cSq8XDt5iKjDxro4LGgGXs4j0G_ZIHNaX7y9jAKtBmC6p93tqij_UZ7sjyQ8FHwLfxM-56_UFT11qjYgM31OspMG032ev8jKv_rsWk/s1600/asym.png" imageanchor="1" style="margin-left:1em; margin-right:1em"><img border="0" height="320" width="320" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhi6Lnvv6JGBNHxTV7i-bz4mJEcaqGwMdaXh18D7cSq8XDt5iKjDxro4LGgGXs4j0G_ZIHNaX7y9jAKtBmC6p93tqij_UZ7sjyQ8FHwLfxM-56_UFT11qjYgM31OspMG032ev8jKv_rsWk/s320/asym.png" /></a></div><br />
Monte Carlo:<br />
<br />
<div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEj274CsqjYXiN3EbWQYOwpWY2CtiVrA1X7eR372YHxjsOTh2mAewLx7NCBFC6LcBFI9aNK7yf25oeyG6nqixUMYdCM7LTxe3nhX4YKkjlecghhbalabDYCFgSWGie2QkUzUSyehHYbKxDU/s1600/mc.png" imageanchor="1" style="margin-left:1em; margin-right:1em"><img border="0" height="320" width="320" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEj274CsqjYXiN3EbWQYOwpWY2CtiVrA1X7eR372YHxjsOTh2mAewLx7NCBFC6LcBFI9aNK7yf25oeyG6nqixUMYdCM7LTxe3nhX4YKkjlecghhbalabDYCFgSWGie2QkUzUSyehHYbKxDU/s320/mc.png" /></a></div><br />
I added functions to the VARResults and IRAnalysis classes, extended some of the pre-existing functions in these classes, and also modified plotting.py. Because Monte Carlo standard errors are in general not symmetric, I had to alter the plot_with_error function in tsa/vector_ar/plotting.py and a number of other functions.<br />
<br />
Functions added:<br />
<br />
VARResults.stderr_MC_irf(self, orth=False, repl=1000, T=25, signif=0.05, seed=None)<br />
This function generates a tuple holding the lower and upper error bands produced by Monte Carlo simulations.<br />
<br />
IRAnalysis.cov_mc(self, orth=False, repl=1000, signif=0.05, seed=None)<br />
This simply calls stderr_MC_irf on the original model, using the number of periods specified when irf() is called from the VAR class.<br />
<br />
Modified functions:<br />
<br />
BaseIRAnalysis.plot<br />
Added specification of the error type, stderr_type='asym' or 'mc' (see the example), along with repl (replications) and seed options when 'mc' errors are specified.<br />
<br />
In tsa/vector_ar/plotting.py: plot_with_error() and irf_grid_plot(). These functions now take the error type specified in the plot() call above and treat the errors accordingly. Because all of the earlier VAR work (especially plotting) assumed asymptotic standard errors, the IRF plot functions expected the errors as a single matrix, with each standard error depending on the MA lag length and shock variable. Now, if stderr_type='mc', the functions take a tuple of arrays as the standard error rather than a single array.<br />
<br />
A serious issue right now is speed. While the asymptotic standard errors take about half a second to run on my laptop, the Monte Carlo standard errors with 3 variables and 1000 replications take about 13 seconds. Each replication discards the first 100 observations. The most taxing part of generating the errors is re-simulating the data, assuming normally distributed errors, 1000 times (using the util.varsim() function).<br />
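The procedure can be sketched in miniature with a univariate AR(1) standing in for the VAR; the helper names (simulate_ar1, fit_ar1) and the 500-replication setup here are illustrative, not the statsmodels API:

```python
import numpy as np

rng = np.random.RandomState(29)

def simulate_ar1(phi, sigma, nobs, burn=100):
    """Simulate an AR(1), discarding the first `burn` observations."""
    y = np.zeros(nobs + burn)
    e = rng.randn(nobs + burn) * sigma
    for t in range(1, nobs + burn):
        y[t] = phi * y[t - 1] + e[t]
    return y[burn:]

def fit_ar1(y):
    """OLS of y_t on y_{t-1}; returns the slope and residual std. dev."""
    x, z = y[:-1], y[1:]
    phi_hat = np.dot(x, z) / np.dot(x, x)
    resid = z - phi_hat * x
    return phi_hat, resid.std(ddof=1)

# point estimate from one "observed" series
y = simulate_ar1(0.6, 1.0, 300)
phi_hat, sig_hat = fit_ar1(y)
horizons = np.arange(11)

# re-simulate from the fitted model, refit, and collect the implied IRFs
irfs = np.empty((500, len(horizons)))
for r in range(500):
    phi_r, _ = fit_ar1(simulate_ar1(phi_hat, sig_hat, len(y)))
    irfs[r] = phi_r ** horizons            # AR(1) impulse response: phi^h

# pointwise Monte Carlo bands around the impulse response
lower, upper = np.percentile(irfs, [2.5, 97.5], axis=0)
```

The cost structure mirrors the real implementation: almost all the time goes into the repeated simulate-and-refit loop, which is exactly the bottleneck described above.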
<br />
Bart<br />
<br />
5/6 - 5/20: Getting Acclimated and the VARResults.reorder() Function (2011-05-23)<br />
<br />
Just finished the official first two weeks of GSoC. I spent most of the first week getting a feel for how the time series methods in statsmodels are organized. Once I felt comfortable within the VAR package, I began work on adding a reorder() method to the VARResults class, which allows the user to specify the order of the endogenous variables in a vector autoregression system. While the order of the variables plays no role in estimating the system, if the shocks are to be identified, the variable order specifies the within-period impact of shocks to individual variables in the system.<br />
<br />
For example, let us say that we have a 3-variable VAR system, originally ordered realgdp, realcons, realinv, contained within the VARResults instance 'res'. Reordering the variables in the system is as simple as follows:<br />
<br />
<pre><div class="mycode">In [1]: import scikits.statsmodels.api as sm
In [2]: import numpy as np
In [3]: mdata = sm.datasets.macrodata.load().data
In [4]: mdata = mdata[['realgdp','realcons','realinv']]
In [5]: names = mdata.dtype.names
In [6]: data = mdata.view((float,3))
In [7]: from scikits.statsmodels.tsa.api import VAR
In [8]: res = VAR(data, names=names).fit(maxlags=3,ic=None)
In [9]: res.names
Out[9]: ['realgdp', 'realcons', 'realinv']
In [10]: res_re = res.reorder(['realinv','realcons','realgdp'])
In [11]: res_re.names
Out[11]: ['realinv', 'realcons', 'realgdp']
</div></pre><br />
The reorder function reuses all of the results from the original VAR class, but rearranges them to line up with the new ordering. When working with a large number of observations, the computational advantage becomes useful quickly. For example, for a 100,000-observation system with three variables, re-estimating the system after changing the variable order took 3.37 seconds, while the reorder function took 0.57 seconds.<br />
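The reason reorder() is cheap is that reordering amounts to applying one permutation to the rows (equations) and columns (variables) of each stored array, so nothing is re-estimated. A sketch of the idea on stand-in arrays (not the VARResults code):

```python
import numpy as np

names = ['realgdp', 'realcons', 'realinv']
order = ['realinv', 'realcons', 'realgdp']
idx = [names.index(v) for v in order]       # permutation: [2, 1, 0]

A1 = np.arange(9.0).reshape(3, 3)           # stand-in lag-1 coefficient matrix
A1_re = A1[np.ix_(idx, idx)]                # permute rows (equations) and columns (variables)

sigma = np.eye(3)                           # stand-in residual covariance
sigma_re = sigma[np.ix_(idx, idx)]          # same permutation applies here
```

np.ix_ builds the open mesh so that the same index list permutes both axes at once, which is all a variable reordering requires.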
<br />
In the next few weeks I am planning to add more impulse response function error band estimation methods; the current package only includes analytical error bands.<br />
<br />
Statsmodels: GSoC Prelim (2011-04-23)<br />
<br />
This is my first entry in my statsmodels GSoC Summer 2011 blog. I will update this blog weekly as the summer goes by with progress on my work for the scikits.statsmodels Python library.<br />
<br />
In order to complete the preparation process for the statsmodels Google Summer of Code sponsorship, I wrote a quick patch that included a cointegration test. As of now, the test can only be run on a bivariate system, with a simple Dickey-Fuller test on the residuals using the MacKinnon [1] critical values. I would like to expand the test to allow for augmented Dickey-Fuller tests of the residuals and also tests of multivariate cointegrated systems. This patch served as a nice way to dive into the time series methods scikits.statsmodels currently includes in its toolbox and where to go from here.<br />
<br />
Statsmodels: GSoC Week 4 Update (2010-06-22)<br />
<br />
I spent the last week finishing up the paper that I submitted to accompany my talk at the <a href="http://conference.scipy.org/scipy2010/">SciPy conference</a>. I am really looking forward to going to Austin and hearing all the great talks (plus I hear the beer is cheap and the food and music are good, which doesn't hurt). In addition to finishing the paper, I have started to clean up our time series code.<br />
<br />
So far this has included finishing the augmented Dickey-Fuller (ADF) test for unit roots. The big time sink here is that the ADF test-statistic has a non-standard distribution in most cases. The ADF test statistic is obtained by running the following regression<br />
<br />
<a href="http://www.codecogs.com/eqnedit.php?latex=%5CDelta%20y_%7Bt%7D%20=%20%5Calpha@plus;%5Cbeta%20t@plus;%5Cgamma%20y_%7Bt-1%7D@plus;%5Cdelta_%7B1%7D%5CDelta%20y_%7Bt-1%7D%20@plus;%20%5Ccdots%20@plus;%5Cdelta_%7Bp%7D%5CDelta%20y_%7Bt-p%7D" target="_blank"><img src="http://latex.codecogs.com/gif.latex?%5CDelta%20y_%7Bt%7D%20=%20%5Calpha+%5Cbeta%20t+%5Cgamma%20y_%7Bt-1%7D+%5Cdelta_%7B1%7D%5CDelta%20y_%7Bt-1%7D%20+%20%5Ccdots%20+%5Cdelta_%7Bp%7D%5CDelta%20y_%7Bt-p%7D" title="\Delta y_{t} = \alpha+\beta t+\gamma y_{t-1}+\delta_{1}\Delta y_{t-1} + \cdots +\delta_{p}\Delta y_{t-p}" /></a><br />
<br />
One approach to testing for a unit root involves testing the t-statistic on the coefficient of the lagged level of <i>y</i>. The actual distribution of this statistic, however, is not Student's t. Many software packages use the tables in Fuller (1976, updated to the 1996 version or not) to get the critical values for the test statistic depending on the sample size, using linear interpolation for sample sizes not included in the table. The p-values for the obtained test statistic are usually taken from MacKinnon's (1994) study, which estimated regression surfaces for these distributions via Monte Carlo simulation.<br />
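To make the test statistic concrete, the ADF regression above can be run directly with numpy least squares; this is purely illustrative (<code>adf_tstat</code> is a made-up helper, and the real adfuller additionally handles lag selection, the MacKinnon p-values, and other details):

```python
import numpy as np

def adf_tstat(y, maxlag, trend=True):
    """t-statistic on gamma, the coefficient of y_{t-1}, in the ADF
    regression above (a numpy sketch, not the statsmodels implementation)."""
    dy = np.diff(y)
    T = len(dy) - maxlag                      # usable observations
    # regressors: constant, [linear trend], y_{t-1}, maxlag lagged differences
    X = [np.ones(T), np.arange(T)] if trend else [np.ones(T)]
    X.append(y[maxlag:-1])                    # lagged level
    for i in range(1, maxlag + 1):
        X.append(dy[maxlag - i:-i])           # i-th lagged difference
    X = np.column_stack(X)
    z = dy[maxlag:]
    beta, _, _, _ = np.linalg.lstsq(X, z, rcond=None)
    resid = z - X @ beta
    s2 = resid @ resid / (T - X.shape[1])
    se = np.sqrt(s2 * np.diag(np.linalg.inv(X.T @ X)))
    gamma_pos = 2 if trend else 1             # column holding y_{t-1}
    return beta[gamma_pos] / se[gamma_pos]
```

The point of the surrounding discussion is that this t-statistic cannot be compared to Student's t tables; it must be referred to the Dickey-Fuller/MacKinnon critical values.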
<br />
While we do use MacKinnon's approximate p-values from the 1994 paper, MacKinnon wrote a note updating this paper in early 2010, which gives new regression surface results for obtaining the critical values. We use these new results for the critical values. Therefore, when using our ADF test, it is advised that if the p-value is close to the reject/accept region then the critical values should be used in place of the p-value to make the ultimate decision.<br />
<br />
We can illustrate the use of ADF. Note that this version is only in my branch and that it is still in the sandbox, even though it has now been tested, because the API and returned results may change. We will demonstrate on a series that we can easily guess is non-stationary, real GDP.<br />
<br />
<br />
<pre><div class="mycode">In [1]: import scikits.statsmodels as sm
In [2]: from scikits.statsmodels.sandbox.tsa.stattools import adfuller
In [3]: data = sm.datasets.macrodata.load()
In [4]: realgdp = data.data['realgdp']
In [5]: adf = adfuller(realgdp, maxlag=4, autolag=None, regression="ct")
In [6]: adf
Out[6]:
(-1.8566384063254346,
0.67682917510440099,
4,
198,
{'1%': -4.0052351400496136,
'10%': -3.1402115863254525,
'5%': -3.4329000694218998})
</div></pre><br />
<br />
The return values are the test statistic, its p-value (the null hypothesis here is that the series <i>does</i> contain a unit root), the number of lags of the differences used, the number of observations for the regression, and a dictionary containing the critical values at the respective significance levels. The regression option controls the type of regression (i.e., whether to include a constant or a linear or quadratic time trend), and the autolag option has three choices for selecting the lag length to help correct for serial correlation in the regression: 'AIC', 'BIC', and 't-stat'. The first two choose the lag length that minimizes the information criterion; the last chooses the lag length based on the significance of the longest lag, starting with maxlag and working its way down. The docstring has more detailed information.<br />
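To make the mechanics concrete, here is a rough numpy-only sketch of the regression underlying the test statistic. This is not the statsmodels implementation: the function name adf_tstat and its layout are invented for illustration, and it deliberately computes no p-values or critical values, which is the hard part discussed above.

```python
import numpy as np

def adf_tstat(y, maxlag=4, regression="ct"):
    """t-statistic on the lagged level in the ADF regression shown above.

    Regress dy_t on [const, trend, y_{t-1}, dy_{t-1}, ..., dy_{t-maxlag}]
    and return the t-stat on y_{t-1} (gamma). Illustrative only.
    """
    y = np.asarray(y, dtype=float)
    dy = np.diff(y)
    p = maxlag
    lhs = dy[p:]                                  # dependent variable
    nobs = lhs.shape[0]
    cols = [np.ones(nobs)]                        # constant
    if regression == "ct":
        cols.append(np.arange(1.0, nobs + 1))     # linear time trend
    cols.append(y[p:-1])                          # lagged level y_{t-1}
    for i in range(1, p + 1):                     # lagged differences
        cols.append(dy[p - i:-i])
    X = np.column_stack(cols)
    beta, _, _, _ = np.linalg.lstsq(X, lhs, rcond=None)
    resid = lhs - X @ beta
    sigma2 = resid @ resid / (nobs - X.shape[1])
    cov = sigma2 * np.linalg.inv(X.T @ X)
    k = 2 if regression == "ct" else 1            # position of gamma
    return beta[k] / np.sqrt(cov[k, k])
```

The non-standard distribution means this number cannot be compared to the usual t tables; that is exactly why the Fuller/MacKinnon surfaces are needed.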
<br />
<br />
<br />
<br />
Beyond this, I have been working on an autocorrelation function (acf), a partial autocorrelation function (pacf), and Q-statistics (the Ljung-Box test). Next up for this week is finishing my VAR class with identification schemes. After this, I will work to integrate post-estimation tests into our results classes, most likely using some sort of mix-in classes, and attach test containers to the results objects for test results. Then it's off to the SciPy conference. There I will hopefully be participating in the stats sprint, helping out with the docs marathon and discussing what we need for the future of statistics and Python.<br />
<br />
<br />
<pre><div class="mycode">Fuller, W.A. 1996. <i>Introduction to Statistical Time Series.</i> 2nd ed. Wiley.
MacKinnon, J.G. 1994. "Approximate asymptotic distribution functions for
unit-root and cointegration tests." <i>Journal of Business and Economic
Statistics</i> 12, 167-76.
MacKinnon, J.G. 2010. "Critical Values for Cointegration Tests."
Queen's University, Dept of Economics, Working Papers. Available at
http://ideas.repec.org/p/qed/wpaper/1227.html
</div></pre>jseaboldhttp://www.blogger.com/profile/16419160807917835179noreply@blogger.com2tag:blogger.com,1999:blog-128274497687662608.post-37748952510873894002010-06-11T19:11:00.004-04:002010-06-14T11:54:36.211-04:00Statsmodels: GSoC Week 3 Update[<b>Edit</b>: Formatting should be fixed now. I will not be reformatting old posts though, so that they don't get reposted at <a href="http://planet.scipy.org">Planet SciPy</a>]<br /><br /><br />
<br />
Last week was spent mainly ensuring that I pass my comps and remain a PhD student. This week was much more productive for coding. For now, all changes are in my branch and have not been merged to trunk, but I will describe the two big changes.<br /><br /><br />
<br />
The first concerns the datasets package. This one is not all that exciting, but suffice it to say that the datasets are now streamlined and use the <a href="http://code.activestate.com/recipes/52308/">Bunch pattern</a> to load the data. Thanks, Gaël, for pointing this out. I also rewrote a bit of David's datasets proposal from scikits-learn to reflect the current design of our datasets and thoughts. You can see it <a href="http://bazaar.launchpad.net/%7Ejsseabold/statsmodels/statsmodels-skipper/annotate/head:/scikits/statsmodels/datasets/DATASET_PROPOSAL.rst">here</a> (soon to be on the docs page). We are making an effort to ensure that our datasets are going to be similar to those of scikits-learn.<br /><br /><br />
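For readers unfamiliar with the pattern, here is a minimal sketch of a Bunch. This is not the actual datasets code, which attaches arrays and metadata such as endog, exog, and variable names; it just shows the idea behind the linked recipe.

```python
class Bunch(dict):
    """Dictionary whose keys are also accessible as attributes.

    A minimal sketch of the pattern linked above; the real
    statsmodels/scikit-learn versions carry more metadata.
    """
    def __init__(self, **kwargs):
        dict.__init__(self, kwargs)
        self.__dict__ = self    # attribute and key access share storage

# attribute access and key access refer to the same data
data = Bunch(endog=[1.0, 0.0], exog=[[1.0, 2.0], [1.0, 3.0]])
```

So `data.endog` and `data["endog"]` are interchangeable, which is what makes the loaded datasets pleasant to work with interactively.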
<br />
The second change was an improvement of the fitting of maximum likelihood models and the start of a GenericLikelihoodModel class. Maximum likelihood based models (mainly discrete choice models in the main code base right now) can now be fit using any of the unconstrained solvers from scipy.optimize (Nelder-Mead, BFGS, CG, Newton-CG, Powell) plus Newton-Raphson. To take a simple example to see how it works, we can fit a Probit model.<br /><br /><br />
<br />
<div class = "mycode">In [1]: import scikits.statsmodels as sm<br />
<br />
In [2]: data = sm.datasets.spector.load()<br />
<br />
In [3]: data.exog = sm.add_constant(data.exog)<br />
<br />
In [4]: res_newton = sm.Probit(data.endog, data.exog).fit(method="newton")<br />
Optimization terminated successfully.<br />
Current function value: 12.818804<br />
Iterations 6<br />
<br />
In [5]: res_nm = sm.Probit(data.endog, data.exog).fit(method="nm", maxiter=500)<br />
Optimization terminated successfully.<br />
Current function value: 12.818804<br />
Iterations: 439<br />
Function evaluations: 735<br />
<br />
In [6]: res_bfgs = sm.Probit(data.endog, data.exog).fit(method="bfgs")<br />
Optimization terminated successfully.<br />
Current function value: 12.818804<br />
Iterations: 15<br />
Function evaluations: 21<br />
Gradient evaluations: 21<br />
<br />
In [7]: res_cg = sm.Probit(data.endog, data.exog).fit(method="cg", maxiter=250)<br />
Optimization terminated successfully.<br />
Current function value: 12.818804<br />
Iterations: 188<br />
Function evaluations: 428<br />
Gradient evaluations: 428<br />
<br />
In [8]: res_ncg = sm.Probit(data.endog, data.exog).fit(method="ncg", avextol=1e-8)<br />
Optimization terminated successfully.<br />
Current function value: 12.818804<br />
Iterations: 12<br />
Function evaluations: 14<br />
Gradient evaluations: 12<br />
Hessian evaluations: 12<br />
<br />
In [9]: res_powell = sm.Probit(data.endog, data.exog).fit(method="powell", ftol=1e-8)<br />
Optimization terminated successfully.<br />
Current function value: 12.818804<br />
Iterations: 12<br />
Function evaluations: 568</div><br /><br /><br />
<br />
All of the options for the solvers are available and are documented in the fit method. As you can see, some of the default values need to be changed to ensure (accurate) convergence. The Results objects that are returned have two new attributes.<br /><br /><br />
<br />
<div class = "mycode"><br />
In [10]: res_powell.mle_retvals<br />
Out[10]: <br />
{'converged': True,<br />
'direc': array([[ 7.06629660e-02, -3.07499922e-03, 5.38418734e-01,<br />
-4.19910465e-01],<br />
[ 0.00000000e+00, 1.00000000e+00, 0.00000000e+00,<br />
0.00000000e+00],<br />
[ 1.49194876e+00, -6.64992809e-02, -6.96792443e-03,<br />
-3.22306873e+00],<br />
[ -5.36227277e-02, 1.18544093e-01, -8.75205765e-02,<br />
-2.42149981e+00]]),<br />
'fcalls': 568,<br />
'fopt': 12.818804069990534,<br />
'iterations': 12,<br />
'warnflag': 0}<br />
<br />
In [11]: res_powell.mle_settings<br />
Out[11]: <br />
{'callback': None,<br />
'disp': 1,<br />
'fargs': (),<br />
'ftol': 1e-08,<br />
'full_output': 1,<br />
'maxfun': None,<br />
'maxiter': 35,<br />
'optimizer': 'powell',<br />
'retall': 0,<br />
'start_direc': None,<br />
'start_params': [0, 0, 0, 0],<br />
'xtol': 0.0001}</div><br />
The dict mle_retvals contains all of the values that are returned from the solver if the full_output keyword is True. The dict mle_settings contains all of the arguments passed to the solver, including the defaults so that these can be checked after the fit. Again, all settings and returned values are documented in the fit method and in the results class, respectively.<br /><br /><br />
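Under the hood these fit methods dispatch to the scipy.optimize solvers. The following self-contained sketch shows roughly what that amounts to, using synthetic data as a stand-in for the Spector dataset (the coefficients, sample size, and variable names are invented for illustration, and the modern scipy.optimize.minimize front end is used here rather than the individual fmin_* functions of the era).

```python
import numpy as np
from scipy import optimize, stats

# Synthetic probit data (a hypothetical stand-in for the Spector data,
# which ships with statsmodels).
rng = np.random.default_rng(12345)
n = 1000
X = np.column_stack([rng.standard_normal(n), np.ones(n)])
true_beta = np.array([1.0, -0.5])
y = (X @ true_beta + rng.standard_normal(n) > 0).astype(float)

def neg_loglike(params):
    """Negative probit log-likelihood -- the objective the solvers minimize."""
    q = 2.0 * y - 1.0
    return -np.sum(stats.norm.logcdf(q * (X @ params)))

# Roughly what fit(method="nm") and fit(method="bfgs") wrap:
res_nm = optimize.minimize(neg_loglike, np.zeros(2), method="Nelder-Mead")
res_bfgs = optimize.minimize(neg_loglike, np.zeros(2), method="BFGS")
```

Both solvers land on the same maximum likelihood estimates, just as the Probit example above shows the same log-likelihood value for every method.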
<br />
Lastly, I started a GenericLikelihoodModel class. This is currently unfinished, though the basic idea is laid out. Take again the Probit example above using Lee Spector's <a href="http://www.jstor.org/pss/1182446">educational program data</a>, and assume we didn't have the Probit model from statsmodels. We could use the new GenericLikelihoodModel class. There are two ways (probably more) to proceed. For those comfortable with object-oriented programming and inheritance in Python, we could subclass GenericLikelihoodModel, defining our log-likelihood method.<br /><br /><br />
<br />
<div class="mycode">from scikits.statsmodels import GenericLikelihoodModel as LLM<br />
from scipy import stats<br />
import numpy as np<br />
<br />
class MyModel(LLM):<br />
def loglike(self, params):<br />
"""<br />
Probit log-likelihood<br />
"""<br />
q = 2*self.endog - 1<br />
X = self.exog<br />
return np.add.reduce(stats.norm.logcdf(q*np.dot(X,params)))</div><br /><br /><br />
<br />
Now this model could be fit, using any of the methods that only require an objective function, i.e., Nelder-Mead or Powell.<br /><br /><br />
<br />
<div class="mycode">In [43]: mod = MyModel(data.endog, data.exog)<br />
<br />
In [44]: res = mod.fit(method="nm", maxiter=500)<br />
Optimization terminated successfully.<br />
Current function value: 12.818804<br />
Iterations: 439<br />
Function evaluations: 735<br />
<br />
In [45]: res_nm.params<br />
Out[45]: array([ 1.62580058, 0.05172931, 1.42632242, -7.45229725])<br />
<br />
In [46]: res.params<br />
Out[46]: array([ 1.62580058, 0.05172931, 1.42632242, -7.45229725])</div><br /><br /><br />
<br />
The main drawback right now is that all statistics that rely on the covariance of the parameters, etc. will use numeric gradients and Hessians, which can lessen the accuracy of those statistics. This can be overcome by providing score and hessian methods, as loglike was provided above. Of course, for more complicated likelihood functions this can soon become cumbersome. We are working towards more accurate numerical differentiation and discussing options for automatic or symbolic differentiation.<br /><br /><br />
<br />
The main advantage as opposed to just writing your likelihood and passing it to a solver is that you have all of the (growing number of) statistics and tests available to statsmodels right in the generic model.<br /><br /><br />
<br />
I would also like to accommodate those who are less familiar with OOP and inheritance in Python. I haven't quite worked out the final design for how this would go yet. Right now, you could do the following, though I don't think it quite meets the less complicated goal.<br /><br />
<br />
<div class="mycode">In [4]: from scikits.statsmodels.model import GenericLikelihoodModel as LLM<br />
<br />
In [5]: import scikits.statsmodels as sm<br />
<br />
In [6]: from scipy import stats<br />
<br />
In [7]: import numpy as np<br />
<br />
In [8]: <br />
<br />
In [9]: data = sm.datasets.spector.load()<br />
<br />
In [10]: data.exog = sm.add_constant(data.exog)<br />
<br />
In [11]: <br />
<br />
In [12]: def probitloglike(params, endog, exog):<br />
....: """<br />
....: Log likelihood for the probit<br />
....: """<br />
....: q = 2*endog - 1<br />
....: X = exog<br />
....: return np.add.reduce(stats.norm.logcdf(q*np.dot(X,params)))<br />
....: <br />
<br />
In [13]: mod = LLM(data.endog, data.exog, loglike=probitloglike)<br />
<br />
In [14]: res = mod.fit(method="nm", fargs=(data.endog,data.exog), maxiter=500)<br />
Optimization terminated successfully.<br />
Current function value: 12.818804<br />
Iterations: 439<br />
Function evaluations: 735<br />
<br />
In [15]: res.params<br />
Out[15]: array([ 1.62580058, 0.05172931, 1.42632242, -7.45229725])</div><br /><br /><br />
<br />
There are still a few design issues and bugs that need to be worked out with the last example, but the basic idea is there. That's all for now.jseaboldhttp://www.blogger.com/profile/16419160807917835179noreply@blogger.com4tag:blogger.com,1999:blog-128274497687662608.post-5931209765639074392010-05-31T10:33:00.000-04:002010-05-31T10:33:34.820-04:00Week 1 GSoC UpdateLast week was the first of the Google Summer of Code. I spent most of the week in a Bayesian econometrics class led by <a href="http://www.amazon.com/gp/product/0471679321?ie=UTF8&ref_=ntt_at_ep_dpt_1">John Geweke</a> and studying for a comprehensive exam that I take this week, so progress on statsmodels was rather slow. That said, I have been able to take care of some low hanging fruit.<br />
<br />
There are a few name changes:<br />
<br />
statsmodels/family -> statsmodels/families<br />
statsmodels/lib/io.py -> statsmodels/iolib/foreign.py<br />
<br />
Also Vincent has done a good bit of work on improving our output using the SimpleTable class from <a href="http://code.google.com/p/econpy/">econpy</a>. I will post some examples over the coming weeks, but SimpleTable provides an easy way to make tables in ASCII text, HTML, or LaTeX. The SimpleTable class has been moved<br />
<br />
statsmodels/sandbox/output.py -> statsmodels/iolib/table.py<br />
<br />
Beyond the renames, I have removed the soft dependency on RPy for running our tests in favor of hard-coded results, refactored our tests, and added a few additional ones along the way.<br />
<br />
We are also making an effort to keep our <a href="http://statsmodels.sourceforge.net/">online documentation</a> synced with the current trunk. The biggest change to our documentation is the addition of a <a href="http://statsmodels.sourceforge.net/developernotes.html">developer's page</a> for those who might like to get involved. As always, please report problems with the docs on either the <a href="http://www.scipy.org/Mailing_Lists">scipy-user list</a> or join in the discussions of statsmodels, pandas, larry, and other topics on statistics and Python at the <a href="http://groups.google.ca/group/pystatsmodels">pystatsmodels Google group</a>.jseaboldhttp://www.blogger.com/profile/16419160807917835179noreply@blogger.com1tag:blogger.com,1999:blog-128274497687662608.post-64329066440641825842010-05-01T12:59:00.000-04:002010-05-01T12:59:10.053-04:00Plans for the SummerA quick update on the plans for statsmodels over the next few months.<br />
<br />
I have been accepted for my second Google Summer of Code, which means that we will have a chance to make a big push to get a lot of our work out of the sandbox, tested, and included in the main code base.<br />
<br />
You can see the roadmap on Google's GSoC site <a href="http://socghop.appspot.com/gsoc/student_proposal/show/google/gsoc2010/jseabold/t127082935700">here</a>. You might have to log in to view it. <br />
<br />
The quick version follows. As far as general issues, I will be getting the code ready for Python 3 and focusing on some design issues, including an improved generic maximum likelihood framework, post-estimation testing, variable name handling, and output in text tables, LaTeX, and HTML. I will then be working to get a lot of our code out of the sandbox. This includes time series convenience functions and models such as GARCH, VARMA, the Hodrick-Prescott filter, and a state space model that uses the Kalman filter. I will be polishing the systems of equations framework and panel (longitudinal) data estimators. We have also been working on some nonparametric estimators, including univariate kernel density estimators and kernel regression estimators. Finally, as part of my coursework I have been working toward (generalized) maximum entropy models that I hope to include, as well as some work on the scipy.maxentropy module.<br />
<br />
I will give a quick talk on the project for the SciPy Conference in Austin.<br />
<br />
It looks like we are set to make a good deal of progress on the code this summer.jseaboldhttp://www.blogger.com/profile/16419160807917835179noreply@blogger.com0tag:blogger.com,1999:blog-128274497687662608.post-87061377107681992912010-03-08T19:37:00.003-05:002010-03-08T22:06:55.731-05:00Sparse Least Squares SolverI have a homework doing some monte carlo experiments of an autoregressive process of order 1, and I thought I would use it as an example to demonstrate the sparse least squares solver that Stefan committed to scipy revision 6251.<br />
<br />
All mistakes are mine...<br />
<br />
Given an AR(1) process<br />
<br />
<a href="http://www.codecogs.com/eqnedit.php?latex=y_{t}=.9y_{t-1} @plus; .01\epsilon_{t} \newline \newline \text{where} \newline \newline \epsilon_{t} \sim \boldsymbol{\text{NID} }\left(0,1\right)" target="_blank"><img src="http://latex.codecogs.com/gif.latex?y_{t}=.9y_{t-1} + .01\epsilon_{t} \newline \newline \text{where} \newline \newline \epsilon_{t} \sim \boldsymbol{\text{NID} }\left(0,1\right)" title="y_{t}=.9y_{t-1} + .01\epsilon_{t} \newline \newline \text{where} \newline \newline \epsilon_{t} \sim \boldsymbol{\text{NID} }\left(0,1\right)" /></a><br />
<br />
<br />
We can estimate the following autoregressive coefficients.<br />
<br />
<div class="mycode">In [1]: import numpy as np<br />
<br />
In [2]: np.set_printoptions(threshold=25)<br />
<br />
In [3]: np.random.seed(1)<br />
<br />
In [4]: # make 1000 autoregressive series of length 100<br />
<br />
In [5]: # y_0 = 0 by assumption<br />
<br />
In [6]: samples = np.zeros((100,1000))<br />
<br />
In [7]: for i in range(1,100):<br />
...: error = np.random.randn(1,1000)<br />
...: samples[i] = .9 * samples[i-1] + .01 * error<br />
...: <br />
<br />
In [8]: from scipy import sparse<br />
<br />
In [9]: from scipy.sparse import linalg as spla<br />
<br />
In [10]: # make block diagonal sparse matrix of y_t-1<br />
<br />
In [11]: # recommended to build as a linked list<br />
<br />
In [12]: spX = sparse.lil_matrix((99*1000,1000))<br />
<br />
In [13]: for i in range(1000):<br />
....: spX[i*99:(i+1)*99,i] = samples[:-1,i][:,None]<br />
....: <br />
<br />
In [14]: spX = spX.tocsr() # convert it to csr for performance<br />
<br />
In [15]: # do the least squares<br />
<br />
In [16]: retval = spla.isolve.lsqr(spX, samples[1:,:].ravel('F'), calc_var=True) <br />
In [17]: retval[0]<br />
Out[17]: <br />
array([ 0.88347438, 0.8474124 , 0.85282674, ..., 0.91019165,<br />
0.89698465, 0.76895806])<br />
</div><br />
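As a sanity check on the block-diagonal trick (an addition here, not part of the original homework), the single sparse solve should agree exactly with running a separate univariate no-intercept OLS for each series. A smaller-scale sketch, with sizes shrunk from the post's 1000 series for speed:

```python
import numpy as np
from scipy import sparse
from scipy.sparse import linalg as spla

rng = np.random.default_rng(1)
n_obs, n_series = 100, 50
samples = np.zeros((n_obs, n_series))
for i in range(1, n_obs):
    samples[i] = 0.9 * samples[i - 1] + 0.01 * rng.standard_normal(n_series)

# block-diagonal design matrix: one column of lagged values per series
m = n_obs - 1
spX = sparse.lil_matrix((m * n_series, n_series))
for i in range(n_series):
    spX[i * m:(i + 1) * m, i] = samples[:-1, i][:, None]
spX = spX.tocsr()
coefs_sparse = spla.lsqr(spX, samples[1:, :].ravel("F"))[0]

# the same estimates, one series at a time
coefs_ols = np.array([
    samples[:-1, i] @ samples[1:, i] / (samples[:-1, i] @ samples[:-1, i])
    for i in range(n_series)
])
```

Because the blocks occupy disjoint rows, the columns of the design matrix are orthogonal and the joint solve decomposes into the per-series fits, so the two vectors of coefficients match to numerical precision.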
I'm curious if there's any downside to using sparse least squares whenever the design matrix of a least squares problem can be written in block diagonal form.jseaboldhttp://www.blogger.com/profile/16419160807917835179noreply@blogger.com2tag:blogger.com,1999:blog-128274497687662608.post-49505206528934787212010-02-16T14:23:00.001-05:002010-02-16T14:24:38.969-05:00scikits.statsmodels 0.2.0 releaseWhile I find no time to blog, I thought I'd post our newest release announcement here.<br />
<br />
We are happy to announce the 0.2.0 (beta) release of scikits.statsmodels. This is both a bug-fix and new feature release.<br />
<br />
Download<br />
------------<br />
<br />
You can easy_install (or PyPI URL:<br />
<a href="http://pypi.python.org/pypi/scikits.statsmodels/">http://pypi.python.org/pypi/scikits.statsmodels/</a>)<br />
<br />
Source downloads: <a href="http://sourceforge.net/projects/statsmodels/">http://sourceforge.net/projects/statsmodels/</a><br />
<br />
Development branches: <a href="http://code.launchpad.net/statsmodels">http://code.launchpad.net/statsmodels</a><br />
<br />
Note that the trunk branch on launchpad is almost always stable and has the most up to date changes since our releases are so few and far between.<br />
<br />
Documentation<br />
------------------<br />
<br />
<a href="http://statsmodels.sourceforge.net/">http://statsmodels.sourceforge.net/</a><br />
<br />
We invite you to install, kick the tires, and make bug reports and feature requests.<br />
<br />
Feedback can either be on scipy-user or the mailing list at<br />
<a href="http://groups.google.com/group/pystatsmodels?hl=en">http://groups.google.com/group/pystatsmodels?hl=en</a><br />
Bug tracker: <a href="https://bugs.launchpad.net/statsmodels">https://bugs.launchpad.net/statsmodels</a><br />
<br />
Main Changes in 0.2.0<br />
---------------------------------<br />
<br />
* Improved and expanded documentation with more examples<br />
* Added four discrete choice models: Poisson, Probit, Logit, and Multinomial Logit.<br />
* Added PyDTA. Tools for reading Stata binary datasets (*.dta) and putting them into numpy arrays.<br />
* Added four new datasets for examples and tests.<br />
* Results classes have been refactored to use lazy evaluation.<br />
* Improved support for maximum likelihood estimation.<br />
* bugfixes<br />
* renames for more consistency<br />
RLM.fitted_values -> RLM.fittedvalues<br />
GLMResults.resid_dev -> GLMResults.resid_deviance<br />
<br />
Sandbox<br />
-------------<br />
<br />
We are continuing to work on support for systems of equations models, panel data models, time series analysis, and information and entropy econometrics in the sandbox. This code is often merged into trunk as it becomes more robust.jseaboldhttp://www.blogger.com/profile/16419160807917835179noreply@blogger.com0tag:blogger.com,1999:blog-128274497687662608.post-4390660609869315152009-08-31T21:58:00.003-04:002009-08-31T22:01:23.914-04:00scikits.statsmodels Release AnnouncementWe have been working hard to get a release ready for general consumption for the statsmodels code. Well, we're happy to announce that a (very) beta release is ready.<br />
<br />
Background<br />
==========<br />
<br />
The statsmodels code was started by Jonathan Taylor and was formerly included as part of scipy. It was taken up to be tested, corrected, and extended as part of the Google Summer of Code 2009.<br />
<br />
What it is<br />
==========<br />
<br />
We are now releasing the efforts of the last few months under the scikits namespace as scikits.statsmodels. Statsmodels is a pure python package that requires numpy and scipy. It offers a convenient interface for fitting parameterized statistical models with growing support for displaying univariate and multivariate summary statistics, regression summaries, and (postestimation) statistical tests.<br />
<br />
Main Features<br />
==============<br />
<br />
* regression: Generalized least squares (including weighted least squares and least squares with autoregressive errors), ordinary least squares.<br />
* glm: Generalized linear models with support for all of the one-parameter exponential family distributions.<br />
* rlm: Robust linear models with support for several M-estimators.<br />
* datasets: Datasets to be distributed and used for examples and in testing.<br />
<br />
There is also a sandbox which contains code for generalized additive models (untested), mixed effects models, cox proportional hazards model (both are untested and still dependent on the nipy formula framework), generating descriptive statistics, and printing table output to ascii, latex, and html. None of this code is considered "production ready".<br />
<br />
Where to get it<br />
===============<br />
<br />
Development branches will be on LaunchPad. This is where to go to get the most up to date code in the trunk branch. Experimental code will also be hosted here in different branches.<br />
<br />
<a href="https://code.launchpad.net/statsmodels">https://code.launchpad.net/statsmodels</a><br />
<br />
Source download of stable tags will be on SourceForge.<br />
<br />
<a href="https://sourceforge.net/projects/statsmodels/">https://sourceforge.net/projects/statsmodels/</a><br />
<br />
or<br />
<br />
PyPi: <a href="http://pypi.python.org/pypi/scikits.statsmodels/">http://pypi.python.org/pypi/scikits.statsmodels/</a><br />
<br />
License<br />
=======<br />
<br />
Simplified BSD<br />
<br />
Documentation<br />
=============<br />
<br />
The official documentation is hosted on SourceForge.<br />
<br />
<a href="http://statsmodels.sourceforge.net/">http://statsmodels.sourceforge.net/</a><br />
<br />
The sphinx docs are currently undergoing a lot of work. They are not yet comprehensive, but should get you started.<br />
<br />
This blog will continue to be updated as we make progress on the code.<br />
<br />
Discussion and Development<br />
==========================<br />
<br />
All chatter will take place on the <a href="http://www.scipy.org/Mailing_Lists">scipy-user mailing list</a>. We are very interested in receiving feedback about usability, suggestions for improvements, and bug reports via the mailing list or the bug tracker at <a href="https://bugs.launchpad.net/statsmodels">https://bugs.launchpad.net/statsmodels</a>.jseaboldhttp://www.blogger.com/profile/16419160807917835179noreply@blogger.com3tag:blogger.com,1999:blog-128274497687662608.post-79176515449739929862009-08-27T00:15:00.004-04:002009-08-27T19:59:32.678-04:00GSoC Is OverWhoa, where did the last month go? The Google Summer of Code 2009 officially ended this Monday. Though I haven't taken a breath to update the blog, we (Josef and I) have been hard at work on the models code. <br /><br />We have working and tested versions of Generalized Least Squares, Weighted Least Squares, Ordinary Least Squares, Robust Linear Models with several M-estimators, and Generalized Linear Models with support for all (almost all?) one parameter exponential family distributions. We have also provided some more convenience functions, created a standalone python package for the models code, and obtained permissions to distribute a few more datasets. Due to a lack of time, there is only experimental (read untested) support for autoregressive models, mixed effects models, generalized additive models, and convenience functions for returning strings (possibly html and latex output as well) with regression results and descriptive statistics. I will continue to work on these as I find time.<br /><br />I will soon post a note on the progress that was made in the robust linear models code. Also, look out for a (semi-) official release of the code in the next few days. We have decided to name the project statsmodels and distribute it as a <a href="http://scikits.appspot.com/">scikit</a>. 
We need to finalize the documentation (should be ready to go in the next day or so...I am back taking courses) and clean up some of the usage examples, so people can jump right in and use the code, give feedback, and hopefully contribute extensions and new models. <br /><br />As for the future of statsmodels, we are discussing over the next few weeks the immediate extensions that we know we would like to make. It's looking like I will be wearing my microeconometrician hat this semester in my own coursework. More specifically, I will probably be working with <a href="http://en.wikipedia.org/wiki/Cross-sectional_data">cross-sectional</a> and <a href="http://en.wikipedia.org/wiki/Panel_data">panel data</a> models for household survey data in my own research and finding some time for time series models as part of my teaching assistantship. Josef has also mentioned wanting to work more with time series models. <br /><br />If anyone (especially those from other disciplines) would like to contribute or see some extensions (my apologies to those who have made requests that I haven't yet been able to accommodate) feel free to post to the <a href="http://scipy.org/Mailing_Lists">scipy-dev mailing list</a>. I'm more than happy to discuss/debate with users and potential developers the design decisions that have been made, as I think the code is still in an unsettled enough state to merit some discussion.jseaboldhttp://www.blogger.com/profile/16419160807917835179noreply@blogger.com4tag:blogger.com,1999:blog-128274497687662608.post-37254581089014995972009-07-25T23:22:00.007-04:002009-07-26T15:10:17.075-04:00Iterated Reweighted Least SquaresI have spent the last two weeks putting the "finishing" touches on the generalized linear models and starting to go over the robust linear models (RLM). The test suite for GLM is not complete yet, but all of the exponential families are covered with at least their default link functions tested and are looking good. 
So in an effort to make a first pass over all of the existing code, I moved on to RLM. After the time spent with GLM theory, the RLM theory and code were much more manageable. <br /><br />Before discussing the RLMs, their implementation, and the extensions I have made, I will describe the iterated reweighted least squares (IRLS) algorithm for the GLMs to demonstrate the theory and the solution method in the models code. A very similar iteration is done for the RLMs as well.<br /><br />The main idea of GLM, as noted, is to relate a response variable to a linear model via a link function, which allows us to use least squares regression. Let us take, as an example, the binomial family (which is written to handle Bernoulli and binomial data). In this case, the response variable is Bernoulli: 1 indicates a "success" and 0 a "failure".<br /><br />The default link for the binomial family is the logit link. So we have<br /><br /><a href="http://www.codecogs.com/eqnedit.php?latex=\eta=g(\mu)=\ln\frac{\mu}{1-\mu}" target="_blank"><img src="http://latex.codecogs.com/gif.latex?\eta=g(\mu)=\ln\frac{\mu}{1-\mu}" title="\eta=g(\mu)=\ln\frac{\mu}{1-\mu}" /></a><br /><br /><em>η</em> is our linear predictor and <em>μ</em> is our actual mean response. The first thing that we need for the algorithm is to compute a first guess on <em>μ</em> (IRLS as opposed to Newton-Raphson makes a guess on the mean response rather than the parameter vector <em>β</em>). 
The algorithm is fairly robust to this first guess; however, a common choice is<br /><br /><a href="http://www.codecogs.com/eqnedit.php?latex=\mu=\frac{y@plus;\bar{y}}{2}" target="_blank"><img src="http://latex.codecogs.com/gif.latex?\mu=\frac{y+\bar{y}}{2}" title="\mu=\frac{y+\bar{y}}{2}" /></a><br /><br />For the binomial family, we specifically use<br /><br /><a href="http://www.codecogs.com/eqnedit.php?latex=\mu=\frac{y@plus;.5}{2}" target="_blank"><img src="http://latex.codecogs.com/gif.latex?\mu=\frac{y+.5}{2}" title="\mu=\frac{y+.5}{2}" /></a><br /><br />where <em>y</em> is our given response variable. We then use this first guess to initialize our linear predictor <em>η</em> via the link function given above. With these estimates we are able to start the iteration. Our convergence criterion is based on the deviance function, which is simply twice the log-likelihood ratio of our current guess on the fitted model versus the saturated log-likelihood. <br /><br /><a href="http://www.codecogs.com/eqnedit.php?latex=D=2\phi\left[ \mathcal{L}\left(y;y \right ) - \mathcal{L}\left(y;\mu \right )\right ]" target="_blank"><img src="http://latex.codecogs.com/gif.latex?D=2\phi\left[ \mathcal{L}\left(y;y \right ) - \mathcal{L}\left(y;\mu \right )\right ]" title="D=2\phi\left[ \mathcal{L}\left(y;y \right ) - \mathcal{L}\left(y;\mu \right )\right ]" /></a><br /><br />Where <em>Φ</em> is a dispersion (scale) parameter. Note that the saturated log-likelihood is simply the likelihood of the perfectly fitted model where <em>y</em> = <em>μ</em>. 
For the binomial family the deviance function is<br /> <br /><a href="http://www.codecogs.com/eqnedit.php?latex=\sum_{i}D_{i}=\begin{cases} -2\ln\left(1-\mu_{i}\right) & \text{, if }y_{i}=0\\ -2\ln\mu_{i} & \text{, if }y_{i}=1\end{cases}" target="_blank"><img src="http://latex.codecogs.com/gif.latex?\sum_{i}D_{i}=\begin{cases} -2\ln\left(1-\mu_{i}\right) & \text{, if }y_{i}=0\\ -2\ln\mu_{i} & \text{, if }y_{i}=1\end{cases}" title="\sum_{i}D_{i}=\begin{cases} -2\ln\left(1-\mu_{i}\right) & \text{, if }y_{i}=0\\ -2\ln\mu_{i} & \text{, if }y_{i}=1\end{cases}" /></a><br /><br />The iteration continues while the deviance function evaluated at the updated <em>μ</em> differs from its previous value by more than the given convergence tolerance (default is 1e-08) and the number of iterations is less than the given maximum (default is 100).<br /><br />The actual iterations, as the name of the algorithm suggests, run a weighted least squares fit of the actual regressors on the adjusted (working) linear predictor <em>z</em>, our transformed guess on the response variable. The adjustment is given by<br /><br /><a href="http://www.codecogs.com/eqnedit.php?latex=z=\eta@plus;\frac{\partial\eta}{\partial\mu}\left(y-\mu \right ) \\ \text{Recalling that }\eta=g(\mu)" target="_blank"><img src="http://latex.codecogs.com/gif.latex?z=\eta+\frac{\partial\eta}{\partial\mu}\left(y-\mu \right ) \\ \text{Recalling that }\eta=g(\mu)" title="z=\eta+\frac{\partial\eta}{\partial\mu}\left(y-\mu \right ) \\ \text{Recalling that }\eta=g(\mu)" /></a><br /><br />which moves us closer to the root of the estimating equation (see the previously mentioned Gill or Hardin and Hilbe for the details of root-finding using the Newton-Raphson method; IRLS is simply Newton-Raphson with some simplifications). The only remaining ingredient is the choice of weights for the weighted least squares. 
Similar to other methods, such as generalized least squares (GLS), that correct for heteroskedasticity in the error terms, a diagonal matrix containing estimates of the variance of the error terms is used. However, in our case, the exact form of this variance is unknown and difficult to estimate, but the error of each observation is assumed to vary with the mean response variable. Thus, we improve the weights at each step given our best guess for the mean response variable <em>μ</em> and the known variance function of each family. For the binomial family, as is well known, the variance function is<br /><br /><a href="http://www.codecogs.com/eqnedit.php?latex=V_{i}=\mu(1-\mu)" target="_blank"><img src="http://latex.codecogs.com/gif.latex?V_{i}=\mu(1-\mu)" title="V_{i}=\mu(1-\mu)" /></a><br /><br />Thus we can obtain an estimate of the parameters using weighted least squares. Using these estimates we update <em>η</em><br /><br /><a href="http://www.codecogs.com/eqnedit.php?latex=\eta=X\beta + \left(\text{offset}\right)" target="_blank"><img src="http://latex.codecogs.com/gif.latex?\eta=X\beta + \left(\text{offset}\right)" title="\eta=X\beta + \left(\text{offset}\right)" /></a><br /><br />correcting by an offset if one is specified (the code currently does not support the use of offsets, but this would be a simple extension). Using this linear predictor (remember it was originally given by the transformation <em>η</em> = <em>g</em>(<em>μ</em>)) we update our guess for the mean response variable <em>μ</em> and use this to update our estimate of the deviance function. This continues until convergence is reached.<br /><br />Once the fit is obtained, the covariance matrix is computed as usual, though in the case of GLM it is weighted by an estimate of the scale chosen when the fit method is called (the default is Pearson's Chi-Squared). The standard errors of the estimates are the square roots of the diagonal elements of this matrix.<br /><br />And that's basically it. 
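Putting the pieces together, the whole loop can be sketched in plain NumPy for the canonical logit link. This is only an illustration of the algorithm as described, not the actual models code, and the name irls_logit is made up for the example:

```python
import numpy as np

def deviance(y, mu):
    """Bernoulli deviance; the saturated log-likelihood is zero."""
    return -2 * np.sum(y * np.log(mu) + (1 - y) * np.log(1 - mu))

def irls_logit(y, X, tol=1e-8, maxiter=100):
    """Fit a Bernoulli GLM with the canonical logit link by IRLS.

    For the logit link d(eta)/d(mu) = 1/(mu*(1-mu)) and the variance
    function is V(mu) = mu*(1-mu), so the WLS weights
    1/[(d eta/d mu)**2 * V(mu)] reduce to mu*(1-mu).
    """
    mu = (y + 0.5) / 2.0                 # first guess for the mean
    eta = np.log(mu / (1 - mu))          # eta = g(mu), the logit link
    dev = deviance(y, mu)
    for _ in range(maxiter):
        w = mu * (1 - mu)                # weights for this iteration
        z = eta + (y - mu) / w           # adjusted (working) response
        rw = np.sqrt(w)                  # WLS as OLS on rescaled data
        beta, _, _, _ = np.linalg.lstsq(rw[:, None] * X, rw * z, rcond=None)
        eta = X @ beta                   # update the linear predictor
        mu = 1.0 / (1.0 + np.exp(-eta))  # update the mean via g^{-1}
        newdev = deviance(y, mu)
        if abs(newdev - dev) < tol:      # converged on the deviance
            break
        dev = newdev
    return beta
```

At convergence the score equations X'(y − μ) = 0 hold up to the tolerance, which makes for a quick sanity check on the fit.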
Much of my time with the GLM code was spent getting my head around the theory and then this algorithm. Then I had to obtain data (<a href="http://artsci.wustl.edu/~jgill/Site/Homepage.html">Jeff Gill</a> was kind enough to give permission for SciPy to use and distribute the datasets from his nice monograph) and write tests to ensure that all of the intermediate and final results were correct for each family. This was no small feat considering there are 6 families after I added the negative binomial family and around 25 possible combinations of families and link functions. Figuring out the correct use of the estimated scale (or dispersion) parameters for each family was particularly challenging. As I mentioned, in the interest of time, I haven't written tests for the noncanonical links for each and every family, but the initial results look good and these tests will come.<br /><br />GLM provided a good base for understanding the remaining code and allowed me to more or less plow through the robust linear models estimator. I had some mathematical difficulties in extending the models to include the correct covariance matrices, since there is no strong theoretical consensus on what they should actually be! More on RLM in my next post, but before then I'll just give a quick look at how the GLM estimator is used.<br /><br />First, the algorithm described above has been made flexible enough to estimate truly binomial data. That is, we can have a vector with each row containing (# of successes, # of failures) as is the case in the star98 data from Dr. Gill, described and included in the models.datasets. 
It will be useful to have a look at the syntax for this type of data as it's slightly different from the other families.<br /><br /><div class="mycode"><br />In [1]: import models<br /><br />In [2]: import numpy as np<br /><br />In [3]: from models.datasets.star98.data import load<br /><br />In [4]: data = load()<br /><br />In [5]: data.exog = models.functions.add_constant(data.exog)<br /><br />In [6]: trials = data.endog[:,:2].sum(axis=1)<br /><br />In [7]: data.endog[0] # (successes, failures)<br />Out[7]: array([ 452., 355.])<br /><br />In [8]: trials[0] # total trials<br />Out[8]: 807.0 <br /><br />In [9]: from models.glm import GLMtwo as GLM # the name will change<br /><br />In [10]: bin_model = GLM(data.endog, data.exog, family=models.family.Binomial())<br /><br />In [11]: bin_results = bin_model.fit(data_weights = trials)<br /><br />In [12]: bin_results.params<br />Out[12]:<br />array([ -1.68150366e-02, 9.92547661e-03, -1.87242148e-02,<br /> -1.42385609e-02, 2.54487173e-01, 2.40693664e-01,<br /> 8.04086739e-02, -1.95216050e+00, -3.34086475e-01,<br /> -1.69022168e-01, 4.91670212e-03, -3.57996435e-03,<br /> -1.40765648e-02, -4.00499176e-03, -3.90639579e-03,<br /> 9.17143006e-02, 4.89898381e-02, 8.04073890e-03,<br /> 2.22009503e-04, -2.24924861e-03, 2.95887793e+00])<br /><br /></div><br /><br />The main difference between the above and the rest of the families is that you must specify the total number of "trials" (which as you can see is just the sum of successes and failures for each observation) as the data_weights argument to fit. This was done so that the current implementation of the Bernoulli family could be extended without rewriting its other derived functions. 
The code could easily be (and might be) extended to calculate these trials so that this argument doesn't need to be specified, but it's sometimes better to be explicit.jseaboldhttp://www.blogger.com/profile/16419160807917835179noreply@blogger.com0tag:blogger.com,1999:blog-128274497687662608.post-40215589931924568372009-07-11T18:10:00.002-04:002009-08-10T09:25:34.578-04:00GLM Residuals and The Beauty of Stats with Python + SciPyI just finished including the Anscombe residuals for the families in the generalized linear models. The Anscombe residuals for the Binomial family were particularly tricky. It took me a while to work through the math and then figure out the SciPy syntax for what I need (some docs clarification coming...), but if I had had to implement these functions myself (presumably not in Python), it would have taken me more than a week! I thought it provided a good opportunity to introduce the residuals and to show off how easy things are in Python with NumPy/SciPy.<br /><br />Note that there is not really a unified terminology for residual analysis in GLMs. I will try to use the most common names for these residuals in introducing the basics and point out any deviations from the norm both here and in the SciPy documentation. <br /><br />In general, residuals should be defined such that their distribution is approximately normal. This is achieved through the use of linear equations, transformed linear equations, and deviance contributions. 
Below you will find the four most common types of residuals that rely mainly on transformations and one that relies on deviance contributions, though there are as many as 9+ different types of residuals in the literature.<br /><h2>Response Residuals</h2><br />These are simply the observed response values minus the predicted values.<br /><br /><a href="http://www.codecogs.com/eqnedit.php?latex=r_%7Bi%7D%5E%7BR%7D=y_%7Bi%7D-%5Chat%7B%5Cmu%7D_%7Bi%7D" target="_blank"><img src="http://codecogs.izyba.com/gif.latex?r_%7Bi%7D%5E%7BR%7D=y_%7Bi%7D-%5Chat%7B%5Cmu%7D_%7Bi%7D" title="r_{i}^{R}=y_{i}-\hat{\mu}_{i}" /></a><br /><br />In a classical linear model these are just the familiar residuals<br /><br /><a href="http://www.codecogs.com/eqnedit.php?latex=Y-X%5Cbeta" target="_blank"><img src="http://codecogs.izyba.com/gif.latex?Y-X%5Cbeta" title="Y-X\beta" /></a><br /><br />However, for GLM, these become<br /><br /><a href="http://www.codecogs.com/eqnedit.php?latex=Y-g%5E%7B-1%7D%28X%5Cbeta%29" target="_blank"><img src="http://codecogs.izyba.com/gif.latex?Y-g%5E%7B-1%7D%28X%5Cbeta%29" title="Y-g^{-1}(X\beta)" /></a><br /><br />where <a href="http://www.codecogs.com/eqnedit.php?latex=g%28%5Ccdot%29" target="_blank"><img src="http://codecogs.izyba.com/gif.latex?g%28%5Ccdot%29" title="g(\cdot)" /></a> is the link function that makes our model's linear predictor comparable to the response vector.<br /><br />It is, however, common in GLMs to produce residuals that deviate greatly from the classical assumptions needed in residual analysis, so in addition we have these other residuals, which attempt to mitigate deviations from the needed assumptions.<br /><h2>Pearson Residuals</h2><br />The Pearson residuals are a version of the response residuals, scaled by the standard deviation of the prediction. 
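As a quick numerical illustration (a NumPy sketch with made-up values, using the Poisson variance function VAR[μ] = μ):

```python
import numpy as np

y = np.array([2., 0., 5., 3.])       # observed counts
mu = np.array([1.8, 0.7, 4.1, 3.5])  # hypothetical fitted means

# Poisson case: VAR[mu] = mu, so the scaled residual is (y - mu)/sqrt(mu)
resid_pearson = (y - mu) / np.sqrt(mu)

# summing the squared residuals gives Pearson's chi-squared statistic
chi2 = np.sum(resid_pearson ** 2)
print(chi2)  # about 0.9912
```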
The name comes from the fact that the sum of the squared Pearson residuals for a Poisson GLM is equal to Pearson's <a href="http://www.codecogs.com/eqnedit.php?latex=%5Cchi%5E%7B2%7D" target="_blank"><img src="http://codecogs.izyba.com/gif.latex?%5Cchi%5E%7B2%7D" title="\chi^{2}" /></a> statistic, a goodness of fit measure.<br /><br /><a href="http://www.codecogs.com/eqnedit.php?latex=r_%7Bi%7D%5E%7BP%7D=%5Cfrac%7B%28y_%7Bi%7D-%5Chat%7B%5Cmu_%7Bi%7D%7D%29%7D%7B%5Csqrt%7B%5Ctext%7BVAR%7D%5B%5Chat%7B%5Cmu%7D_%7Bi%7D%5D%7D%7D" target="_blank"><img src="http://codecogs.izyba.com/gif.latex?r_%7Bi%7D%5E%7BP%7D=%5Cfrac%7B%28y_%7Bi%7D-%5Chat%7B%5Cmu_%7Bi%7D%7D%29%7D%7B%5Csqrt%7B%5Ctext%7BVAR%7D%5B%5Chat%7B%5Cmu%7D_%7Bi%7D%5D%7D%7D" title="r_{i}^{P}=\frac{(y_{i}-\hat{\mu_{i}})}{\sqrt{\text{VAR}[\hat{\mu}_{i}]}}" /></a><br /><br />The scaling allows plotting of these residuals versus an individual predictor or the outcome to identify outliers. The problem with these residuals, though, is that while they are asymptotically normally distributed, in practice they can be quite skewed, leading to a misinterpretation of the actual dispersion.<br /><h2>Working Residuals</h2><br />These are the difference between the working response and the linear predictor at convergence (i.e., the last step in the iterative process).<br /><a href="http://www.codecogs.com/eqnedit.php?latex=r_{i}^{W}=\left(y_{i}-\hat{\mu}_{i}\right)\left.\left(\frac{\partial\eta}{\partial\mu}\right)_{i}\right|_{\mu=\hat{\mu}}" target="_blank"><img src="http://codecogs.izyba.com/gif.latex?r_{i}^{W}=\left(y_{i}-\hat{\mu}_{i}\right)\left.\left(\frac{\partial\eta}{\partial\mu}\right)_{i}\right|_{\mu=\hat{\mu}}" title="r_{i}^{W}=\left(y_{i}-\hat{\mu}_{i}\right)\left.\left(\frac{\partial\eta}{\partial\mu}\right)_{i}\right|_{\mu=\hat{\mu}}" /></a><br /><h2>Anscombe Residuals</h2><br />Anscombe (1960, 1961) describes a general transformation <a href="http://www.codecogs.com/eqnedit.php?latex=A%28y%29" target="_blank"><img 
src="http://codecogs.izyba.com/gif.latex?A%28y%29" title="A(y)" /></a> in place of <a href="http://www.codecogs.com/eqnedit.php?latex=y" target="_blank"><img src="http://codecogs.izyba.com/gif.latex?y" title="y" /></a> so that they are as close to a normal distribution as possible. (Note: "There is a <strong>maddeningly</strong> great diversity of the forms that the Anscombe residuals take in the literature." I have included the simplest as described below. (Gill 54, emphasis added)) The function <a href="http://www.codecogs.com/eqnedit.php?latex=A%28%5Ccdot%29" target="_blank"><img src="http://codecogs.izyba.com/gif.latex?A%28%5Ccdot%29" title="A(\cdot)" /></a> is given by<br /><br /><a href="http://www.codecogs.com/eqnedit.php?latex=A%28%5Ccdot%29=%5Cint%5Ctext%7BVAR%7D%5B%5Cmu%5D%5E%7B-%5Cfrac%7B1%7D%7B3%7D%7Dd%5Cmu" target="_blank"><img src="http://codecogs.izyba.com/gif.latex?A%28%5Ccdot%29=%5Cint%5Ctext%7BVAR%7D%5B%5Cmu%5D%5E%7B-%5Cfrac%7B1%7D%7B3%7D%7Dd%5Cmu" title="A(\cdot)=\int\text{VAR}[\mu]^{-\frac{1}{3}}d\mu" /></a><br /><br />This is done for both the response and the predictions. 
This difference is then scaled by dividing by<br /><br /><a href="http://www.codecogs.com/eqnedit.php?latex=A%5E%7B%5Cprime%7D%5Cleft%28%5Chat%7B%5Cmu%7D_%7Bi%7D%20%5Cright%20%29%5Csqrt%7B%5Ctext%7BVAR%7D%5Cleft%28%5Chat%7B%5Cmu%7D_i%20%5Cright%20%29%7D" target="_blank"><img src="http://codecogs.izyba.com/gif.latex?A%5E%7B%5Cprime%7D%5Cleft%28%5Chat%7B%5Cmu%7D_%7Bi%7D%20%5Cright%20%29%5Csqrt%7B%5Ctext%7BVAR%7D%5Cleft%28%5Chat%7B%5Cmu%7D_i%20%5Cright%20%29%7D" title="A^{\prime}\left(\hat{\mu}_{i} \right )\sqrt{\text{VAR}\left(\hat{\mu}_i \right )}" /></a><br /><br />so that the Anscombe Residuals are<br /><br /><a href="http://www.codecogs.com/eqnedit.php?latex=r_{i}^{A}=\frac{A\left(y_{i} \right )-A\left(\hat{\mu}_{i} \right )}{A^{\prime}\left(\hat{\mu}_{i}\right)\sqrt{\text{VAR}\left(\hat{\mu}_{i} \right )}}" target="_blank"><img src="http://codecogs.izyba.com/gif.latex?r_{i}^{A}=\frac{A\left(y_{i} \right )-A\left(\hat{\mu}_{i} \right )}{A^{\prime}\left(\hat{\mu}_{i}\right)\sqrt{\text{VAR}\left(\hat{\mu}_{i} \right )}}" title="r_{i}^{A}=\frac{A\left(y_{i} \right )-A\left(\hat{\mu}_{i} \right )}{A^{\prime}\left(\hat{\mu}_{i}\right)\sqrt{\text{VAR}\left(\hat{\mu}_{i} \right )}}" /></a><br /><br />The Poisson distribution has variance equal to its mean <a href="http://www.codecogs.com/eqnedit.php?latex=%5Cmu" target="_blank"><img src="http://codecogs.izyba.com/gif.latex?%5Cmu" title="\mu" /></a> so that its Anscombe residuals are simply<br /><br /><a href="http://www.codecogs.com/eqnedit.php?latex=%5Cfrac%7B3%5Cleft%28y_%7Bi%7D%5E%7B%5Cfrac%7B2%7D%7B3%7D%7D-%5Chat%7B%5Cmu%7D_i%5E%7B%5Cfrac%7B2%7D%7B3%7D%7D%20%5Cright%20%29%7D%7B2%5Chat%7B%5Cmu%7D_%7Bi%7D%5E%7B%5Cfrac%7B1%7D%7B6%7D%7D%7D" target="_blank"><img src="http://codecogs.izyba.com/gif.latex?%5Cfrac%7B3%5Cleft%28y_%7Bi%7D%5E%7B%5Cfrac%7B2%7D%7B3%7D%7D-%5Chat%7B%5Cmu%7D_i%5E%7B%5Cfrac%7B2%7D%7B3%7D%7D%20%5Cright%20%29%7D%7B2%5Chat%7B%5Cmu%7D_%7Bi%7D%5E%7B%5Cfrac%7B1%7D%7B6%7D%7D%7D" 
title="\frac{3\left(y_{i}^{\frac{2}{3}}-\hat{\mu}_i^{\frac{2}{3}} \right )}{2\hat{\mu}_{i}^{\frac{1}{6}}}" /></a><br /><br />Easy right? Sure was until I ran into the binomial distribution. The Anscombe residuals are built up in a different way for the binomial family. Indeed, the McCullagh and Nelder text does not even provide the general formula for the binomial Anscombe residuals and refers the reader to Anscombe (1953) and Cox & Snell (1968). The problem is that following this transformation for the binomial distribution leads to a rather nasty solution involving the hypergeometric 2F1 function or equivalently the incomplete beta function multiplied by the beta function as shown by Cox and Snell (1968).<br /><br />Gill writes "A partial reason that Anscombe residuals are less popular than other forms is the difficulty in obtaining these tabular values." The tabular values to which he is referring are found in Cox and Snell (1968) p. 260. It is a table of the incomplete beta function that was tabulated numerically for an easy reference. 
How difficult would it be to get this table with NumPy and SciPy?<br /><br /><div class="mycode"><br />import numpy as np<br />from scipy import special<br />betainc = lambda x: special.betainc(2/3.,2/3.,x)<br /><br />table = np.arange(0,.5,.001).reshape(50,10)<br />results = []<br />for i in table:<br /> results.append(betainc(i))<br /><br />results = np.asarray(results).reshape(50,10)<br /></div><br /><br />That's it!<br /><br />Now the Anscombe residuals for the binomial distribution are<br /><a href="http://www.codecogs.com/eqnedit.php?latex=r_%7Bi%7D%5E%7BA%7D=%5Cfrac%7B%5Cleft%28%5Cphi%5Cleft%28%5Cfrac%7BY_i%7D%7Bn_i%7D%20%5Cright%20%29-%5Cphi%7B%5Cleft%28%20%5Chat%7B%5Cmu%7D_%7Bi%7D%5Cright%20%29%7D%20%5Cright%20%29%7D%7B%5Chat%7B%5Cmu%7D_%7Bi%7D%5E%7B%5Cfrac%7B1%7D%7B6%7D%7D%5Cleft%281-%5Chat%7B%5Cmu%7D_%7Bi%7D%20%5Cright%20%29%5E%7B%5Cfrac%7B1%7D%7B6%7D%7D%7D%5C%20%5C%5C%20%5Ctext%7B%20where%20%7D%20%5Cphi%5Cleft%28%5Cmu%5Cright%29=%5Cint_%7B0%7D%5E%7B%5Cmu%7Dt%5E%7B-%5Cfrac%7B1%7D%7B3%7D%7D%5Cleft%281-t%20%5Cright%20%29%5E%7B-%5Cfrac%7B1%7D%7B3%7D%7D=I_%7B%5Cmu%7D%5Cleft%28%5Cfrac%7B2%7D%7B3%7D,%5Cfrac%7B2%7D%7B3%7D%20%5Cright%20%29B%5Cleft%28%5Cfrac%7B2%7D%7B3%7D,%5Cfrac%7B2%7D%7B3%7D%20%5Cright%20%29" target="_blank"><img src="http://codecogs.izyba.com/gif.latex?r_%7Bi%7D%5E%7BA%7D=%5Cfrac%7B%5Cleft%28%5Cphi%5Cleft%28%5Cfrac%7BY_i%7D%7Bn_i%7D%20%5Cright%20%29-%5Cphi%7B%5Cleft%28%20%5Chat%7B%5Cmu%7D_%7Bi%7D%5Cright%20%29%7D%20%5Cright%20%29%7D%7B%5Chat%7B%5Cmu%7D_%7Bi%7D%5E%7B%5Cfrac%7B1%7D%7B6%7D%7D%5Cleft%281-%5Chat%7B%5Cmu%7D_%7Bi%7D%20%5Cright%20%29%5E%7B%5Cfrac%7B1%7D%7B6%7D%7D%7D%5C%20%5C%5C%20%5Ctext%7B%20where%20%7D%20%5Cphi%5Cleft%28%5Cmu%5Cright%29=%5Cint_%7B0%7D%5E%7B%5Cmu%7Dt%5E%7B-%5Cfrac%7B1%7D%7B3%7D%7D%5Cleft%281-t%20%5Cright%20%29%5E%7B-%5Cfrac%7B1%7D%7B3%7D%7D=I_%7B%5Cmu%7D%5Cleft%28%5Cfrac%7B2%7D%7B3%7D,%5Cfrac%7B2%7D%7B3%7D%20%5Cright%20%29B%5Cleft%28%5Cfrac%7B2%7D%7B3%7D,%5Cfrac%7B2%7D%7B3%7D%20%5Cright%20%29" 
title="r_{i}^{A}=\frac{\left(\phi\left(\frac{Y_i}{n_i} \right )-\phi{\left( \hat{\mu}_{i}\right )} \right )}{\hat{\mu}_{i}^{\frac{1}{6}}\left(1-\hat{\mu}_{i} \right )^{\frac{1}{6}}}\ \\ \text{ where } \phi\left(\mu\right)=\int_{0}^{\mu}t^{-\frac{1}{3}}\left(1-t \right )^{-\frac{1}{3}}=I_{\mu}\left(\frac{2}{3},\frac{2}{3} \right )B\left(\frac{2}{3},\frac{2}{3} \right )" /></a><br /><br />Where n is the number of trials for each observation (ie., 1, <a href="http://www.codecogs.com/eqnedit.php?latex=n" target="_blank"><img src="http://codecogs.izyba.com/gif.latex?n" title="n" /></a>, or <a href="http://www.codecogs.com/eqnedit.php?latex=n_%7Bi%7D" target="_blank"><img src="http://codecogs.izyba.com/gif.latex?n_%7Bi%7D" title="n_{i}" /></a>)<br /><br />To implement this in the GLM binomial family class, I defined an intermediate function cox_snell similar to the above. It now looks like<br /><br /><div class="mycode"><br />from scipy import special<br />import numpy as np<br /><br />def resid_anscombe(self, Y, mu):<br /> cox_snell = lambda x: special.betainc(2/3.,2/3.,x)*special.beta(2/3.,2/3.)<br /> return np.sqrt(self.n)*(cox_snell(Y)-cox_snell(mu))/(mu**(1/6.)*(1-mu)**(1/6.))<br /></div><br /><br />We multiply the above formula by np.sqrt(self.n) the square root of the number of trials in order to correct for possible use of proportional outcomes Y and mu.<br /><br />Also, see the ticket <a href="http://projects.scipy.org/scipy/ticket/823">here</a> for a note about the incomplete beta function.<br /><h2>Deviance Residuals</h2><br />One other important type of residual in GLMs is the deviance residual. The deviance residual is the most general and also the most useful of the GLM residuals. The IRLS algorithm (as will be shown in a future post) depends on the convergence of the deviance function. 
The deviance residual then is just the increment to the overall deviance of each observation.<br /><br /><a href="http://www.codecogs.com/eqnedit.php?latex=r_%7Bi%7D%5E%7BD%7D=%5Ctext%7Bsign%7D%5Cleft%28y_i-%5Chat%7B%5Cmu%7D_%7Bi%7D%20%5Cright%20%29%5Csqrt%7B%5Chat%7Bd%7D_i%5E%7B2%7D%7D" target="_blank"><img src="http://codecogs.izyba.com/gif.latex?r_%7Bi%7D%5E%7BD%7D=%5Ctext%7Bsign%7D%5Cleft%28y_i-%5Chat%7B%5Cmu%7D_%7Bi%7D%20%5Cright%20%29%5Csqrt%7B%5Chat%7Bd%7D_i%5E%7B2%7D%7D" title="r_{i}^{D}=\text{sign}\left(y_i-\hat{\mu}_{i} \right )\sqrt{\hat{d}_i^{2}}" /></a><br /><br />where <a href="http://www.codecogs.com/eqnedit.php?latex=%5Chat%7Bd%7D_%7Bi%7D%5E%7B2%7D" target="_blank"><img src="http://codecogs.izyba.com/gif.latex?%5Chat%7Bd%7D_%7Bi%7D%5E%7B2%7D" title="\hat{d}_{i}^{2}" /></a> are defined for each family.<br /><br />Note that most of these residuals also come in variations such as <em>modified</em>, <em>standardized</em>, <em>studentized</em>, and <em>adjusted</em>.<br /><br /><h2>Selected References</h2><br />Anscombe, FJ (1953) "Contribution to the discussion of H. Hotelling's paper."<br /> <em>Journal of the Royal Statistical Society</em>, B, 15, 229-30.<br /><br />Anscombe, FJ (1960) "Rejection of outliers." <em>Technometrics</em>, 2,<br /> 123-47.<br /><br />Anscombe, FJ (1961) "Examination of residuals." <em>Proceedings of<br /> the Fourth Berkeley Symposium on Mathematical Statistics and<br /> Probability.</em> Berkeley: University of California Press.<br /><br />Cox, DR and Snell, EJ (1968). "A generalized definition of<br /> residuals." 
<em>Journal of the Royal Statistical Society</em> B, 30, 248-65.jseaboldhttp://www.blogger.com/profile/16419160807917835179noreply@blogger.com0tag:blogger.com,1999:blog-128274497687662608.post-78177896374263849022009-07-11T13:48:00.009-04:002009-07-11T15:28:34.187-04:00Test PostI am testing out my new ASCIIMathML script, so I can type LaTeX in blogger.<br /><br />Note that to correctly view this page requires Internet Explorer 6 + MathPlayer or Firefox/Mozilla/Netscape (with MathML fonts). So the equations won't work in a feed reader.<br /><br />$A(\cdot)=\int\,$VAR$\left[\mu\right]^{-\frac{1}{3}}\, d\mu$<br /><br />Well it appears to work pretty well, except the \text{} tag doesn't work. Hmm, though it should.<br /><br />This works.<br />amath <br />` text(TEXT) `<br />endamath<br /><br /><br />This doesn't work.<br />$\text{TEXT}$<br /><br />You can get the ASCIIMathML script <a href="http://www1.chapman.edu/~jipsen/mathml/asciimath.html">here</a> and follow <a href="http://pleasemakeanote.blogspot.com/2008/09/how-to-post-math-equations-in-blogger.html">these directions</a> to add a javascript widget.<br /><br />Another possible solution for typing equations in Blogger relies on typing LaTeX equations into a remote site. This didn't really appeal to me, as I don't foresee needing any overly complicated mathematical formulae, though it seems like a perfectly workable solution. 
You can read about it <a href="http://pleasemakeanote.blogspot.com/2008/09/how-to-post-math-equations-in-blogger_06.html">here</a>.<br />[Edit: I might switch to this approach, since the equations don't display if you don't have the above mentioned requirements or are in a feed reader.]<br /><br />Hat tip to <a href="http://pleasemakeanote.blogspot.com/">Please Make a Note</a> for the pointers.jseaboldhttp://www.blogger.com/profile/16419160807917835179noreply@blogger.com0tag:blogger.com,1999:blog-128274497687662608.post-34265307225958044432009-07-02T12:04:00.002-04:002009-07-02T12:31:08.499-04:00Generalized Linear ModelsAs I have mentioned, I have spent the last few weeks both in stats books, finding my way around <a href="http://www.r-project.org/">R</a>, and cleaning up and refactoring the code for the generalized linear models in the NiPy models code. I have recently hit a wall in this code, so I am trying to clear out some unposted blog drafts. I intended for this post to introduce the generalized linear models approach to estimation; however, the full post will have to wait. For now, I will give an introduction to the theory and then explain where I am with the code. <br /><br />Generalized linear models was a topic that was completely foreign to me a few weeks ago, but after a little (okay, a lot of) reading the approach seems almost natural. I have found the following references useful:<br /><ul class="disc"><br /><li>Jeff Gill's <em>Generalized Linear Models: A Unified Approach</em>.</li><br /><li>James Hardin and Joseph Hilbe's <em>Generalized Linear Models and Extensions</em>, 2nd edition.</li><br /><li>P. McCullagh and John Nelder's <em>Generalized Linear Models</em>, 2nd edition.</li><br /></ul><br />The basic point of the generalized linear model is to extend the approach taken in classical linear regression to models that have more complex outcomes but ultimately share the linearity property. 
In this respect, GLM subsumes classical linear regression, probit and logit analysis, loglinear and multinomial response models, and some models that deal with survival data, to name a few.<br /><br />In my experience, I have found that econometrics is taught in a compartmentalized manner. This makes sense to a certain extent, as different estimators are tailored to particular problems and data. GLM, on the other hand, allows the use of a common technique for obtaining parameter estimates, so that it can be studied as a single technique rather than as a collection of distinct approaches.<br /><br />If interested in my ramblings, you can find a draft of my notes as an introduction to GLM <a href="http://eagle1.american.edu/~js2796a/GLMTheoryFirstSection.pdf">here</a>, as Blogger does not support LaTeX... Please note that this is a preliminary and incomplete draft (corrections and clarifications are very welcome). One thing it could definitely use is some clarification by example. However, as I noted, I have run into a bit of a wall trying to extend the binomial family to accept a vector of proportional data, and this is my intended example to walk through the theory and algorithm, so... a subsequent post will have to lay this out once I've got it sorted myself.<br /><br />Generally speaking, there are two basic algorithms for GLM estimation: one is a maximum likelihood optimization based on Newton's method; the other is commonly referred to as iteratively (re)weighted least squares (IRLS or IWLS). Our implementation now only covers IRLS. As will be shown, the algorithm itself is pretty simple. It boils down to regressing the transformed (and updated) outcome variable on the untransformed design matrix, weighted by the variance of the transformed observations. This is done until we have convergence of the deviance function (twice the log-likelihood ratio of the current and previous estimates). 
The problem that I am running into with updating the binomial family to accept proportional data (ie., a vector of pairs (successes, total trials) instead of a vector of 1s and 0s for success or failure) is more mathematical than computational. I have either calculated the variance (and therefore the weights) incorrectly, or I am updating the outcome variable incorrectly. Of course, there's always the remote possibility that my data is not well behaved, but I don't think this is the case here.<br /><br />More to come...jseaboldhttp://www.blogger.com/profile/16419160807917835179noreply@blogger.com0tag:blogger.com,1999:blog-128274497687662608.post-62013802692437511312009-07-02T01:28:00.006-04:002009-07-02T10:45:23.872-04:00Project StatusI have been making slow but steady progress on the NiPy models code. Right now for the midterm review, we have been focusing on design issues including the user interface and refactoring, test coverage/bug fixing, and some extensions for postestimation statistics. Other than this, I have spent the last month or so with anywhere from ten to fifteen stats, econometrics, or numerical linear algebra and optimization texts open on my desk.<br /><br />The main estimators currently included in the code are generalized least squares, ordinary least squares, weighted least squares, autoregressive AR(p), generalized linear models (with several available distribution families and corresponding link functions), robust linear models, general additive models, and mixed effects models. The test coverage is starting to look pretty good, then there is just squashing the few remaining bugs and improving the postestimation statistics.<br /><br />Some enhancements have also been made to the code. I have started to include some public domain or appropriately copyrighted datasets for testing purposes that could also be useful for examples and tutorials, so that every usage example doesn't have to start with generating your own random data. 
I have followed pretty closely the datasets proposal in the <a href="http://www.scipy.org/scipy/scikits/browser/trunk/learn/scikits/learn/datasets">Scikits Learn</a> package. <br /><br />We have also decided to break from the formula framework that is used in NiPy. It was in flux (being changed to take advantage of <a href="http://code.google.com/p/sympy/">SymPy</a> the last I heard) and is intended to be somewhat similar to the formula framework in R. In its place for now, I have written some convenience functions to append a constant to a design matrix or to handle categorical variables for estimation. For the moment, a typical model/estimator is used as<br /><br /><div class="mycode"><br />In [1]: from models.regression import OLS<br /><br />In [2]: from models.datasets.longley.data import load<br /><br />In [3]: from models.functions import add_constant<br /><br />In [4]: data = load()<br /><br />In [5]: data.exog = add_constant(data.exog)<br /><br />In [6]: model = OLS(data.endog, data.exog)<br /><br />In [7]: results = model.fit()<br /><br />In [8]: results.params<br />Out[8]:<br />array([ 1.50618723e+01, -3.58191793e-02, -2.02022980e+00,<br /> -1.03322687e+00, -5.11041057e-02, 1.82915146e+03,<br /> -3.48225863e+06])<br /></div><br /><br />Barring any unforeseen difficulties, the models code <em>should</em> be available as a standalone package shortly after the midterm evaluation, which is rapidly approaching in ten days. 
The second half of the summer will then be focused on optimizing the code, finalizing design issues, extending the models, and writing good documentation and tutorials so that the code can be included in SciPy!jseaboldhttp://www.blogger.com/profile/16419160807917835179noreply@blogger.com0tag:blogger.com,1999:blog-128274497687662608.post-82548484016847364852009-07-01T09:34:00.005-04:002009-07-01T21:03:41.760-04:00Econometrics with Python<blockquote cite="http://uninsubria.academia.edu/documents/0023/6072/Choirat-Seri-JAE-2009.pdf">There is as yet no equivalent of R in applied econometrics. Therefore, the econometric community can still decide to go along the Python path.<br /></blockquote><br /><br />That is Drs. Christine Choirat and Raffaello Seri writing in the April issue of the <em>Journal of Applied Econometrics</em>. They have been kind enough to provide me with an <a href="http://uninsubria.academia.edu/documents/0023/6072/Choirat-Seri-JAE-2009.pdf">ungated copy</a> of their review, "Econometrics with Python." Mentioning the, quite frankly, redundant general programming functions and tools that had to be implemented for <a href="http://www.r-project.org/">R</a>, the authors make a nice case for Python as the programming language of choice for applied econometrics. The article provides a quick overview of some of the advantages of using Python and its many built-in libraries, extensions, and tools, gives some speed comparisons, and also mentions a few of the many tools out there in the Python community for econometrics including <a href="http://rpy.sourceforge.net/">RPy</a> (RPy2 is now available), and of course NumPy and SciPy. Having spent the last week or more trying to master the basic syntax and usage of R, I very much sympathize with this position. The one complaint I hear most often from my fellow students is that Python is not an industry standard. 
I hope this can change and is changing, because it's much more of a pleasure to work with Python than the alternatives, and that makes for increased productivity.jseaboldhttp://www.blogger.com/profile/16419160807917835179noreply@blogger.com8tag:blogger.com,1999:blog-128274497687662608.post-41100906581713717092009-06-15T21:09:00.002-04:002009-06-15T21:30:46.235-04:00Legendre on Least SquaresI found the epigraph to Åke Björck's <span style="font-style:italic;">Numerical Methods for Least Squares Problems</span> to be a nice intersection of my interests.<br /><blockquote cite="http://books.google.com/books?hl=en&lr=&id=PRcOAAAAQAAJ&oi=fnd&pg=PA2&ots=cIayVws04m&sig=Qx53POQcYhjCjwWKwCsP20vRJSc#PPR1,M1"><br />De tous les principes qu'on peut proposer pour cet objet, je pense qu'il n'en est pas de plus général, de plus exact, ni d'une application plus facile que celui qui consiste à rendre <span style="font-style:italic;">minimum</span> la somme de carrés des erreurs.</blockquote><br /><blockquote cite="">Of all the principles that can be proposed, I think there is none more general, more exact, or of an easier application than that which consists of rendering the sum of squared errors a minimum.</blockquote><br /> <span style="font-style:italic;">Adrien Marie Legendre, Nouvelles méthodes pour la détermination des orbites des comètes. Appendice. Paris, 1805.</span>jseaboldhttp://www.blogger.com/profile/16419160807917835179noreply@blogger.com0tag:blogger.com,1999:blog-128274497687662608.post-37928393992570558022009-06-12T12:33:00.013-04:002009-06-15T21:34:03.980-04:00Design Issues: Understanding Python's superThe current statistical models package is housed in the <a href="http://neuroimaging.scipy.org/site/index.html">NiPy, Neuroimaging in Python,</a> project. Right now, it is designed to rely on Python's built-in super to handle class inheritance. 
This post will dig a little more into the super function and what it means for the design of the project and future extensions. Note that there are plenty of good places to learn about super and that this post is to help me as much as anyone else. [*Edit: With this in mind, I direct you to <a href="http://www.artima.com/weblogs/viewpost.jsp?thread=236275">Things to Know about Python Super</a> if you really want a deeper and <span style="font-style:italic;">correct</span> understanding of super. This post is mainly a heuristic approach that has helped me in understanding basic usage of super.] You can find the documentation for super <a href="http://docs.python.org/library/functions.html#super">here</a>. If this is a bit confusing, it will, I hope, become clearer after I demonstrate the usage.<br /><br />First, let's take a look at how super actually works for the simple case of single inheritance (right now, we are not planning on using multiple inheritance in the project) and an __init__ chain (note that super can call any of its parent class's methods, but using __init__ is my current use case).<br /><br />The following examples were adapted from some code provided by mentors (thank you!).<br /><br /><div class="mycode"><br />class A(object):<br /> def __init__(self, a):<br /> self.a = a<br /> print 'executing A().__init__'<br /><br />class B(A):<br /> def __init__(self, a):<br /> self.ab = a*2<br /> print 'executing B().__init__'<br /> super(B,self).__init__(a)<br /><br /><br />class C(B):<br /> def __init__(self, a):<br /> self.ac = a*3<br /> print 'executing C().__init__'<br /> super(C,self).__init__(a)<br /></div><br /><br />Now let's have a look at creating an instance of C.<br /><br /><div class="mycode"><br />In [2]: cinst = C(10)<br />executing C().__init__<br />executing B().__init__<br />executing A().__init__<br /><br />In [3]: vars(cinst)<br />Out[3]: {'a': 10, 'ab': 20, 'ac': 30}<br /><br /></div><br /><br />That seems simple enough. 
Creating an instance of C with a = 10 will also give cinst the attributes of B(10) and A(10). This means our one instance of C has three attributes: cinst.ac, cinst.ab, cinst.a. The latter two were created by its parent classes' (or superclasses') __init__ methods. Note that A is also a new-style class. It subclasses the 'object' type.<br /><br />The actual calls to super pass the class 'C' and a handle to the instance 'self', which is 'cinst' in our case. Super returns a proxy that dispatches to the next class after C in the MRO (for single inheritance, simply the literal parent), and because we passed 'self', the methods it reaches are bound to cinst. It should be noted that the A.__init__ and B.__init__ calls run against our actual instance of class C; no separate instances of class A or class B are created. <br /><br />Okay, now let's define a few more classes to look briefly at multiple inheritance.<br /><br /><div class="mycode"><br />class D(A):<br /> def __init__(self, a):<br /> self.ad = a*4<br /> print 'executing D().__init__'<br /> # if super is commented out then __init__ chain ends<br /> #super(D,self).__init__(a)<br /><br /><br />class E(D):<br /> def __init__(self, a):<br /> self.ae = a*5<br /> print 'executing E().__init__'<br /> super(E,self).__init__(a)<br /></div><br /><br />Note that the call to super in D is commented out. This breaks the __init__ chain.<br /><br /><div class="mycode"><br />In [4]: einst = E(10)<br />executing E().__init__<br />executing D().__init__<br /><br />In [5]: vars(einst)<br />Out[5]: {'ad': 40, 'ae': 50}<br /></div><br /><br />If we uncomment the super in D, we get what we would expect<br /><br /><div class="mycode"><br />In [6]: einst = E(10)<br />executing E().__init__<br />executing D().__init__<br />executing A().__init__<br /><br />In [7]: vars(einst)<br />Out[7]: {'a': 10, 'ad': 40, 'ae': 50}<br /></div><br /><br />Ok that's pretty straightforward. 
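To see that super really follows the MRO rather than a hard-coded parent, the same A/B/C chain can be written out and checked. This is a minimal sketch in Python 3 syntax (print() calls and the zero-argument form of super; the code above uses the Python 2 spelling):

```python
# The A/B/C __init__ chain from above, as a Python 3 sketch.
class A:
    def __init__(self, a):
        self.a = a

class B(A):
    def __init__(self, a):
        self.ab = a * 2
        super().__init__(a)  # zero-argument form of super(B, self)

class C(B):
    def __init__(self, a):
        self.ac = a * 3
        super().__init__(a)

# super() dispatches along the class's method resolution order; for
# single inheritance this is just the parent-class chain.
print([cls.__name__ for cls in C.__mro__])  # ['C', 'B', 'A', 'object']

cinst = C(10)
print(vars(cinst))  # {'ac': 30, 'ab': 20, 'a': 10}
```

Dropping the super call in B here would break the chain exactly as commenting out D's super does above: A.__init__ would never run and cinst.a would not exist.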
In this way super is used to pass off something to its parent class. For instance, say we have a slightly more realistic example in which the instance of C takes some time series data that exhibits serial correlation. Then we can have C correct for the covariance structure of the data and "pass it up" to B, where B can then perform OLS on our data now that it meets the assumptions of OLS. Further, B can pass this data to A and return some descriptive statistics for our data. But remember these are 'bound' methods, so all of the results are attributes of our original instance of C. Neat huh? Okay, now let's look at a pretty simple example of multiple inheritance.<br /><br /><div class="mycode"><br />class F(C,E):<br /> def __init__(self, a):<br /> self.af = a*6<br /> print 'executing F().__init__'<br /> super(F,self).__init__(a)<br /></div><br /><br />For this example we are using the version of class D that has super commented out.<br /><br /><div class="mycode"><br />In [8]: finst = F(10)<br />executing F().__init__<br />executing C().__init__<br />executing B().__init__<br />executing E().__init__<br />executing D().__init__<br /><br />In [8]: vars(finst)<br />Out[8]: {'ab': 20, 'ac': 30, 'ad': 40, 'ae': 50, 'af': 60}<br /></div><br /><br />The first time I saw this, it gave me pause. Why isn't there an finst.a? I was expecting the MRO to go F -> C -> B -> A -> E -> D -> A. Let's take a closer look. The F class has multiple inheritance. It inherits from both C and E. We can see F's method resolution order by doing<br /><br /><div class="mycode"><br />In [9]: F.__mro__<br />Out[9]: <br />(<class '__main__.F'>,<br /> <class '__main__.C'>,<br /> <class '__main__.B'>,<br /> <class '__main__.E'>,<br /> <class '__main__.D'>,<br /> <class '__main__.A'>,<br /> <type 'object'>) <br /></div><br /><br />Okay, so we can see that in F's MRO, A comes after D rather than after B. 
But why?<br /><br /><div class="mycode"><br />In [10]: A.__subclasses__()<br />Out[10]: [<class '__main__.B'>, <class '__main__.D'>]<br /></div> <br /><br />The reason is that A does not have a call to super, so the chain doesn't exist here. When you instantiate F, the hierarchy goes F -> C -> B -> E -> D -> A. The reason that it goes from B -> E is because A does not have a call to super, so it can't pass anything to E (It couldn't pass anything to E because the object.__init__ doesn't take a parameter "a" and because you cannot have a MRO F -> C -> B -> A -> E -> D -> A as this is inconsistent and will give an error!), so A does not cause a problem and the chain ends after D (remember that D's super is commented out, but if it were not then there would be finst.a = 10 as expected). Whew. <br /><br />I'm sure you're thinking "Oh that's (relatively) easy. I'm ready to go crazy with super." But there are a number of things you must keep in mind when using super, which makes it necessary for the users of super to proceed carefully.<br /><br />1. super() only works with <a href="http://docs.python.org/glossary.html#term-new-style-class">new-style classes</a>. You can read more about classic/old-style vs new-style classes <a href="http://docs.python.org/reference/datamodel.html#newstyle">here</a>. From there you can click through or just go <a href="http://www.python.org/doc/newstyle/">here for more information on new-style classes</a>. Therefore, you must know that the base classes are new-style. This isn't a problem for our project right now, because I have access to all of the base classes.<br /><br />2. Subclasses must use super if their superclasses do. This is why the use of super must be well-documented. If we have two classes A and B that both use super and a class C that inherits from them, but does not know about super, then we will have a problem. 
Consider the slightly different case<br /><br /><div class="mycode"><br />class A(object):<br /> def __init__(self):<br /> print "executing A().__init__"<br /> super(A, self).__init__()<br /> <br />class B(object):<br /> def __init__(self):<br /> print "executing B().__init__"<br /> super(B, self).__init__()<br /><br />class C(A,B):<br /> def __init__(self):<br /> print "executing C().__init__"<br /> A.__init__(self)<br /> B.__init__(self)<br /># super(C, self).__init__()<br /></div><br /><br />Say class C was defined by someone who couldn't see classes A and B; then they wouldn't know about super. Now if we do<br /><br /><div class="mycode"><br />In [11]: C.__mro__<br />Out[11]:<br />(<class '__main__.C'>,<br /> <class '__main__.A'>,<br /> <class '__main__.B'>,<br /> <type 'object'>)<br /><br />In [12]: c = C()<br />executing C().__init__<br />executing A().__init__<br />executing B().__init__<br />executing B().__init__<br /></div><br /><br />B got called twice, but by now this should be expected. A's super calls __init__ on the next object in the MRO, which is B (it works this time, unlike above, because there is no parameter passed with __init__), then C explicitly calls B again.<br /><br />If we uncomment super and comment out the calls to the parent __init__ methods in C, then this works as expected.<br /><br />3. Superclasses probably should use super if their subclasses do.<br /><br />We saw this earlier with class D's super call commented out. Note also that A does not have a call to super. The last class in the MRO does not need super *if* there is only one such class at the end.<br /><br />4. Methods in the chain must have the exact same call signature.<br /><br />This should be obvious but is important for people to be able to subclass. It is possible, however, for subclasses to add additional arguments, so *args and **kwargs should probably always be included in the methods that are accessible to subclasses.<br /><br />5. 
Because of these last three points, the use of super must be explicitly documented, as it has become a part of the interface to our classes.jseaboldhttp://www.blogger.com/profile/16419160807917835179noreply@blogger.com0tag:blogger.com,1999:blog-128274497687662608.post-90781270966408870352009-05-09T22:26:00.015-04:002009-05-17T17:15:00.306-04:00Working with Data: Record ArraysA few posts are going to be directed at people who are not as familiar with Python and/or SciPy. I will try to assume as little as possible about what the user knows. Over the summer these types of posts might be put together as a larger tutorial.<br /><br />One of the goals of the SciPy stats project is to provide a transparent way to work with the different array types. Towards this end, we are going to work with some data as record arrays (note: right now this is only supported for files containing ASCII characters. If this is not the case, you must do some data cleaning beforehand with Python).<br /><br />We have four data files: a comma-delimited .csv file with variable labels (headers) with all numeric data (<a href="http://eagle1.american.edu/%7Ejs2796a/data/educ_data.csv">here</a>, <a href="http://eagle1.american.edu/%7Ejs2796a/data/CPS_info.txt">info here</a>), a comma-delimited file with headers with a mix of numeric and string variables (<a href="http://eagle1.american.edu/%7Ejs2796a/data/handguns_data.csv">here</a>, <a href="http://eagle1.american.edu/%7Ejs2796a/data/handguns_info.txt">info here</a>), this same file without headers (<a href="http://eagle1.american.edu/%7Ejs2796a/data/handguns_data_noheaders.csv">here</a>), and this same file without headers that is tab delimited (<a href="http://eagle1.american.edu/%7Ejs2796a/data/handguns_data_tab_noheaders.csv">here</a>).<br /><br />First, we are going to put the data into record arrays. Record arrays are simply structured arrays that allow access to the data through attribute access. 
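Before reading any files, it may help to see the same idea on a tiny array built by hand. This is an illustrative sketch (the field names mimic the educ data used below; the values are made up):

```python
import numpy as np

# A structured array: each element is a record with named fields.
arr = np.array([(12.5, 1), (6.25, 0), (17.31, 1)],
               dtype=[('ahe', 'f8'), ('female', 'i8')])

# Dictionary-style field access works on any structured array.
print(arr['ahe'])

# Viewing it as a recarray additionally enables attribute access,
# which is what recfromcsv and friends return.
rec = arr.view(np.recarray)
print(rec.ahe)            # same data, attribute-style
print(rec.female.sum())   # fields are ordinary ndarrays
```

The recarray view shares memory with the structured array, so the attribute access is purely a convenience layer on top of the same data.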
If you are interested, you can have a look <a href="http://docs.scipy.org/doc/numpy/reference/arrays.dtypes.html">here</a> for a bit more about the basic data types in NumPy. Then see <a href="http://www.scipy.org/Cookbook/Recarray">here</a> for more on structured and record arrays (and why access to record arrays is slower) and <a href="http://docs.scipy.org/doc/numpy/user/basics.rec.html">here</a> for more details on creating and working with record arrays.<br /><br />First download the data files and put them in a directory. I am using Linux, so my directory paths will look a little different than those using Windows, but all of the other details should be the same.<br /><br /><div class="mycode"><br />>>>import numpy as np<br />>>>educ_dta=np.recfromcsv('/home/skipper/scipystats/scipystats/data/educ_data.csv')<br /></div><br /><br />We now have our data in a record array. We can take a closer look at what's going on by typing<br /><br /><div class="mycode"><br />>>>educ_dta.dtype<br />dtype([('ahe', '<f8'), ('female', '<i8'), ('ne', '<i8'), ('midwest', '<i8'), ('south', '<i8'), ('west', '<i8'), ('race', '<i8'), ('yrseduc', '<i8'), ('ba', '<i8'), ('hsdipl', '<i8'), ('age', '<i8')])<br />>>>educ_dta.dtype.names<br />('ahe', 'female', 'ne', 'midwest', 'south', 'west', 'race', 'yrseduc', 'ba', 'hsdipl', 'age')<br />>>>educ_dta.ahe<br />array([ 12.5 , 6.25, 17.31, ..., 9.13, 11.11, 14.9 ])<br />>>>educ_dta['ahe']<br />array([ 12.5 , 6.25, 17.31, ..., 9.13, 11.11, 14.9 ])<br /></div><br /><br />We can do the same for each of our other data files. First we have another dataset that contains a string variable for the state name. We proceed just as above, though the string variables will have to be handled differently when used in a statistical model. Then we have the same file, but without any headers information. 
Last we have a file without headers that is tab-delimited.<br /><br /><div class="mycode"><br />>>>gun_dta=np.recfromcsv('/home/skipper/scipystats/scipystats/data/handguns_data.csv')<br />>>>gun_dta_nh=np.recfromcsv('/home/skipper/scipystats/scipystats/data/handguns_data_noheaders.csv', names=None)<br />>>>gun_dta_tnh=np.recfromtxt('/home/skipper/scipystats/scipystats/data/handguns_data_tab_noheaders.csv', names=None, delimiter='\t')<br /></div><br /><br />You can learn more about the possibilities for loading data into arrays <a href="http://docs.scipy.org/numpy/docs/numpy.lib.io.genfromtxt/">here</a>, or by having a look at the doc string for np.genfromtxt in the Python interpreter. The functions recfromcsv, recfromtxt, etc. are all wrappers around genfromtxt; they just have different default values for the keyword arguments.jseaboldhttp://www.blogger.com/profile/16419160807917835179noreply@blogger.com2tag:blogger.com,1999:blog-128274497687662608.post-13920299944846770362009-04-24T09:35:00.005-04:002009-04-26T21:30:12.208-04:00Getting Started with GSoC and SciPyI'm trying to resist making the obligatory "hello world" post here. The best I can do is only mentioning the urge.<br /><br />First, a little bit about myself. My name is Skipper Seabold. I am finishing my first year as a PhD student in economics at American University in Washington, DC, and I have recently been accepted to the <a href="http://code.google.com/soc/">Google Summer of Code 2009</a> to work on the SciPy project. I have been a computer hardware and programming hobbyist since my middle school days. I have built my own computers my whole life and back in high school tinkered around with Visual Basic (Apps for AOL 3.0 on Windows 3.x and Windows 95 anyone?), Turbo Pascal, C++, and Java, mostly in the context of coursework. Two years ago I was introduced to the <a href="http://www.python.org/">Python programming language</a>, and I haven't looked back. 
Needless to say I'm very happy to have two of my interests, economics and programming, overlap.<br /><br />This is where SciPy comes in. For those who are unfamiliar with SciPy, I direct you to the homepage <a href="http://www.scipy.org/">here</a>. In short, SciPy is an open source library of algorithms for numerical analysis for those working in engineering or the sciences more broadly defined. The SciPy library depends on NumPy. <a href="http://www.scipy.org/Tentative_NumPy_Tutorial">The Tentative NumPy Tutorial</a> is a good place to start learning about the capabilities of NumPy. And likewise, the <a href="http://www.scipy.org/Getting_Started">Getting Started page</a> has plenty of resources to introduce you to the power of SciPy. In particular the tutorials, documentation, and cookbook are good to look at.<br /><br />What I will be working on this summer is providing a consistent user interface for statistical models and appropriate statistical tests in SciPy similar to those found in other statistics/econometric software packages. I will also provide a unified development framework for those who would like to add to this effort in the future. Updates may be less regular over the next few weeks, but check here for at least weekly updates on the work over the summer.jseaboldhttp://www.blogger.com/profile/16419160807917835179noreply@blogger.com3