Monday, June 15, 2009

Legendre on Least Squares

I found the epigraph to Åke Björk's Numerical Methods for Least Squares Problems to be a nice intersection of my interests.

De tous les principes qu'on peut proposer pour cet object, je pense qu'il n'en est pas de plus general, de plus exact, ni d'une application plus facile que celui qui consiste à rendre minimum la somme de carrés des erreurs.

Of all the principles that can be proposed, I think there is none more general, more exact, or of an easier application than that which consists of rendering the sum of squared errors a minimum.

Adrien Marie Legendre, Nouvelles méthodes pour la détermination des orbites des comètes. Appendice. Paris, 1805.

Friday, June 12, 2009

Design Issues: Understanding Python's super

The current statistical models package is housed in the NiPy, Neuroimaging in Python, project. Right now, it is designed to rely on Python's built-in super to handle class inheritance. This post will dig a little more into the super function and what it means for the design of the project and future extensions. Note that there are plenty of good places to learn about super and that this post is to help me as much as anyone else. [*Edit: With this in mind, I direct you to Things to Know about Python Super if you really want a deeper and correct understanding of super. This post is mainly a heuristic approach that has helped me in understanding basic usage of super.] You can find the documentation for super here. If this is a bit confusing, it will, I hope, become clearer after I demonstrate the usage.

First, let's take a look at how super actually works for the simple case of single inheritance (right now, we are not planning on using multiple inheritance in the project) and an __init__ chain (note that super can call any of its parent class's methods, but using __init__ is my current use case).

The following examples were adapted from some code provided by mentors (thank you!).

class A(object):
def __init__(self, a):
self.a = a
print 'executing A().__init__'

class B(A):
def __init__(self, a):
self.ab = a*2
print 'executing B().__init__'

class C(B):
def __init__(self, a): = a*3
print 'executing C().__init__'

Now let's have a look at creating an instance of C.

In [2]: cinst = C(10)
executing C().__init__
executing B().__init__
executing A().__init__

In [3]: vars(cinst)
Out[3]: {'a': 10, 'ab': 20, 'ac': 30}

That seems simple enough. Creating an instance of C with a = 10 will also give C the attributes of B(10) and A(10). This means our one instance of C has three attributes:, cinst.ab, cinst.a. The latter two were created by its parent classes (or superclasses) __init__ method. Note that A is also a new-style class. It subclasses the 'object' type.

The actual calls to super pass the generic class 'C' and a handle to that class 'self', which is 'cinst' in our case. Super returns the literal parent of the class instance C since we passed it 'self'. It should be noted that A and B were created when we initialized cinst and are, therefore, 'bound' class objects (bound to cinst in memory through the actual instance of class C) and not referring to the class A and class B instructions defined at the interpreter (assuming you are typing along at the interpreter).

Okay now let's define a few more classes to look briefly at multiple inheritance.

class D(A):
def __init__(self, a): = a*4
print 'executing D().__init__'
# if super is commented out then __init__ chain ends

class E(D):
def __init__(self, a): = a*5
print 'executing E().__init__'

Note that the call to super in D is commented out. This breaks the __init__ chain.

In [4]: einst = E(10)
executing E().__init__
executing D().__init__

In [5]: vars(einst)
Out[5]: {'ad': 40, 'ae': 50}

If we uncomment the super in D, we get as we would expect

In [6]: einst = E(10)
executing E().__init__
executing D().__init__
executing A().__init__

In [7]: vars(einst)
Out[7]: {'a': 10, 'ad': 40, 'ae': 50}

Ok that's pretty straightforward. In this way super is used to pass off something to its parent class. For instance, say we have a little more realistic example and the instance of C takes some timeseries data that exhibits serial correlation. Then we can have C correct for the covariance structure of the data and "pass it up" to B where B can then perform OLS on our data now that it meets the assumptions of OLS. Further B can pass this data to A and return some descriptive statistics for our data. But remember these are 'bound' class objects, so they're all attributes to our original instance of C. Neat huh? Okay, now let's look at a pretty simple example of multiple inheritance.

class F(C,E):
def __init__(self, a): = a*6
print 'executing F().__init__'

For this example we are using the class of D, that has super commented out.

In [8]: finst = F(10)
executing F().__init__
executing C().__init__
executing B().__init__
executing E().__init__
executing D().__init__

In [8]: vars(finst)
Out[8]: {'ab': 20, 'ac': 30, 'ad': 40, 'ae': 50, 'af': 60}

The first time I saw this gave me pause. Why isn't there an finst.a? I was expecting the MRO to go F -> C -> B -> A -> E -> D -> A. Let's take a closer look. The F class has multiple inheritance. It inherits from both C and E. We can see F's method resolution order by doing

In [9]: F.__mro__
(<class '__main__.F'>,
<class '__main__.C'>,
<class '__main__.B'>,
<class '__main__.E'>,
<class '__main__.D'>,
<class '__main__.A'>,
<type 'object'>)

Okay, so we can see that for F A is a subclass of D but not B. But why?

In [10]: A.__subclasses__()
Out[10]: [<class '__main__.B'>, <class '__main__.D'>]

The reason is that A does not have a call to super, so the chain doesn't exist here. When you instantiate F, the hierarchy goes F -> C -> B -> E -> D -> A. The reason that it goes from B -> E is because A does not have a call to super, so it can't pass anything to E (It couldn't pass anything to E because the object.__init__ doesn't take a parameter "a" and because you cannot have a MRO F -> C -> B -> A -> E -> D -> A as this is inconsistent and will give an error!), so A does not cause a problem and the chain ends after D (remember that D's super is commented out, but if it were not then there would be finst.a = 10 as expected). Whew.

I'm sure you're thinking "Oh that's (relatively) easy. I'm ready to go crazy with super." But there are a number of things must keep in mind when using super, which makes it necessary for the users of super to proceed carefully.

1. super() only works with new-style classes. You can read more about classic/old-style vs new-style classes here. From there you can click through or just go here for more information on new-style classes. Therefore, you must know that the base classes are new-style. This isn't a problem for our project right now, because I have access to all of the base classes.

2. Subclasses must use super if their superclasses do. This is why the user of super must be well-documented. If we have to classes A and B that both use super and a class C that inherits from them, but does not know about super then we will have a problem. Consider the slightly different case

class A(object):
def __init__(self):
print "executing A().__init__"
super(A, self).__init__()

class B(object):
def __init__(self):
print "executing B().__init__"
super(B, self).__init__()

class C(A,B):
def __init__(self):
print "executing C().__init__"
# super(C, self).__init__()

Say class C was defined by someone who couldn't see class A and B, then they wouldn't know about super. Now if we do

In [11]: C.__mro__
(<class '__main__.C'>,
<class '__main__.A'>,
<class '__main__.B'>,
<type 'object'>)

In [12]: c = C()
executing C().__init__
executing A().__init__
executing B().__init__
executing B().__init__

B got called twice, but by now this should be expected. A's super calls __init__ on the next object in the MRO which is B (it works this time unlike above because there is no parameter passed with __init__), then C explicitly calls B again.

If we uncomment super and comment out the calls to the parent __init__ methods in C then this works as expected.

3. Superclasses probably should use super if their subclasses do.

We saw this earlier with class D's super call commented out. Note also that A does not have a call to super. The last class in the MRO does not need super *if* there is only one such class at the end.

4. Classes must have the exact same call signature.

This should be obvious but is important for people to be able to subclass. It is possible however for subclasses to add additional arguments so *args and **kwargs should be probably always be included in the methods that are accessible to subclasses.

5. Because of these last three points, the use of super must be explicitly documented, as it has become a part of the interface to our classes.