The latest and greatest
Jeff Markowitz | February 23, 2009
Riesenhuber and Poggio supplied a seminal model of object recognition in 1999. It derived a lot of its power from sheer simplicity. With just a few mathematical operations it seemed to model the entirety of the ventral stream, the pathway in the brain dedicated to processing “what” information, i.e. information about the identity of an object. It starts with a layer of Gaussian-tuned ‘simple’ or S cells, which respond to particular line orientations. That is, a particular S cell might respond to a diagonal line in a particular spot in an image. Then, S cells of the same orientation within a local neighborhood feed into a ‘complex’ or C cell, whose response equals that of its most strongly activated S cell. In CS terms, the C cells take a max (the value itself, not the argmax) over a local neighborhood.
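To make the two operations concrete, here is a minimal sketch in plain Python. It is not the authors' implementation: the 3x3 template, the `sigma` value, and the function names `s_response` and `c_response` are all assumptions chosen for illustration. It shows only the general idea of a Gaussian-tuned S unit feeding a max-pooling C unit.

```python
import math

# Hypothetical 3x3 diagonal template, flattened row by row (an assumption,
# not a filter taken from the published model).
DIAGONAL = [1, 0, 0,
            0, 1, 0,
            0, 0, 1]

def s_response(patch, template, sigma=1.0):
    """Gaussian-tuned S unit: peaks at 1.0 when the patch equals the template."""
    sq_dist = sum((p - t) ** 2 for p, t in zip(patch, template))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))

def c_response(s_responses):
    """C unit: the maximum response (the value, not its index) of its S units."""
    return max(s_responses)

# Three image patches seen by three S units that share the same tuning.
patches = [
    [1, 0, 0, 0, 1, 0, 0, 0, 1],  # diagonal: exact match
    [0, 0, 1, 0, 1, 0, 1, 0, 0],  # anti-diagonal
    [0, 0, 0, 0, 0, 0, 0, 0, 0],  # blank
]

s_layer = [s_response(p, DIAGONAL) for p in patches]
c_out = c_response(s_layer)

print(s_layer)  # one response per S unit
print(c_out)    # the C unit reports only the strongest of them
```

The C unit throws away which S unit won, and that loss of positional detail is exactly what the model exploits.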
This key operation results in model cells that demonstrate invariance to position: they respond in the same way to a diagonal line anywhere within the region they pool over. For instance, each cell in a group of diagonally tuned S cells will respond differently depending on where a diagonal line falls in the input image, but the maximum over all these cells will not change as long as a diagonal line is present. So, after a few more stages of S and C cells, the model generates an invariant representation of a complex object, which is then fed to a classifier, perhaps a support vector machine (SVM). The SVM determines whether features in the invariant representation correspond to a particular category. This type of classifier presumably models the function of the prefrontal cortex, which has been implicated in categorization through work in Earl Miller’s lab. Even if the model skimmed over some of the finer details of the ventral stream, it presented a nice, easily understandable theory that you could implement in a matter of hours.
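The position invariance is easy to check with a toy example. The sketch below is an illustration under assumptions, not the published model: the 6x6 image size, the 3x3 `TEMPLATE`, and the helper names are all hypothetical. It slides a diagonal-tuned S unit over two images containing the same line at different positions, then takes the max over the whole S map.

```python
import math

# Hypothetical diagonal-tuned S template (an assumption for illustration).
TEMPLATE = [[1, 0, 0],
            [0, 1, 0],
            [0, 0, 1]]

def blank(n):
    """n x n image of zeros."""
    return [[0] * n for _ in range(n)]

def image_with_diagonal(n, row, col):
    """n x n image with a 3-pixel diagonal whose top-left pixel is (row, col)."""
    img = blank(n)
    for k in range(3):
        img[row + k][col + k] = 1
    return img

def s_map(img, sigma=1.0):
    """Gaussian-tuned S response at every 3x3 window position."""
    n = len(img)
    out = []
    for r in range(n - 2):
        row = []
        for c in range(n - 2):
            d2 = sum((img[r + i][c + j] - TEMPLATE[i][j]) ** 2
                     for i in range(3) for j in range(3))
            row.append(math.exp(-d2 / (2.0 * sigma ** 2)))
        out.append(row)
    return out

def c_unit(smap):
    """C unit pooling over the whole S map: the single largest response."""
    return max(max(row) for row in smap)

s_a = s_map(image_with_diagonal(6, 0, 0))  # line in the top-left corner
s_b = s_map(image_with_diagonal(6, 3, 2))  # same line, shifted
c_a, c_b = c_unit(s_a), c_unit(s_b)
c_blank = c_unit(s_map(blank(6)))          # no line at all

print(s_a != s_b)    # the S maps differ with position
print(c_a == c_b)    # the pooled C response does not
print(c_blank < c_a) # but it does drop when the line is absent
```

Pooling over the entire image is the extreme case. Pooling over smaller neighborhoods gives invariance only within each neighborhood, and the successive S and C stages widen that range step by step.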
Serre and colleagues have taken over the model, with the latest developments documented in a 2007 PNAS article (cited below). It has performed some very impressive technical feats, like recognizing actions in a video feed, and, after nearly a decade, the model has surprisingly retained most of its original structure. In fact, aside from adopting Gabor filters, it seems to use most of the same essential computations. But along with these computations comes a bit of theoretical baggage. Like most models of object recognition, this model works under the presumption that cells become more and more invariant as you progress through the ventral stream. In other words, the receptive fields of IT cells should be much larger than those of V1 cells (even the notion of IT cells having receptive fields is tricky, since IT has no retinotopy). This is fine as a rough approximation, but recent data from humans and monkeys cast some doubt on the specifics of this hypothesis. The ventral stream may construct an invariant representation, but it’s unclear whether it does so according to the principles of Serre et al.’s model (though this shouldn’t detract from its technical accomplishments).
Of course, at the moment, it looks like we’re woefully short on alternatives.
PNAS, 2007. DOI: 10.1073/pnas.0700622104
(Image from Flickr user Chuckumentary)






