I've thought a bit about how modelers approach brain areas whose functions are still not very well constrained by robust neurophysiological data. By this, I mean that there is simply not enough data to say, in plain terms, what that particular brain area does. In terms of visual cortex, this pretty much accounts for all areas beyond V1, namely V2, V3, V4, posterior IT (ITp), anterior IT (ITa), which all form a loose hierarchy (in the order they're listed), and whatever areas of the temporal lobe may be 'visual', e.g. entorhinal. These words may sound a bit harsh, or even better, like flame-bait. Yet, when a major computationalist publishes an article titled "How Close Are We to Understanding V1?" (to be read in the accusatory sense), and one takes into account that V1 is supposed to be the one area neuroscience figured out decades ago, well, that changes things.
Plenty of electrophysiology and anatomy has been conducted 'beyond' V1, especially in V2, which has been carefully studied for the past twenty years, so at least there isn't a paucity of data. The electrophysiology, though, normally tests the response of V2 cells to stimuli that were used to probe V1, e.g. oriented bars, light patches, etc. At first glance, it appears that V2 simply does what V1 does at a larger scale, which is a view I weakly subscribe to since, frankly, I don't have any better ideas. So, when modeling V2, we all tend to take whatever filter we used to simulate V1 and simply increase its size. Of course, data like this, for example, make the picture a lot more complicated than we'd like to admit. Even if we model V2 with simple filtering operations, perhaps using a difference-of-Gaussians kernel or a Gabor function, a la V1, it seems that we already miss an important computation, whatever that may be.
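To make the "same filter, bigger size" habit concrete, here is a minimal sketch in Python with NumPy. Everything in it is an assumption for illustration: the function names, the base parameters, and especially the scale factor of 2 between the "V1-like" and "V2-like" filters, which is not a measured value. It only shows the modeling shortcut, not what V2 actually computes.

```python
import numpy as np


def gabor_kernel(size, wavelength, theta, sigma, phase=0.0):
    """Gabor patch: a cosine carrier under an isotropic Gaussian envelope.

    size is the (odd) side length in pixels; wavelength and sigma are in pixels;
    theta is the direction along which the carrier varies, in radians.
    """
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1].astype(float)
    x_r = x * np.cos(theta) + y * np.sin(theta)      # coordinate along the carrier
    envelope = np.exp(-(x**2 + y**2) / (2.0 * sigma**2))
    kernel = envelope * np.cos(2.0 * np.pi * x_r / wavelength + phase)
    kernel -= kernel.mean()                          # remove the DC response
    return kernel / np.linalg.norm(kernel)           # unit L2 norm


def scaled_filter(scale, base_size=11, base_wavelength=6.0, base_sigma=2.5, theta=0.0):
    """Hypothetical helper: stretch every spatial parameter by the same factor."""
    size = 2 * int(round(scale * (base_size // 2))) + 1   # keep the side length odd
    return gabor_kernel(size, scale * base_wavelength, theta, scale * base_sigma)


v1_like = scaled_filter(1.0)   # stand-in for a V1 receptive field
v2_like = scaled_filter(2.0)   # the "V2" filter: identical shape, twice the size
print(v1_like.shape, v2_like.shape)
```

Note that nothing new is computed at the second stage; the only free parameter that changes is the spatial scale, which is exactly the worry raised above.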
Then, it seems that V4 may code for curvature, based on work from Pasupathy and colleagues, though older studies presumed that V4 coded for spatial frequency and orientation, somewhat similar to V1 and V2. Even if Pasupathy's work is extremely promising, it's difficult to say if either view is completely correct. As such, modelers tend to approach V4 as they do V2 or V1, namely, take the same filter and increase the spatial scale again. There is nothing wrong with this approach a priori. The problem, instead, is that we simply don't have enough data to say for sure what V4 does.
Without an understanding of V2 and V4, we simply don't know how human vision works, no way around it, but many of us are itching to model IT, both the posterior and the anterior portion (and perhaps the anteroventral portion if you think it forms an additional functional compartment). This is due to the putative function of IT: object recognition. Humans are uncanny when it comes to object recognition. The cliched opening line in any project on IT is that a human can recognize an object at different sizes, scales, lighting conditions, on different planets, under water, in a cloud, in a tree, on a bus, inside the House of Lords, as written by Jack Kerouac, and on and on. Anyway, assuming IT is the key to understanding human object recognition, then deciphering IT could lead to significant advances in pretty much any defense or industry application that requires looking at something and recognizing what it is. Suffice it to say, lots of money is at stake.
I myself have worked on modeling IT, and it is an area that could use some good, computationally-rigorous hypotheses. Most models are of a speculative nature, and propose what computations might lead to what experimentalists observe in the lab, which is all to the good. A number of these models, including my own to a large extent, assume that we can approximate the ventral stream through some sort of multi-scale filtering operation followed by a classification algorithm, e.g. a support vector machine or nearest neighbor system. These give the experimental community a way to anchor the design of their experiments. To do so, we make a number of approximations to make life easier for us, and to keep the models at a reasonable scope. The nagging question, though, is whether we can hope to have a full computational picture of how IT does anything, let alone something as hopelessly complex as object recognition, without first understanding what sort of input it's supplied by V2 and V4. Without a comprehensive understanding, it's quite obvious that we can't derive a useful application for the world outside of the laboratory. The reason is that the approximations we use in our models normally come from the land of digital signal processing (DSP) or image processing, e.g. Gabors, difference of Gaussians, and scale-space. Without much data to go on, the approximations become less and less brain-like and more engineering-like (e.g. using wavelets or pyramids to approximate the ventral stream), meaning that we use engineering ideas to approximate the brain, as opposed to using the brain to derive new engineering approaches.
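For readers who haven't seen one of these models, here is a toy sketch of the "multi-scale filtering followed by a classifier" recipe, in Python with NumPy. It is not any published model, mine included. The Gabor parameters, the two scales, the global max pooling, the synthetic grating stimuli, and all function names are assumptions chosen to keep the example small.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view


def gabor_kernel(size, wavelength, theta, sigma):
    """Even-phase Gabor patch with zero mean and unit L2 norm."""
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1].astype(float)
    x_r = x * np.cos(theta) + y * np.sin(theta)
    envelope = np.exp(-(x**2 + y**2) / (2.0 * sigma**2))
    kernel = envelope * np.cos(2.0 * np.pi * x_r / wavelength)
    kernel -= kernel.mean()
    return kernel / np.linalg.norm(kernel)


def filter_valid(image, kernel):
    """Correlate image with kernel, keeping only fully overlapping positions."""
    windows = sliding_window_view(image, kernel.shape)
    return np.einsum("ijkl,kl->ij", windows, kernel)


def features(image, scales=(1, 2), thetas=(0.0, np.pi / 2)):
    """One number per (scale, orientation): the peak rectified filter response."""
    feats = []
    for s in scales:
        for theta in thetas:
            k = gabor_kernel(size=10 * s + 1, wavelength=6.0 * s,
                             theta=theta, sigma=2.5 * s)
            feats.append(np.abs(filter_valid(image, k)).max())
    return np.array(feats)


def grating(rng, vertical, size=32, period=6.0, noise=0.1):
    """Synthetic stimulus: a noisy sinusoidal grating with random phase."""
    y, x = np.mgrid[0:size, 0:size].astype(float)
    axis = x if vertical else y          # vertical bars vary along x
    phase = rng.uniform(0.0, 2.0 * np.pi)
    clean = np.sin(2.0 * np.pi * axis / period + phase)
    return clean + noise * rng.standard_normal((size, size))


def nearest_neighbor(train_X, train_y, query):
    """1-nearest-neighbor classification with Euclidean distance."""
    distances = np.linalg.norm(train_X - query, axis=1)
    return train_y[np.argmin(distances)]


rng = np.random.default_rng(0)
train_labels = np.array([0, 0, 0, 1, 1, 1])      # 0 = vertical, 1 = horizontal
train_X = np.stack([features(grating(rng, vertical=(lab == 0)))
                    for lab in train_labels])
test_labels = np.array([0, 1, 0, 1])
predictions = np.array([
    nearest_neighbor(train_X, train_labels,
                     features(grating(rng, vertical=(lab == 0))))
    for lab in test_labels
])
print(predictions, test_labels)
```

The point of the toy is how little of it comes from the brain: the filters are textbook image processing, and the classifier would work the same on any feature vector.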
This presses on why we model areas that we can't completely understand. The answer, I take it, is to provide the experimental community with the best hypotheses we can muster. As for delivering a useful application, we'll just have to see.
I guess a question is the following: what is known about the laminar structural differences between V1 and IT? This might shed some light on what they might do differently, regardless of their input (in both cases, trains of incoming spikes, after all...).
Max
As far as I know, their laminar structure is roughly similar insofar as which layers receive input, primarily 4, and which project to subsequent areas in whatever hierarchy you have in mind (vision, memory, executive function, whatever). This is far more general than you’re expecting, I’m sure, but the precise cytoarchitecture hasn’t been as thoroughly characterized as that of V1; that is, I’m not aware of too much slice work from IT. I would look at Kathleen Rockland’s lab, which has done some of the most detailed work around the upper reaches of the ventral stream (V4 through anterior IT). Either way, I think this sort of data has not resulted in the sort of comprehensive picture one would need to explain human object recognition, let alone something we could use in a technological application.
The question, at this point, is not how IT cells respond to visual stimuli, or what the biophysical properties of IT cells are (though this would be wonderful, insanely useful data to have). We have a truckload of electrophysiological data, even if some of it falls prey to the sort of difficulties pointed out in an article from Pinto et al. in PLoS. I think the general problem is that we’re looking at a filter’s response, assuming IT does some form of filtering at a reasonable level of approximation, without a clue about what its input looks like. That makes trying to derive a transfer function, perhaps one you could use in an object recognition system, impossible.
Before incorporating precise biological data into the functional picture of IT, we need a good, basic characterization at the electrophysiological level, something akin to Hubel and Wiesel’s seminal work in LGN and V1. The power of their cartoon model is that it gives a picture of how input from the retina is transformed into useful information at V1 via LGN. We have no such cartoon for V2, V4, ITp and ITa. This is not due to a lack of recordings from IT, but rather a lack of data from multiple areas that we can tie into a unified picture. Perhaps simultaneous recordings of a V4 cell that synapses onto an ITp cell would attack this problem most directly, similar to Alonso and Reid’s critical work in LGN and V1. I recall someone in DiCarlo’s lab working on something like this, or at least DiCarlo mentioned it at his last talk. We’ll see what fruit it bears.
Anyway, after establishing such a general framework, then maybe it makes sense to unpack laminar dynamics, followed by fine-grained in vitro work, etc. etc. So, in short, the answer is: not much, but maybe other coarser-grained data will give us a rough picture in the near future.