Max asked me to post some information about how time could act as a ‘supervising’ learning signal to create invariant representations in IT (particularly in reference to Jim DiCarlo’s work in this area). Since I am lazy, the post below is a modified section of the background from my thesis proposal - hopefully it’s not too boring….
One framework for studying IT is to ask which transformations of a given stimulus its neurons are invariant (or tolerant) to (Ashbridge and Perrett 1998; Rolls 2000). The logic behind this approach is that primates need to recognize objects under a variety of viewing conditions, such as changes in the object’s size, position, ambient illumination and surrounding clutter, all of which give rise to very different retinal images of the same object; thus, if IT is critical for robust object recognition, there should be neurons in this brain region that respond similarly despite changes in such parameters. Several single-neuron studies examining this issue have found neurons in AIT that respond similarly to images of an object despite significant changes in position and size (Ito et al. 1995; Lueschow et al. 1994; Miyashita and Chang 1988; Schwartz et al. 1983), which supports this theory (although it should be noted that the majority of neurons in IT do seem to respond best at a particular size/position[1] (Ashbridge and Perrett 1998; DiCarlo and Maunsell 2003; Ito et al. 1995; Lueschow et al. 1994)). Other studies have looked at more complex transformations of stimuli, including shape defined by texture and motion (Sary et al. 1993), mirror reversals of shapes (Baylis and Driver 2001; Rollenhagen and Olson 2000), contrast changes/reversal (Baylis and Driver 2001; Zoccolan et al. 2007), and rotation of familiar 3D shapes (Booth and Rolls 1998; Logothetis et al. 1995), and have also found neurons in IT that respond similarly despite these changes in stimulus properties (although again, many neurons are tuned to particular ranges of parameters, and there does seem to be a tradeoff between how selectively a neuron responds to particular stimuli and how tolerant it is to different transformations (Zoccolan et al. 2007)).
Learning mechanisms have also been proposed that can explain how neurons in IT could acquire such invariant properties, which brings together IT’s involvement in learning with IT’s involvement in shape discrimination. In such theories, the fact that the transformations that images of objects undergo in the world are often temporally contiguous is combined with an associative learning rule, so that neurons become invariant to particular object transformations (Foldiak 1991; Wallis and Rolls 1997; Wiskott and Sejnowski 2002). For example, since an object is seen at slightly different sizes in a precise temporal sequence as one approaches it, a Hebbian learning rule could cause a downstream neuron to pool together the responses of two upstream neurons that each respond only to one image of a particular size, because the upstream neurons fire in close temporal proximity; thus after such learning, the downstream neuron would respond similarly to a particular object regardless of the object’s size. Similar mechanisms could explain the position-, illumination- and view-tolerant neuronal responses found in IT. Psychophysical evidence has shown that humans indeed experience perceptual learning for temporally contiguous images, such that images that occurred in a temporal sequence are subsequently perceived as being more similar (Cox et al. 2005; Wallis and Bulthoff 2001), and computational models have been built around this principle (Foldiak 1991; Wiskott and Sejnowski 2002; Serre et al. 2005). Also, recent neurophysiological experiments have shown that temporal binding can indeed change neurons’ selectivity and create ‘false invariances’, which is strong support for this theory (Li and DiCarlo 2008).
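To make the size example concrete, here is a minimal sketch of a trace-style Hebbian rule of the kind proposed by Foldiak (1991): the weight change is proportional to a decaying trace of the downstream neuron's recent activity, so an input that arrives shortly after the neuron has fired gets strengthened. Everything specific here is an assumption for illustration: the four hypothetical upstream units, the single downstream neuron with no competition, the parameter values, and the choice to show each object both approaching and receding with the trace reset between sequences.

```python
import numpy as np

# Hypothetical upstream units: each responds to exactly one image.
UNITS = ["A_small", "A_large", "B_small", "B_large"]


def one_hot(name):
    """Upstream population response to one image (a single active unit)."""
    x = np.zeros(len(UNITS))
    x[UNITS.index(name)] = 1.0
    return x


def train_trace_rule(sequences, w_init, alpha=0.2, delta=0.5, epochs=200):
    """Train one downstream neuron with a trace rule.

    trace(t) = (1 - delta) * trace(t-1) + delta * y(t)
    w       += alpha * trace(t) * (x(t) - w)
    """
    w = np.array(w_init, dtype=float)
    for _ in range(epochs):
        for seq in sequences:
            trace = 0.0  # assumption: a long gap separates sequences
            for name in seq:
                x = one_hot(name)
                y = float(w @ x)  # feedforward response before learning
                trace = (1.0 - delta) * trace + delta * y
                w += alpha * trace * (x - w)
    return w


# Each object is seen at two sizes in close temporal succession,
# both while approaching (small -> large) and receding (large -> small).
sequences = [
    ["A_small", "A_large"],
    ["A_large", "A_small"],
    ["B_small", "B_large"],
    ["B_large", "B_small"],
]

# The downstream neuron starts out responding only to object A at the small size.
w = train_trace_rule(sequences, w_init=[1.0, 0.0, 0.0, 0.0])
responses = {name: float(w @ one_hot(name)) for name in UNITS}
print(responses)
```

After training, the neuron's response to object A is roughly the same at both sizes, while it stays silent for object B, which it never responded to in the first place. In this toy version, showing only the small-to-large order would gradually shift the neuron's preference to the large image rather than pooling the two, which is why both orders are included.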
[1] However, even when neurons do respond more at a particular size/location, the rank order of stimulus selectivity for almost all neurons seems to remain the same at the preferred and non-preferred locations; thus the changes at a preferred size/location can best be explained in terms of a change in the gain of the neuron’s response (DiCarlo and Maunsell 2003; Ito et al. 1995; Lueschow et al. 1994).
Nice post. Question: what is the main difference between your lab’s modeling approach and Jeff Hawkins’s Numenta platform? It seems that temporal contiguity as a teaching signal is a common theme, but, while sitting with Jeff during his visit to our Department two years ago, I learned that there are various “clustering” stages that are performed both in space and in time to achieve invariance. Thanks!
Max Versace
Cool post. I’m a PhD student in Cognitive Science at The University of Louisiana at Lafayette. Ethan, were you at ICCNS? If so, too bad we didn’t get a chance to chat. My dissertation work centers around this very topic, a hierarchical spiking neural network model that learns sequences in an unsupervised manner using STDP and the temporal structure of the input to learn representations. I found DiCarlo’s talk at ICCNS very interesting, having read the Cox et al. 2005 paper but not being aware of the Li and DiCarlo work from last year.
And hello Max. Too bad we couldn’t get a game together at the conference. I believe George and Hawkins refer to the clustering stages you’re talking about as “temporal pooling”. I read through George’s dissertation, but for me there was a significant gap between how the model was learning invariant representations and how biological neurons might implement similar algorithms. Luckily this provided motivation for my dissertation work.
As for the swap paradigm used in both the human and monkey studies, I was wondering how well a paradigm would work in which a subject observes a moving object whose image is swapped periodically (ideally only for a short time, possibly so short that they can’t subjectively notice the swap) as the subject tracks the stimulus moving across a field. Similar experiments could also be done by swapping while rotating the stimulus, or swapping while increasing and/or decreasing its size. Has the swap paradigm been used in this way?
Hi Derek, good to hear from you!…. now, I want to know more about what you do. In particular, learning invariant object representations (or ANY pattern, as far as I am concerned) via spiking dynamics and STDP is something I am craving to know more about…how does your simulation approach differ from Jeff’s and DiCarlo’s?
Max
[...] to Ethan, you’re starting to make this whole enterprise a little less incestuous! Anyway, your recent post raises a number of interesting issues regarding inferotemporal cortex (IT), most prominently: how [...]
Are any of you going to be at the IJCNN next week? I’m presenting a paper there where I talk about my model.
It’s a bit too long to go into much detail in a blog comment, but the basic idea is that the model is hierarchical, with standard forward connections and also delay lines between layers. Representations are formed by the conjunctive activity of current and delayed activation from units in lower layers, effectively binding past and present activity at a higher level in the hierarchy. The successive firing of the units in the lower layer, closely followed by the firing of the unit in the higher layer, strengthens the weights between them, effectively recruiting the higher-level unit to represent the conjunction.
This probably isn’t very clear from the brief description. Max, you had mentioned possibly contributing a guest entry. If you like I can try to put something together after the IJCNN. Just let me know.
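A toy sketch may help picture the delay-line idea in the comment above. It is not the actual model: it uses binary units in discrete time and a simple normalized Hebbian update as a stand-in for STDP, and the unit labels, delay, threshold and initial weights are all assumptions chosen for illustration. A higher-level unit sees the current lower-layer activity plus a delayed copy of it, and whenever it fires, the inputs that were active at that moment are strengthened.

```python
import numpy as np

N_LOWER = 2       # lower-layer units: index 0 = "A", index 1 = "B" (hypothetical)
DELAY = 1         # delay-line length in time steps (assumed)
THRESHOLD = 0.55  # firing threshold of the higher-level unit (assumed)


def higher_unit_inputs(seq):
    """Yield [current activity, delayed activity] for each time step."""
    frames = [np.eye(N_LOWER)[i] for i in seq]
    for t, cur in enumerate(frames):
        delayed = frames[t - DELAY] if t >= DELAY else np.zeros(N_LOWER)
        yield np.concatenate([cur, delayed])


def run(seq, w, lr=0.0):
    """Return True if the higher unit fires during seq.

    If lr > 0, w is updated in place each time the unit fires: weights move
    toward the (normalized) pattern of inputs active at that moment.
    """
    fired = False
    for inp in higher_unit_inputs(seq):
        if w @ inp >= THRESHOLD:
            fired = True
            if lr > 0:
                w += lr * (inp / inp.sum() - w)
    return fired


# Weight order: [current A, current B, delayed A, delayed B].
# The unit starts with a weak bias toward "A followed by B".
w = np.array([0.2, 0.3, 0.3, 0.2])

for _ in range(50):
    run([0, 1], w, lr=0.2)  # repeatedly present the sequence A -> B

print(np.round(w, 3))
print(run([0, 1], w), run([1, 0], w), run([1], w))
```

After training, the weights concentrate on "current B" and "delayed A", so the unit fires for A followed by B but not for the reverse order or for B on its own; in that sense it has been recruited to represent the conjunction of past and present activity.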
Hi Derek,
sounds really interesting….yes, please, go ahead and prepare a post… figures will be helpful!
Max
[...] Another guest editor here… I met Max at this year’s ICCNS and he suggested writing a guest entry for Neurdon. The ideas hopefully complement some of the stuff Ethan blogged about. [...]