Time as a teacher

By Derek James | June 28, 2009

Another guest editor here... I met Max at this year's ICCNS and he suggested writing a guest entry for Neurdon. The ideas hopefully complement some of the stuff Ethan blogged about.

I'm a 4th-year PhD student in the Institute of Cognitive Science at The University of Louisiana at Lafayette. When I entered the program, I was mostly interested in AI and evolutionary algorithms. I wanted to evolve a Go-playing program. But my interests shifted, especially in my first year when I read Jeff Hawkins' On Intelligence. I thought it was great stuff, and I liked two things central to his framework: 1) The temporal aspect of cognition, and 2) The crucial role of feedback. He made a convincing case that every modality and skill is essentially a matter of learning and processing sequences. So that's where I started focusing my attention.

Ethan's entry discusses the particular case of using the temporal regularities in input to form invariant object representations. I'm going to discuss the more general case of how a network model can be structured to exploit the temporal structure of input in order to form representations of sequences. I believe that if we can understand how a system can learn and subsequently process a sequence such as C->A->B in an unsupervised manner, using time as the teacher, then we've got a basis for understanding many cognitive processes, from the formation of representations of music and language, to visual object recognition, to somatosensory representations, and so on.

I believe the way in which we learn a song and the way in which we learn to recognize a toy duck are very similar. Both are learned and processed sequentially, with one important distinction. The song is order-sensitive, i.e. the order of the input matters, whereas the sequence representation of an object is order-invariant. Whether the duck moves from right to left or left to right, or is rotated clockwise or counterclockwise, or is zoomed in and out, you still recognize it as a duck.

I began working on a model of unsupervised sequence learning that focuses on order-sensitive sequences, under the assumption that such a model could then be generalized to learning and processing order-invariant sequences.

I was a bit frustrated that in Hawkins' book he didn't provide many details about how learning might occur. He briefly mentions Hebbian learning and says that it could be an underlying mechanism for the framework he proposes, but he doesn't say how. Hawkins has gone on to co-found a company devoted to implementing his vision with Dileep George. So I followed up by reading George's 2008 PhD dissertation "How the Brain Might Work: A Hierarchical and Temporal Model for Learning and Recognition". The model is a hierarchical Bayesian approach which uses a method called temporal pooling. If I'm understanding it correctly, the model is exposed to a bunch of stimuli and then builds representations based on the likelihood of a stimulus at a particular point in time transitioning to another stimulus at the next time step. This is essentially how learning occurs in nodes at lower levels of the hierarchy. What I was a bit disappointed to find out is that the top-level node, which associates these representations with a unique identifier (e.g. "dog"), is supervised.
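To make the "likelihood of one stimulus transitioning to another" idea concrete, here is a minimal sketch that estimates first-order transition probabilities from a stream of symbols. This is only an illustration of that one idea, not George's actual temporal pooling algorithm; the function name and the example stream are made up for this post.

```python
from collections import Counter, defaultdict

def transition_probabilities(stimuli):
    """Estimate P(next | current) from consecutive pairs in a stimulus stream."""
    counts = defaultdict(Counter)
    for current, nxt in zip(stimuli, stimuli[1:]):
        counts[current][nxt] += 1
    return {
        current: {nxt: n / sum(followers.values()) for nxt, n in followers.items()}
        for current, followers in counts.items()
    }

# Hypothetical stimulus stream: "a" is always followed by "b",
# while "b" is followed by "c" twice and by "d" once.
stream = list("abcabcabd")
probs = transition_probabilities(stream)
print(probs["b"])
```

Stimuli that reliably follow one another in time end up with high transition probabilities, which is the kind of regularity a temporal grouping step could exploit.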

So I set out to design a simple model that was grounded in neurophysiological learning mechanisms that tried to answer this central question:

How might order-sensitive sequences be learned in an unsupervised way, exploiting the temporal structure of the input?

I chose to use a neural network model composed of leaky integrate-and-fire units. The learning mechanism would be spike-timing dependent plasticity (STDP), which seemed a natural fit for learning causal relationships in time-varying input.
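For readers unfamiliar with these two ingredients, here is a minimal sketch of a leaky integrate-and-fire unit and a pair-based STDP rule. All parameter values (time constants, thresholds, learning amplitudes) are assumptions chosen for illustration, not the values used in my model.

```python
import math

def simulate_lif(input_current, dt=1.0, tau=10.0, v_rest=0.0, v_thresh=1.0, v_reset=0.0):
    """Return the time steps at which a leaky integrate-and-fire unit spikes."""
    v = v_rest
    spikes = []
    for step, current in enumerate(input_current):
        # Leak toward rest and integrate the input (forward Euler step).
        v += dt * (-(v - v_rest) + current) / tau
        if v >= v_thresh:
            spikes.append(step)
            v = v_reset  # reset the membrane potential after a spike
    return spikes

def stdp_delta(t_pre, t_post, a_plus=0.1, a_minus=0.12, tau_stdp=20.0):
    """Pair-based STDP: potentiate when pre precedes post, depress otherwise."""
    lag = t_post - t_pre
    if lag > 0:
        return a_plus * math.exp(-lag / tau_stdp)
    if lag < 0:
        return -a_minus * math.exp(lag / tau_stdp)
    return 0.0

weak_spikes = simulate_lif([0.5] * 100)   # drive too weak to ever reach threshold
strong_spikes = simulate_lif([2.0] * 20)  # stronger drive produces regular spiking
print(weak_spikes, strong_spikes)
print(stdp_delta(10, 15), stdp_delta(15, 10))
```

The asymmetry of the STDP rule is what makes it a natural fit here: a synapse is strengthened only when its presynaptic unit fires shortly before the postsynaptic one, which is a crude signature of "this came first".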

The basic approach I used was that of recruitment learning [Feldman, 1982; Valiant, 1994], in which a random graph structure allocates representations to "free" nodes through the conjunction of activity of nodes feeding into them. But we know the neocortex is not a random graph structure. The neocortex has a high degree of hierarchy and regularity in its connectivity patterns. So how might that structure be used to recruit representations of sequences?

Here's the basic building block for my model:

[Figure 1]

In its current state, the model only uses feedforward excitatory connectivity, though I plan to incorporate feedback in later iterations. Basically this module consists of input nodes a and b, delay units a' and b', and output units 1 and 2. The delay units mirror the state of units a and b, delayed by some interval. In actual neural systems, the delays may be a result of axonal length, synaptic delays, or other mechanisms. We know there are delays in the transmission of signals in the brain. The underlying motivation here is that those delays are actually a feature which exploits the temporal nature of input in order to learn and process. In other words, it's a feature, not a bug. The connection with circles denotes a winner-take-all (WTA) mechanism.

So how does the network learn? Say we want to learn the sequence b->a. First, the network is exposed to stimulus b. Input node b fires, but the weights are initialized such that activity from a single presynaptic unit is not sufficient to drive the firing of a postsynaptic unit, so nothing happens.

Next, the network is exposed to stimulus a, so unit a fires. But if the temporal structure of the input is such that the interval between the stimuli is the same as the delay interval for b', then units a and b' fire simultaneously, and their combined output is sufficient to drive an output unit to fire (which output unit fires is determined by the random initialization of the weights). The WTA mechanism ensures that only one unit in this layer will fire. Let's say that in this example, unit 2 was driven to fire by the presynaptic activity of a and b'. Long-term potentiation would be induced on connections a-to-2 and b'-to-2 because of their successive pre-post firing. And thus, unit 2 would be recruited to represent the sequence b->a, so that it would reliably fire upon the presentation of b then a, and not fire in the presence of any other sequence. When the connections a-to-2 and b'-to-2 are strengthened due to LTP, all other incoming connection weights to unit 2 are weakened due to heterosynaptic LTD, ensuring that unit 2 cannot be recruited to represent another sequence, thus avoiding the classic problem of hash collisions.
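The recruitment steps above can be sketched in a few lines. This is a deliberately abstract, discrete-time version: binary units instead of leaky integrate-and-fire neurons, and a "set to maximum / set to zero" rule standing in for STDP-driven LTP and heterosynaptic LTD. The threshold, weight ranges, and function names are all assumptions made for illustration.

```python
import random

INPUTS = ["a", "b", "a'", "b'"]  # current inputs and their delayed copies
THRESHOLD = 1.0                  # assumed firing threshold of an output unit
W_MAX, W_MIN = 0.7, 0.0          # assumed weights after LTP / heterosynaptic LTD

def init_weights(n_outputs=2, seed=0):
    rng = random.Random(seed)
    # One input alone stays below threshold; any two together reach it.
    return [{name: rng.uniform(0.5, 0.6) for name in INPUTS} for _ in range(n_outputs)]

def present(weights, sequence, delay=1, learn=True):
    """Present a two-symbol sequence, one symbol per time step.
    Returns the index of the last output unit that fired, or None."""
    fired = None
    for t in range(len(sequence) + delay):
        active = set()
        if t < len(sequence):
            active.add(sequence[t])                # input unit fires now
        if 0 <= t - delay < len(sequence):
            active.add(sequence[t - delay] + "'")  # delay unit mirrors the past
        drives = [sum(w[name] for name in active) for w in weights]
        winner = max(range(len(weights)), key=lambda i: drives[i])  # WTA
        if drives[winner] >= THRESHOLD:
            fired = winner
            if learn:
                # LTP on the active inputs, heterosynaptic LTD on all others.
                for name in INPUTS:
                    weights[winner][name] = W_MAX if name in active else W_MIN
    return fired

weights = init_weights()
unit_ba = present(weights, ["b", "a"])  # one exposure recruits a unit for b->a
unit_ab = present(weights, ["a", "b"])  # the remaining free unit is taken by a->b
print(unit_ba, unit_ab)
print(present(weights, ["b", "a"], learn=False), present(weights, ["a", "a"], learn=False))
```

After a single exposure, the recruited unit responds to b->a and to nothing else, and the reversed sequence a->b ends up on a different unit, which is the order sensitivity described above.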

Now we can use this building block as the basis for building larger and, more importantly, hierarchical networks.

[Figure 2]

Here is a 4-layer network built on the same principles. If the network is exposed to sequential input in the first layer, the second layer will form sequence representations of length 2 (e.g. b->a), the third layer representations of length 4 (e.g. b->a->c->d), and the fourth layer representations of length 8. There is evidence suggesting the presence of temporal receptive fields analogous to spatial receptive fields [Hasson et al. 2008], and this model is based on such an idea.
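Here is a purely symbolic sketch of how the representation length doubles at each layer. It skips the neurons and weights entirely and just tracks which subsequence each recruited unit would stand for. The assumption that the delay doubles from one layer to the next (1, 2, 4 time steps) is mine for this illustration, as are the function names and the example sequence.

```python
def next_layer(events, delay):
    """events maps a time step to the representation (tuple of symbols) active then.
    A unit in the next layer stands for the conjunction of the representation
    active now and the one that was active `delay` steps earlier."""
    return {
        t: events[t - delay] + rep
        for t, rep in events.items()
        if t - delay in events
    }

def run_hierarchy(sequence, n_layers=4):
    """Return one {time: representation} dict per layer, lowest layer first."""
    events = {t: (symbol,) for t, symbol in enumerate(sequence)}
    layers = [events]
    delay = 1
    for _ in range(n_layers - 1):
        events = next_layer(events, delay)
        layers.append(events)
        delay *= 2  # assumed: each layer's delay spans one lower-level representation
    return layers

layers = run_hierarchy(list("bacdefgh"))
lengths = [len(next(iter(layer.values()))) for layer in layers]
print(lengths)
print(layers[2][3], layers[3])
```

Because each layer joins two representations from the layer below, the length of the sequence a single unit can stand for grows as a power of two with depth, while the lower-level subsequences are reused rather than relearned.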

This model is capable of learning sequence representations in an unsupervised way, in one shot. It is hierarchical, so there is reuse of subsequences that have already been learned to form new higher-level representations. The model demonstrates how a representation of a sequence might be formed due to the conjunction of the current and delayed activity of neurons, representing the present and immediate past states of the environment. As mentioned earlier, this version learns order-sensitive sequences, but the same basic mechanisms could work to form order-invariant sequence representations, and the model's dimensionality could be expanded in order to handle sequences of 2D images.

That's not to say it doesn't have problems. It's extremely brittle, due to its size and specific delay tuning. The loss of a single unit breaks it, and even slight perturbations to the temporal structure of the input break it as well. But I hope it's a good starting point for elaborating more robust models of unsupervised sequence learning using the principle of time as the supervisory signal.
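The timing sensitivity is easy to see in a toy check. The sketch below asks whether the delayed copy of a first stimulus lines up with a second stimulus. The `window` parameter is a hypothetical tolerance that the current model does not have; it is included only to show how much slack would be needed to absorb a given amount of jitter.

```python
def conjunction_detected(t_first, t_second, delay, window=0):
    """True if the delayed copy of the first stimulus coincides with the second.
    Times are in arbitrary integer steps; window=0 means exact coincidence."""
    return abs((t_first + delay) - t_second) <= window

on_time = conjunction_detected(0, 10, delay=10)              # interval matches the delay
jittered = conjunction_detected(0, 12, delay=10)             # two steps late
tolerated = conjunction_detected(0, 12, delay=10, window=3)  # hypothetical tolerance
print(on_time, jittered, tolerated)
```

With exact coincidence required, an input that arrives even slightly early or late never produces the conjunction, so nothing is recruited or recognized.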

Derek James
The Institute of Cognitive Science at The University of Louisiana at Lafayette

7 Responses to Time as a teacher

  1. Derek,

    thanks for the very interesting post.

    Some of your ideas intersect in certain respects with those of Eugene Izhikevich. He visited our department about 2 years ago, and stayed for a full day, during which I had the chance to appreciate his very original ideas. One of them is known as polychronization. I would check it out. Some of the concepts are similar to what you propose: synaptic delays are a feature, not a defect. They are used to help patterns of invariant activation to form in random networks of spiking neurons with STDP learning. In reality, polychronization networks are everywhere… where you do not constrain networks of spiking neurons by imposing fixed delays. Even in my work I had polychronizing networks, but I did not know it until Eugene pointed it out!

    You seem to add an interesting twist to this polychronization story, something in between Jeff’s and Eugene’s ideas. Very intriguing. I would be curious to learn more of… what you are learning. One big issue, which you start alluding to, is the robustness of the network. Since timing is so important, a change in stimulus (or other cells’ output) intensity is reflected in a change in spike timing, which in turn has a profound effect on what a cell learns, namely what sequence is stored via STDP.

    Max

  2. Ethan says:

    hey, yes, these are interesting ideas. if you are looking for some neurophysiological evidence in IT and PFC that might support these types of ideas, you might want to check out a paper i worked on last year that appeared in the journal of neurophysiology. in the paper i showed that different neurons seem to contain the same information at different points in time relative to the start of a trial – which is similar to the delay units described above. in the paper i also speculated that such a representation could be used to learn to recognize sequences of objects so it seems that we’re all converging on similar ideas (although it is also possible that these changing representations could be related to the fact that information was being transformed from a visual format to an abstract format, or that these changing patterns could be useful in preventing current information from destroying information about what was previously seen, so without additional work it is tough to say for sure).

  3. Derek James says:

    Max: Thanks…yes, Izhikevich was a plenary speaker at IJCNN and I really enjoyed his talk. I was a bit skeptical about constructing large-scale models with hundreds of thousands of units when we’re still ignorant about how very small, local networks function, but I’m better convinced now that such approaches might yield interesting insights. There was a student in our program who studied polychronous groups last semester, attempting to determine how the number of such groups scales with various parameters like network size and connectivity. There’s definitely some conceptual overlap with my interests and what Izhikevich is doing, although he’s definitely working on a much larger scale. And I’ve read a couple of his papers, but I’m not quite sure about the link between representation and polychronous groups, i.e. how a polychronous group forms in response to stimuli and reliably activates in the presence of those stimuli. It seemed to me from what I had read that Izhikevich wasn’t directly tackling representation, just proposing that there are these sorts of groups and they could be very important in how the brain works. When he visited Boston, did you talk about this particular issue?

    Ethan: Thanks for the reference. I’ll definitely check it out. I do hope these similar ideas are on the right track. :)

  4. Derek,

    you are right on the spot with Eugene’s. He did not seem to have an answer to the question of how you stably reactivate a representation over time if you have STDP that continuously sculpts your synapses. This is what my advisor, Stephen Grossberg, calls the “stability-plasticity dilemma”. It is in fact a dilemma, and I have been working on these issues for quite some time. These sorts of problems are also key in our new SyNAPSE grant, and indeed in any neural circuit that has online plasticity (namely, where you do not constrain learning artificially). Maybe Eugene has some new results on the topic? Should we ask him to do a post here???

    I can write him an email!

    Max

  5. Tim Barnes says:

    Derek,

    Thanks for sharing the interesting idea. I’m particularly intrigued that your model predicts that there is at least a theoretical capacity for representations to increase exponentially in complexity per sequencing layer. If the temporal sequence learning idea is similar to more conventional receptive fields, the idea may help highlight why ‘object representations’ can already be so complex/invariant in IT (speaking from the vision paradigm, sadly the only one I know well).

    I think your title is clever, but I have to admit I was stopped in confusion for a minute by the idea of “…unsupervised learning, using time as a teacher, …” I figure you mean that the unsupervised learning, clustering, whatever you want to call it, occurs in the temporal domain, and that is the extent of it, but still, I just wanted to give you a heads up in case someone else calls you out on that apparent paradox.

    Coming from the vision paradigm again, if you’re looking for some leads on how to incorporate feedback into your model, you might want to take a look at Reverse Hierarchy Theory if you haven’t done so already. The general idea is that feedback learning will alter lower layers to better respond to the pieces that originally built the complex representation at higher layers. This would, in a sense, specialize the network but also increase the rate at which ‘word length’ increases while traveling up the model layers. This would also push the top layers to reach their maximum capacity as well; I’m envisioning something like eliminating overlaps in sequences, e.g. {(a->c->b->d), (c->b->d->e), (b->d->e->f)} into {(a->c->b->d), (b->d->e->f), (room for more…)} Then again, I don’t think RHT is the most concrete theory in the world, so it might not be worth more than a quick look.

    Thanks again!

  6. Derek,

    Really interesting piece of text! Jeff Hawkins’ description is supposed to be an abstract model of the cortex. Where can I position your model in the brain? As an engineer I can imagine that there are multiple implementations of hierarchical, sequential ordering possible, one more fit than the other for certain circumstances. For example, it can even be implemented on a single neuron level to detect the time difference between sounds arriving at the “ears”, it’s one of the Izhikevich neuron types ;-) . And for example at the basal ganglia side, which is often said to implement action selection, it makes sense to have there some “internal supervised/reinforcement” signal in the sense of dopamine. And it wouldn’t be wise of evolution not to speculate about the possible consequences of its actions. So, is your model supposed to govern storing of order in the cortex?

    Moreover, order is something quite peculiar, it’s so easy to get combinatorial explosions, in the case it is not recognized that there is NO order involved somewhere. If A,B,C and D are received in different orders, one at a time, while their actual occurrence is random, 4! permutations have to be received (multiple times) to comprehend that fact. So, a real-world system as the brain (on the cortex level) has probably some neural circuitry involved that selects “candidates”. I consider in this context stability-plasticity as being able to store a sequence of events and dismiss this sequence later, depending on a sort of vigilance parameter. Do you foresee some stability-plasticity solution within your hierarchy by including recurrent connections?

    And actually a very stupid question. Is your hierarchy supposed to store a pattern like: A,B,C and D,B,A? So, A and B in different order depending on the context?

    However, I am a layman still regarding neuroscience… I hope that will end soon! :-) Thanks for your article,

    Anne

  7. Derek James says:

    Sorry for the late responses…

    @Tim

    Nice that you picked up on the aspect of the model that it accounts for non-linear scaling of sequence lengths.

    I’m not too worried about the apparent contradiction in calling the way in which the model learns “unsupervised” and describing time as the teacher. I’m using “unsupervised” in the traditional machine learning sense, in contrast to supervised paradigms, in which the desired outputs of the system are known in advance and the difference between actual and desired output is used to generate an error signal that drives the update rule for learning. Though in the future it’s probably not a bad idea to explicitly state that.

    And thanks for the pointer to Reverse Hierarchy Theory. I hadn’t heard of it.

    @Anne

    Like Hawkins’, mine is also meant to be a very simple, abstract model of neocortex.

    You bring up reinforcement learning…and yes, this model could easily be augmented with a reinforcement signal.

    As for the type of patterns the particular network shown can store, it actually cannot store A-B-C or D-B-A. This toy example can only store sequences of length 2, 4, and 8, such as A-B-D-C. But yes, distinct nodes are recruited to represent the forward and backward sequences, e.g. different nodes encode A-B and B-A. I don’t know if this answers your question. If not, please let me know.
