If you want to design robots able to interact with the real world in a useful way, you will eventually bump into the problem of implementing robust object recognition, where by robust I mean able to recognize objects irrespective of (or at least able to tolerate variation in) distance from the object, its orientation, illumination conditions, etc.
This post describes work done at the Neuromorphics Lab, using the Cog Ex Machina software platform to recognize objects on an iRobot Create platform.
* Thanks to Jasmin Leveille for producing all the simulations, robotic demos, and much of the text below
First, a disclaimer: I work in that lab. Second, a bit of history may be useful. Many of you will recall the DARPA SyNAPSE project, which initially sponsored HP, IBM and HRL in the design of brain-inspired hardware. Along with my colleagues at the Neuromorphics Lab, I was (and still am) collaborating with HP on the design of a software framework to exploit massively parallel processors (e.g., GPUs and GPU clusters) via the Cog Ex Machina (or Cog) software platform, although our group and HP parted ways with SyNAPSE some time ago.
Now, the interesting stuff. After our initial effort to implement learning in a virtual world with Cog, we have turned our attention to learning to recognize objects. Among the candidate models for learning complex stimuli such as objects, one that marries simplicity with performance is Contrastive Divergence. Well... this is not properly an "object recognition" algorithm. It is a learning algorithm that lacks the additional, wonderful machinery the brain uses to perceive, segment, group, and in short build a meaningful set of features that can be used to really make sense of complex visual scenes. But in simple environments and fairly simple cases, Contrastive Divergence does the job.
Contrastive Divergence is a method for training Products of Experts (PoE) models by approximating the gradient of the log-likelihood (Hinton, 2002). A restricted Boltzmann machine (RBM) is an example of a PoE in which each hidden unit corresponds to one expert. The topology of an RBM can be displayed as a two-layer neural network, as in Fig.1.
Fig.1. A restricted Boltzmann machine.
How are units (or neurons) described in binary RBMs? The activity of each binary unit $y_j$ is set to 1 with probability:
$$p(y_j = 1) = \sigma\Big(\sum_i w_{ij} x_i + b_j\Big) \qquad \text{(Eq.1)}$$
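where $\sigma$ is the logistic function, $x_i$ is the input from unit i, $w_{ij}$ is the synaptic weight from unit i to unit j and $b_j$ is a bias (Hinton and Salakhutdinov, 2006). Synaptic weights can be trained by following the gradient of the log-likelihood L of the data:
$$\frac{\partial L}{\partial w_{ij}} = \langle x_i y_j \rangle_{Q^0} - \langle x_i y_j \rangle_{Q^\infty} \qquad \text{(Eq.2)}$$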
where the first and second terms correspond to the expectations over the data distribution ($Q^0$) and over the equilibrium distribution ($Q^\infty$), respectively. Unfortunately, evaluating the second term is very inefficient, and Hinton (2002) proposed instead computing the expectation over the distribution of the one-step reconstruction $Q^1$, that is:
$$\Delta w_{ij} \propto \langle x_i y_j \rangle_{Q^0} - \langle x_i y_j \rangle_{Q^1} \qquad \text{(Eq.3)}$$
Although Eq.3 does not strictly follow the gradient of the log-likelihood (Carreira-Perpiñán and Hinton, 2005), Contrastive Divergence learning has been shown to produce useful representations in various classification tasks (e.g. Hinton and Salakhutdinov, 2006), which is the main reason why we used this algorithm as a first attempt!
Despite its simplicity, Eq.3 requires that all four quantities (i.e. the input and output activities under the two different distributions) be available at the same time for one iteration of learning. This leads to subtle difficulties when the algorithm is implemented on a software architecture, such as Cog, in which “transmission” delays are present.
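To make the one-step update concrete, here is a minimal pure-Python sketch of a single Contrastive Divergence step on one binary input vector. It is not the Cog implementation: the function names, the learning rate `lr`, the visible bias `b_vis`, and the choice to use probabilities (rather than binary samples) for the reconstruction and the second hidden pass are all assumptions made for illustration. Biases are left untouched to keep the sketch short.

```python
import math
import random


def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))


def hidden_probs(x, w, b_hid):
    # p(y_j = 1 | x) = sigma(sum_i w[i][j] * x[i] + b_hid[j])
    return [sigmoid(sum(w[i][j] * x[i] for i in range(len(x))) + b_hid[j])
            for j in range(len(b_hid))]


def visible_probs(y, w, b_vis):
    # Same weights used in the reverse direction (hidden -> input).
    return [sigmoid(sum(w[i][j] * y[j] for j in range(len(y))) + b_vis[i])
            for i in range(len(b_vis))]


def sample(probs, rng):
    # Draw a binary state for each unit from its firing probability.
    return [1.0 if rng.random() < p else 0.0 for p in probs]


def cd1_update(I, w, b_vis, b_hid, lr, rng):
    """One CD-1 step on input vector I; updates w in place, returns r."""
    y = hidden_probs(I, w, b_hid)            # hidden activity for the data
    y_state = sample(y, rng)                 # binary hidden state
    r = visible_probs(y_state, w, b_vis)     # one-step reconstruction
    z = hidden_probs(r, w, b_hid)            # hidden activity for r
    for i in range(len(I)):
        for j in range(len(y)):
            # Data term minus one-step reconstruction term.
            w[i][j] += lr * (I[i] * y[j] - r[i] * z[j])
    return r
```

The four quantities `I`, `y`, `r` and `z` are all needed in the same weight update, which is the synchronization requirement discussed next.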
Fig.2 shows how delay lines can be used to synchronize the four activities needed by Eq.3. Here, the two-layer Boltzmann machine is in fact implemented as a four-layer network, whose activities are denoted I, y, r and z. I and r correspond to the activity of the input layer when an input data vector is clamped and reconstructed, respectively. y and z correspond to the activity of the hidden layer computed from the input data and from the reconstructed data, respectively. Each of the four layers is assigned a delay line of specific length (1, 2, 3 and 4 for z, r, y and I, respectively). The activity of a given layer x at a particular step of its delay line in response to the vector presented at time t is indicated in Fig.2 as x(t). Although the network is unrolled into four layers of activity, a single weight matrix w is used throughout all computations.
Fig.2. Overview of the implementation of the Contrastive Divergence algorithm on Cog.
As shown in Fig.2, the net effect of the delay lines is to synchronize the respective activities prior to a weight update.
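The effect of the delay lines can be checked with a small simulation. The sketch below is an illustration only, not Cog code: it assumes that each layer-to-layer hop costs exactly one tick, and it replaces the actual activities with the time stamp t of the input vector they derive from. The `DelayLine` class and `simulate` function are hypothetical names. The delay lengths (4, 3, 2, 1 for I, y, r, z) are those given above.

```python
from collections import deque


class DelayLine:
    """Fixed-length FIFO: a value pushed at tick s comes out at tick s + length."""

    def __init__(self, length):
        self.buf = deque([None] * length, maxlen=length)

    def push(self, value):
        oldest = self.buf[0]
        self.buf.append(value)  # maxlen drops the oldest entry automatically
        return oldest


def simulate(n_inputs, n_ticks):
    lines = {"I": DelayLine(4), "y": DelayLine(3),
             "r": DelayLine(2), "z": DelayLine(1)}
    prev = {"I": None, "y": None, "r": None}
    aligned = []
    for tick in range(n_ticks):
        # Each layer sees the previous layer's output one tick late (assumed).
        cur = {
            "I": tick if tick < n_inputs else None,  # input presented at t
            "y": prev["I"],   # hidden activity computed from I
            "r": prev["y"],   # reconstruction computed from y
            "z": prev["r"],   # hidden activity computed from r
        }
        out = {name: lines[name].push(cur[name]) for name in lines}
        if all(v is not None for v in out.values()):
            # All four activities are available: a weight update could run here.
            aligned.append((tick, out["I"], out["y"], out["r"], out["z"]))
        prev = {"I": cur["I"], "y": cur["y"], "r": cur["r"]}
    return aligned
```

Under these assumptions, every tuple returned by `simulate` carries the same time stamp in all four positions, meaning the activities that reach the weight update together all stem from the same input vector.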
To verify that the proposed delay lines lead to a suitable implementation of the Contrastive Divergence algorithm, one RBM was simulated on MNIST data. Fig.3 shows the resulting weights, which are comparable to those in previously reported implementations (Hinton, 2002).
Fig.3. Weights learned in a single RBM trained on MNIST data.
Given this initial success in learning digits, we decided to implement more realistic object recognition. The video below shows an RBM network successfully learning 8 objects in Cog Ex Machina. The scorekeeper indicates percent correct. The field "BP" is the output of the network, and should match the desired activity shown in the field to its right (shown as "De"). Sample images and extracted boundaries are shown at the top (the boundaries are extracted with some delay, so the images don't match in this display).
Finally, and most crucially, we transferred this to the robot. We chose our favorite platform, the iRobot Create, for this task. In this example, the robot is told to rotate until it finds the lizard, one of the objects the robot has been trained to recognize. When the robot recognizes the lizard in its field of view, it moves forward toward it for a few seconds. Training is done offline here with standard backpropagation on eight object classes plus a dummy class for the black background. The model is a feedforward deep network with one input layer (the image), two hidden layers, and one output layer. The input size is 160x160 (relatively big compared to many published object recognition works), each hidden layer is 30x30, and the output layer is 9x1.
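To give a sense of scale for these layer sizes, the short sketch below counts the trainable parameters of such a network. It assumes fully connected layers with one bias per unit, which the description above does not state explicitly; `count_parameters` is a hypothetical helper.

```python
# Layer sizes from the description: image, two hidden layers, output.
layer_sizes = [160 * 160, 30 * 30, 30 * 30, 9]


def count_parameters(sizes):
    # Assumes dense connections between consecutive layers plus one bias per unit.
    return sum(n_in * n_out + n_out for n_in, n_out in zip(sizes, sizes[1:]))


total = count_parameters(layer_sizes)
```

Under that assumption the first weight matrix alone (25,600 inputs by 900 hidden units) accounts for about 23 million of the roughly 23.9 million parameters, so almost all of the model's capacity sits between the image and the first hidden layer.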