Reinforcement Learning with Trace Conditioning

By Chris Johnson | March 14, 2012

In the previous post, I introduced the project undertaken last semester. In this post, I will go into further detail on my particular task in the project: reinforcement learning. If you recall, the robot we wish to control is an iRobot Create (a vacuumless Roomba), which we have augmented with a web camera. The camera is able to pan to 150° in either direction from center.

I will first discuss how we abstracted the behavior of the Create, then discuss the learning rules, followed by a numerical simulation to verify the behavior of the rules we used.

Motor control

The iRobot Create locomotes by means of two monoaxial wheels whose speeds can be set independently, affording the robot circular and straight trajectories. For simplicity, we limit ourselves to straight trajectories (identical wheel velocities) and rotations about the center of the robot (opposite wheel velocities), and we restrict rotations to multiples of 45°. Thus the robot may move in cardinal or ordinal directions, and we fix our coordinate frame so that these behaviors are oriented with respect to the current target location.

This coordinate frame was chosen for conceptual simplicity. In a fixed or egocentric frame, each behavior must be paired with each possible target orientation in order to characterize it as an “approach” or “avoidance” behavior. In contrast, in a target-centered frame, this extra dimension is reduced, and each behavior need only be considered from one orientation.

Action selection involves mapping a target location to a probability distribution over behaviors, schematized in the following figure, and then sampling from that distribution. When an action is selected, the minimal rotation from the current heading to the target heading is calculated and executed, followed by forward motion until the time segment ends.
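As a minimal sketch of this selection step (not the code that ran on the robot), the snippet below samples one of the eight directional behaviors and computes the signed minimal rotation needed to face it. The names `BEHAVIOR_OFFSETS`, `minimal_rotation`, and `select_action` are hypothetical, and the convention that headings are measured in degrees with the robot's current heading defaulting to 0 is an assumption.

```python
import random

# Eight directional behaviors, expressed as offsets (degrees) from the
# direction of the target. The ordering is an assumption for illustration.
BEHAVIOR_OFFSETS = [0, 45, 90, 135, 180, 225, 270, 315]

def minimal_rotation(current_deg, desired_deg):
    """Signed rotation in [-180, 180) that turns current_deg into desired_deg."""
    return (desired_deg - current_deg + 180) % 360 - 180

def select_action(probabilities, target_bearing_deg,
                  current_heading_deg=0.0, rng=random):
    """Sample a behavior offset, then return (offset, rotation to execute)."""
    offset = rng.choices(BEHAVIOR_OFFSETS, weights=probabilities, k=1)[0]
    desired = target_bearing_deg + offset
    return offset, minimal_rotation(current_heading_deg, desired)
```

Wrapping the difference into [-180, 180) guarantees the robot never turns more than half a revolution, which is what "minimal rotation" is taken to mean here.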

In our implementation, rotations used wheel speeds of 500 mm/s (the maximum capability of a Create), with 0.25 s allowed per 45° traversed; forward motions used wheel speeds of 100 mm/s; and three seconds were allotted to the entire motion sequence. Additionally, to simplify the calculation of turn lengths, the target location is given a ±22.5° tolerance, resulting in a minor asymmetry when the target does not align with the cardinal or ordinal directions of the Create.
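The timing budget above can be worked through in a few lines. This is a sketch under two assumptions: that the forward motion fills whatever remains of the three-second segment after the turn, and that the ±22.5° tolerance amounts to snapping the target bearing to the nearest multiple of 45°. The function names are hypothetical.

```python
FORWARD_SPEED_MM_S = 100    # forward wheel speed from the text
SECONDS_PER_45_DEG = 0.25   # turn time allowed per 45 degrees
SEGMENT_SECONDS = 3.0       # total time per motion sequence

def quantize_bearing(bearing_deg):
    """Snap a bearing to the nearest multiple of 45 degrees, in [0, 360).

    Bearings exactly on a +/-22.5 degree boundary follow Python's
    round-half-to-even rule, which is an arbitrary choice here.
    """
    return (round(bearing_deg / 45.0) * 45) % 360

def plan_segment(rotation_deg):
    """Return (turn_time_s, forward_time_s, forward_distance_mm)."""
    steps = abs(rotation_deg) // 45              # number of 45-degree steps
    turn_time = steps * SECONDS_PER_45_DEG
    forward_time = SEGMENT_SECONDS - turn_time   # assumed: rest of the segment
    return turn_time, forward_time, forward_time * FORWARD_SPEED_MM_S
```

Under these assumptions, a full 180° turn leaves two seconds of forward motion (200 mm), while no turn leaves the full three seconds (300 mm), so behaviors that require larger turns cover less ground.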


With our behaviors now characterized as cardinal and ordinal movements with respect to the target, we must decide on cues that map from a world-state to a behavior. We have already established that a robot topped with a red ball is an aversive stimulus, and one topped with a blue ball is appetitive. In addition, it seems useful to characterize the distance of the target as near or far; an approach behavior is considerably more indicative of an imminent reward or punishment when the target is near than when it is far. Therefore, our cues consist of the pairs {Red, Blue} × {Near, Far}, represented by the digits 0 through 3. In our implementation, a stimulus is “near” if (and only if) its elevation relative to the camera is below −30°. This leads directly to a simple learning matrix L, where L[b,c] encodes the probability of executing behavior b in the presence of cue c. In addition to the 8 behaviors described above, we add a stopping behavior, in which the robot orients to the target and attempts to maintain an estimate of target location until the next behavior is selected.
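To make the cue encoding and the learning matrix concrete, here is a small sketch. The particular assignment of digits to cues, the uniform initialization, and the names `cue_index`, `uniform_learning_matrix`, and `sample_behavior` are all assumptions for illustration; only the 9 × 4 shape and the −30° threshold come from the description above.

```python
import random

COLORS = ("red", "blue")   # aversive, appetitive
N_BEHAVIORS = 9            # 8 directional behaviors + 1 stopping behavior
N_CUES = 4                 # {Red, Blue} x {Near, Far}

def cue_index(color, elevation_deg):
    """Encode (color, near/far) as a digit 0..3 (ordering is assumed)."""
    near = elevation_deg < -30.0   # "near" iff below -30 degrees elevation
    return 2 * COLORS.index(color) + (0 if near else 1)

def uniform_learning_matrix():
    """L[b][c]: probability of behavior b given cue c, initially uniform."""
    return [[1.0 / N_BEHAVIORS] * N_CUES for _ in range(N_BEHAVIORS)]

def sample_behavior(L, cue, rng=random):
    """Draw a behavior index from the column of L belonging to this cue."""
    weights = [L[b][cue] for b in range(N_BEHAVIORS)]
    return rng.choices(range(N_BEHAVIORS), weights=weights, k=1)[0]
```

Each column of L is a probability distribution over behaviors, so it is the columns, not the rows, that must sum to 1.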


[; \Delta P(b_n|c_n) = \gamma(c_n)e^{(N-n)\Delta t}P(b_n|c_n)\Delta t ;]
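One way to read this update numerically is sketched below. It applies the rule exactly as written to every (behavior, cue) pair visited in an episode, taking N to be the index of the last step and γ(c) to be a per-cue rate whose sign distinguishes reward from punishment; both readings are assumptions. The renormalization of each cue column and the small positive floor are also my additions, included only so that the columns remain valid probability distributions.

```python
import math

def apply_trace_update(L, history, gamma, dt):
    """Apply dP = gamma(c_n) * exp((N - n) * dt) * P(b_n|c_n) * dt in place.

    L       : list of rows, L[b][c] = P(b | c)
    history : list of (behavior, cue) pairs for steps n = 0..N
    gamma   : per-cue rates (assumed: sign encodes reward vs. punishment)
    dt      : duration of one time step
    """
    N = len(history) - 1
    for n, (b, c) in enumerate(history):
        delta = gamma[c] * math.exp((N - n) * dt) * L[b][c] * dt
        L[b][c] = max(L[b][c] + delta, 1e-12)   # floor is an assumption
    # Renormalize each cue column so it sums to 1 (an assumption).
    for c in range(len(L[0])):
        total = sum(row[c] for row in L)
        for row in L:
            row[c] /= total
    return L
```

Because the increment is proportional to the current probability, a behavior whose probability has reached zero can never recover under this rule, which is one reason for the floor.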


Training is a time-consuming process, so we produced a reduced simulation which assumes that the robot moves the same distance in each of the cardinal and ordinal directions, and that it can always move in one of those directions with respect to the target. These assumptions are not strictly true, but they allow us to reduce the problem to one dimension, with each action b(p) representing an update of the current distance to the target p. Table 2 shows the explicit forms of these equations, assuming that each behavior moves a distance x in some direction relative to the target.

We begin with an initial probability distribution [;P_0 (\cdot|p, c);] that depends on target distance p and color c. Let [;p_0;] be the initial position, and suppose we choose our units so that the radius of the robot is 1.

[;b_i \sim P(\cdot|p_i,c);]

[;p_{i+1} = b_i(p_i);]

[;T = \min \{ t : p_t < 1 \};]
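The iteration above can be sketched as a short simulation loop. Since the explicit forms of b(p) are not reproduced here, the sketch assumes plane geometry: moving a distance x at angle θ from the direction of the target changes the distance according to the law of cosines. It also omits the stopping behavior and takes T to be the first step at which the distance drops below the robot radius of 1. The names `move`, `simulate`, and `policy` are hypothetical.

```python
import math
import random

OFFSETS = [0, 45, 90, 135, 180, 225, 270, 315]  # degrees from target direction

def move(p, x, theta_deg):
    """Distance to target after moving x at angle theta (law of cosines)."""
    th = math.radians(theta_deg)
    return math.sqrt(max(p * p + x * x - 2.0 * p * x * math.cos(th), 0.0))

def simulate(p0, x, policy, max_steps=1000, rng=random):
    """Return T = first t with p_t < 1, or None if not reached in max_steps.

    policy(p) returns eight weights over OFFSETS for the current distance p;
    the cue color c is assumed fixed for the episode and folded into policy.
    """
    p = p0
    for t in range(max_steps + 1):
        if p < 1.0:
            return t
        theta = rng.choices(OFFSETS, weights=policy(p), k=1)[0]
        p = move(p, x, theta)
    return None
```

For example, a policy that always approaches (all weight on offset 0) starting from distance 4.5 with step length 1 reaches the target in four steps, while a policy that always retreats never does.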
