Motor Babbling

Teaching a simulated robot arm to reach arbitrary targets through reinforcement learning

Overview

For this project I built a reinforcement learning system that learns inverse kinematics from scratch. A simulated planar robot arm in MuJoCo gets a target position somewhere in its workspace, and has to figure out, on its own, how to actuate its joints to reach it. There is no analytical IK solver in the loop. The agent observes joint angles, velocities, accelerations, end effector position, and target position. It outputs joint torques. Reward is the negative Euclidean distance to the target. Everything else has to be learned.
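
Concretely, the agent's interface looks roughly like this (a minimal sketch with illustrative names, not the project's actual code):

```python
# Sketch of the observation and reward described above; names are illustrative.
import numpy as np

def observation(qpos, qvel, qacc, ee_pos, target_pos):
    # Joint angles, velocities, accelerations, end effector position,
    # and target position, concatenated into a single vector.
    return np.concatenate([qpos, qvel, qacc, ee_pos, target_pos])

def reward(ee_pos, target_pos):
    # Negative Euclidean distance between the end effector and the target.
    return -float(np.linalg.norm(np.asarray(ee_pos) - np.asarray(target_pos)))
```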

The point of the project was not to invent a new algorithm; it was to explore some common RL algorithms and measure them against each other.

The full source is available on GitHub.


Demo

The video above shows a demo in which the target is moved around in real time with the arrow keys. The trained policy continuously repositions the end effector to track the target.

The Challenge

Inverse kinematics for a two-link planar arm has a closed-form solution. The hard part of this project is that the agent never gets to see the kinematics: it only sees joint torques going out and a scalar reward coming back, and it has to discover, over millions of timesteps, the structure connecting the two.
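
For reference, the closed-form solution the agent never gets to see is the standard two-link result (a textbook derivation, not project code; l1 and l2 are the link lengths):

```python
import numpy as np

def two_link_ik(x, y, l1, l2, elbow_up=True):
    """Joint angles (theta1, theta2) that place the end effector at (x, y)."""
    # Law of cosines gives the elbow angle from the target distance.
    c2 = (x**2 + y**2 - l1**2 - l2**2) / (2 * l1 * l2)
    c2 = np.clip(c2, -1.0, 1.0)  # guard against targets outside the workspace
    s2 = np.sqrt(1.0 - c2**2) * (1.0 if elbow_up else -1.0)
    theta2 = np.arctan2(s2, c2)
    # Shoulder angle: bearing to the target minus the elbow's offset.
    theta1 = np.arctan2(y, x) - np.arctan2(l2 * s2, l1 + l2 * c2)
    return theta1, theta2
```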

Three things make this difficult.

  1. The action space is intrinsically continuous. RL methods that can only handle a discrete output space therefore require the continuous torques to be quantized into a fixed set of torque levels.

  2. The number of degrees of freedom. A 1-DOF arm has a single joint to coordinate, but a 2-DOF arm has two joints whose torques have to be applied together to produce a useful motion. The right torque on joint one depends on what joint two is doing, and vice versa. The agent has to learn the joint policy across both, and the difficulty grows quickly as joints are added.

  3. The target is randomized. Every episode the target appears at a different position in the workspace, so the agent cannot memorize a single trajectory. It has to learn a policy that generalizes across targets, mapping any given target position to the correct sequence of torques. This is the difference between learning one motion and learning to reach.


What I Built

I used the Reacher environment in MuJoCo. The arm is configurable in the number of links, so the same code runs as 1-DOF or 2-DOF. The base hangs from a fixed point, the arm swings in a vertical plane under gravity, and the target is a point in the workspace that the end effector has to reach and hold for 100 consecutive simulation steps. The full target task is a 2-DOF arm reaching to randomized targets.
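
The "reach and hold" condition amounts to a small counter in the episode loop; roughly the following (the threshold value and variable names are assumptions, not the project's settings):

```python
def reached_and_held(env, policy, max_steps=1000, hold_steps=100, threshold=0.05):
    """Return True if the end effector stays within `threshold` metres of the
    target for `hold_steps` consecutive simulation steps."""
    obs, _ = env.reset()
    held = 0
    for _ in range(max_steps):
        obs, rew, terminated, truncated, _ = env.step(policy(obs))
        # Reward is the negative distance, so -rew recovers the distance.
        held = held + 1 if -rew <= threshold else 0
        if held >= hold_steps:
            return True
        if terminated or truncated:
            break
    return False
```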

I started with a Deep Q-Network (DQN). It failed to learn the full 2-DOF task. To check whether DQN itself was capable of solving anything in this environment, I dropped the problem down to 1-DOF and a constant target position, where it converged cleanly. The problem was the discrete action space: DQN has to enumerate actions, and quantizing two continuous joints into 10 levels each gives 100 discrete actions, which was too many for DQN to handle on this task.
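
The quantization itself is straightforward; the problem is the size of the resulting action table. A sketch, with the torque range assumed:

```python
import itertools
import numpy as np

LEVELS_PER_JOINT = 10
TORQUE_LIMIT = 1.0  # assumed actuator limit

levels = np.linspace(-TORQUE_LIMIT, TORQUE_LIMIT, LEVELS_PER_JOINT)
# Every combination of per-joint torque levels becomes one discrete action:
# 10 levels x 2 joints -> 100 actions for the DQN to enumerate.
action_table = np.array(list(itertools.product(levels, repeat=2)))

def to_torques(action_index: int) -> np.ndarray:
    """Map the DQN's chosen index back to a torque vector for the environment."""
    return action_table[action_index]
```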

That pushed me to try Deep Deterministic Policy Gradient (DDPG), which works directly on continuous action spaces. DDPG learned faster than DQN on the simple 1-DOF task, but training was wildly unstable. Two runs with the same hyperparameters would diverge from each other, the reward curve swung erratically, and on formal evaluation it scored 0% because the result depended entirely on which episode training happened to end on. The algorithm worked in some sense but was not reliable enough to build on.

Finally I moved to Soft Actor Critic (SAC), which addresses DDPG’s stability issues through a stochastic policy, twin critics, and automatic entropy tuning. SAC trained stably on the 1-DOF task, so I scaled the problem back up step by step, carrying the trained checkpoint forward each time: 1-DOF fixed → 1-DOF random targets → 2-DOF random targets. On the full target task it hit roughly 80% success at a 10cm threshold (around 55% at 5cm). The policy reliably gets the end effector into the right region of the workspace, even if it does not always have the precision to settle within the tighter radius.
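
The 10cm and 5cm numbers come from scoring the same evaluation episodes against two distance thresholds; the bookkeeping is roughly the following (assumed structure, not the project's evaluation script):

```python
import numpy as np

def success_rates(episode_distances, thresholds=(0.10, 0.05)):
    """Fraction of evaluation episodes whose distance to the target (metres)
    fell within each threshold; how the per-episode distance is measured
    is an assumption here."""
    d = np.asarray(episode_distances)
    return {t: float((d <= t).mean()) for t in thresholds}
```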

The final results across the three algorithms:

  • DQN: 100% success on 1-DOF fixed target, 0% success on 2-DOF
  • DDPG: highly unstable, 0% success on formal evaluation
  • SAC: ~80% success on 2-DOF random target

Technical Stack

  • Physics simulation: MuJoCo
  • Environment API: Gymnasium
  • Neural networks: PyTorch (CPU only)
  • Algorithms: DQN, DDPG, SAC
  • Training script: single CLI with --agent flag and train/eval/play subcommands
  • Plotting and data: Matplotlib, NumPy
  • Language: Python
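
The training entry point is a single script; its shape is roughly the following (an argparse sketch, with the agent names and defaults assumed):

```python
import argparse

def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="motor-babbling")
    parser.add_argument("--agent", choices=["dqn", "ddpg", "sac"], default="sac",
                        help="which algorithm to use")
    sub = parser.add_subparsers(dest="command", required=True)
    sub.add_parser("train")  # run a training session
    sub.add_parser("eval")   # score a saved checkpoint over evaluation episodes
    sub.add_parser("play")   # drive the target interactively with the arrow keys
    return parser

if __name__ == "__main__":
    args = build_parser().parse_args()
    # dispatch on args.command and args.agent here
```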

Takeaways

The biggest thing I took from this project is that RL algorithms are actually pretty decent at learning this kind of kinematics. Even on a small arm, SAC was able to land the end effector in the right region of the workspace most of the time. That makes me curious to push the same setup further: more sophisticated algorithms, more complex robot forms (more joints, 3D motion, different morphologies entirely), and tasks beyond static reaching.


Jan 2025 – Apr 2025 · Final Project