LeRobot SO-101 Open-Source Arm
Building and training the flagship open-source SO-101 robotic arm in the Hugging Face LeRobot ecosystem using imitation learning.
Hardware Assembly & Calibration
The SO-101 platform uses a leader/follower dual-arm configuration designed for precise human-in-the-loop data collection. We fully assembled both arms. After overcoming some early hardware turbulence, including a burnt-out servo and an initial servo ID misconfiguration, the leader arm's teleoperation now maps cleanly onto the physical follower arm and passes all hardware calibration checks.
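The leader-to-follower mapping can be sketched as an offset-plus-clamp per joint. This is an illustrative stand-in, not the LeRobot calibration code; the joint names, limits, and offset values below are hypothetical placeholders.

```python
# Hypothetical per-joint (min, max) limits in degrees for the follower arm.
FOLLOWER_LIMITS = {
    "shoulder_pan": (-110.0, 110.0),
    "shoulder_lift": (-100.0, 100.0),
    "elbow_flex": (-100.0, 90.0),
    "wrist_flex": (-95.0, 95.0),
    "wrist_roll": (-160.0, 160.0),
    "gripper": (0.0, 100.0),
}

# Offsets would normally be filled in by the calibration procedure.
CALIB_OFFSETS = {name: 0.0 for name in FOLLOWER_LIMITS}

def map_leader_to_follower(leader_pos: dict) -> dict:
    """Apply calibration offsets, then clamp each joint to the follower's range."""
    target = {}
    for joint, value in leader_pos.items():
        lo, hi = FOLLOWER_LIMITS[joint]
        target[joint] = min(max(value + CALIB_OFFSETS[joint], lo), hi)
    return target

# Example: an out-of-range leader reading is clamped instead of
# commanding the follower past its mechanical limit.
leader = {j: 0.0 for j in FOLLOWER_LIMITS}
leader["elbow_flex"] = 120.0
cmd = map_leader_to_follower(leader)
print(cmd["elbow_flex"])  # 90.0
```

Clamping at the mapping layer is a cheap safety net: a miscalibrated or jolted leader arm can never drive the follower past its limits.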
Edge Computing & Teleoperation
To keep the robotic platform self-contained and fully embedded, we moved the primary teleoperation and inference stack onto an NVIDIA Jetson flashed with JetPack 6.2. Running the LeRobot environment natively, the Jetson handles real-time synchronization across our multi-camera setup (side camera plus gripper camera) while streaming joint-state telemetry directly into our dataset directories.
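The multi-camera synchronization task can be illustrated with a simple nearest-timestamp pairing. This is a self-contained sketch, not LeRobot's actual implementation; the frame rates and tolerance value are assumptions.

```python
from bisect import bisect_left

def nearest_frame(timestamps, t):
    """Index of the timestamp closest to t (timestamps sorted ascending)."""
    i = bisect_left(timestamps, t)
    if i == 0:
        return 0
    if i == len(timestamps):
        return len(timestamps) - 1
    return i if timestamps[i] - t < t - timestamps[i - 1] else i - 1

def sync_frames(side_ts, grip_ts, tol=0.02):
    """Pair each side-camera frame with the nearest grip-camera frame,
    dropping pairs whose timestamps differ by more than tol seconds."""
    pairs = []
    for i, t in enumerate(side_ts):
        j = nearest_frame(grip_ts, t)
        if abs(grip_ts[j] - t) <= tol:
            pairs.append((i, j))
    return pairs

# 30 FPS side camera vs. a grip camera offset by 5 ms
side = [k / 30 for k in range(5)]
grip = [k / 30 + 0.005 for k in range(5)]
print(sync_frames(side, grip))  # [(0, 0), (1, 1), (2, 2), (3, 3), (4, 4)]
```

Dropping pairs outside the tolerance matters for dataset quality: a stale frame from one camera silently desynchronizes the observation from the joint state it is supposed to describe.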
Early Iterations & Lessons Learned
Our initial behavior cloning trials focused on a basic insertion task: placing a block precisely within taped boundaries. We collected 100 teleoperated episodes, but early model training yielded highly erratic performance. The policy suffered from severe overfitting and out-of-distribution (OOD) freezing: faced with even a slightly novel state, it simply did not know what to do. Having verified that our hardware calibration was solid, we concluded the root cause was a non-diverse dataset. After aggressively expanding our training sweeps and injecting variability into the demonstrations, we achieved successful inference. The arm can now autonomously complete the insertion task using an ACT architecture!
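At inference time, ACT predicts a chunk of future actions at every step and ensembles the overlapping predictions from previous steps. The control loop below mimics that idea with a stand-in scalar policy; the chunk size, decay constant, and weighting direction are simplified relative to ACT's actual temporal-ensembling scheme.

```python
import math

CHUNK = 4   # actions predicted per policy call (ACT's action chunking)
M = 0.1     # exponential decay for the ensemble weights; illustrative value

def fake_policy(t):
    """Stand-in for the trained ACT policy: predicts the next CHUNK scalar
    actions from timestep t (a real policy consumes images + joint states)."""
    return [float(t + i) for i in range(CHUNK)]

def run_episode(steps):
    chunks = []     # chunks[s] is the chunk predicted at timestep s
    executed = []
    for t in range(steps):
        chunks.append(fake_policy(t))
        # Ensemble every still-valid prediction for the current timestep t,
        # down-weighting predictions by their age: w = exp(-M * age).
        num = den = 0.0
        for s, chunk in enumerate(chunks):
            age = t - s
            if age < CHUNK:
                w = math.exp(-M * age)
                num += w * chunk[age]
                den += w
        executed.append(num / den)
    return executed

print(run_episode(6))
```

Because the stand-in policy is perfectly consistent, every overlapping prediction for timestep t agrees, and the ensembled action equals t. With a real learned policy the predictions disagree, and the ensembling is what smooths out per-step jitter.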
Current Bottlenecks & Next Steps
With a proven data pipeline, our current efforts focus on unlocking faster onboard training and experimenting with deeper imitation models:
- On-Device Training Optimizations: Running ACT training directly on the Jetson is currently severely bottlenecked; even a 1,000-step run progresses very slowly. We need to establish optimal training settings for the edge hardware.
- Triangulating Vision Data: To reduce future out-of-distribution failures, we are physically setting up a third camera angle and expanding our data collection to ensure high dataset diversity.
- Transitioning to SmolVLA: With ACT working, we are evaluating Hugging Face's SmolVLA architecture as a path to better generalization.
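A quick way to quantify the on-device bottleneck before committing to a long run is to profile steps per second and project the total wall-clock time. This is a generic sketch with a dummy step function; in practice the step would be one real ACT gradient update on the Jetson.

```python
import time

def profile_steps(step_fn, warmup=3, iters=10):
    """Time a training step after a short warmup and return steps/sec."""
    for _ in range(warmup):
        step_fn()
    start = time.perf_counter()
    for _ in range(iters):
        step_fn()
    elapsed = time.perf_counter() - start
    return iters / elapsed

def dummy_step():
    # Stand-in for one gradient step (forward + backward + optimizer).
    total = 0
    for i in range(50_000):
        total += i * i
    return total

sps = profile_steps(dummy_step)
print(f"{sps:.1f} steps/s -> 1000 steps would take about {1000 / sps / 60:.1f} min")
```

Projecting from a short profile makes the cost of a full training run visible up front, so configuration changes (batch size, image resolution, mixed precision) can be compared in seconds instead of hours.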