Collecting Teleoperation Data for Imitation Learning on a Consumer GPU with ACT
Imitation learning turns human demonstrations into robot policies. You teleoperate the robot through a task a few dozen times, record what it saw and what it did, and train a model to reproduce the behaviour. The appealing part for most labs: with the right policy, the whole loop — collection and training — fits on a single consumer GPU, no cluster required.
ACT (Action Chunking with Transformers) is the policy that made this practical. It is small, sample-efficient, and trains comfortably on a 24 GB card. This guide is a complete walkthrough: what to record, how to collect clean teleoperation data on the Prometheus humanoid, how to structure the dataset, the hyperparameters that matter, how to evaluate, and how to deploy — all on a single workstation.
Why ACT is the right consumer-GPU baseline
ACT was introduced with the ALOHA project (Zhao et al., 2023) to learn fine-grained bimanual manipulation from low-cost hardware. Three design choices make it both stable and cheap to train:
- Action chunking. Instead of predicting one action at a time, ACT predicts a chunk of the next k actions (commonly k ≈ 100, i.e. ~2–3 seconds at 30–50 Hz). Predicting open-loop chunks shortens the effective decision horizon and is the single biggest reason ACT resists the compounding error that wrecks naive behaviour cloning.
- Temporal ensembling. At inference, the policy is queried every step, producing overlapping chunks. Their predictions for the same timestep are averaged with an exponential weight, yielding smooth, jitter-free motion without sacrificing reactivity.
- A conditional VAE. Human demonstrations are multimodal — there are many valid ways to grasp a cup. ACT models that with a CVAE: a latent “style” variable absorbs the variation so the Transformer decoder isn’t forced to average conflicting demos into a mushy mean.
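The temporal ensembling step described above can be sketched in a few lines of NumPy. This is a minimal illustration rather than the reference implementation: it assumes the weighting w_i = exp(-m * i) with w_0 applied to the oldest prediction, and the function name `ensemble_action`, the `chunks` dictionary layout, and the default `m = 0.01` are choices made for this example.

```python
import numpy as np


def ensemble_action(chunks, t, m=0.01):
    """Average every prediction made for timestep t across overlapping chunks.

    chunks: dict mapping the step at which a chunk was predicted (int)
            to an array of shape (k, action_dim) covering steps s .. s+k-1.
    t:      the timestep to execute now.
    m:      decay rate (assumed value); a larger m puts more weight on the
            oldest prediction, a smaller m lets new predictions count sooner.
    """
    preds = []
    for start in sorted(chunks):  # oldest chunk first
        chunk = np.asarray(chunks[start], dtype=np.float64)
        if start <= t < start + chunk.shape[0]:
            preds.append(chunk[t - start])
    if not preds:
        raise ValueError(f"no chunk covers timestep {t}")

    preds = np.stack(preds)                     # (n_overlapping, action_dim)
    weights = np.exp(-m * np.arange(len(preds)))  # w_0 is the oldest prediction
    weights /= weights.sum()
    return (weights[:, None] * preds).sum(axis=0)


# Two overlapping chunks of length 4 for a 2-DoF action
chunks = {0: np.zeros((4, 2)), 1: np.ones((4, 2))}
action = ensemble_action(chunks, t=1)
```

With `m = 0` every overlapping prediction counts equally. Raising `m` makes the executed action lean on earlier predictions, which is smoother but slower to react.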
The model is a compact Transformer encoder–decoder on the order of tens of millions of parameters. In practice you can train one task on a single RTX 3090/4090 in a few hours from roughly 50 demonstrations. That sample efficiency is exactly why ACT remains the first thing to try before reaching for a large vision-language-action model.
What you actually record
An imitation-learning episode is a synchronized stream of observations and actions sampled at a fixed rate. On Prometheus that maps cleanly onto the onboard sensing and control interface.
Observation space
- Vision. The head-mounted stereo pair for full-scene context, plus the wrist cameras for close-in manipulation detail. The wrist mounting angle is adjustable — pick an angle that keeps the gripper and target in frame, and then never change it during a dataset.
- Proprioception. Joint positions for the arms, grippers, and any active modules. The upper body exposes 19 DoF, rising to 43 with five-finger hands and up to 53 with legs — record every joint the policy is allowed to move.
Action space
The cleanest, most transferable choice is target joint positions (absolute or delta) at the control rate — the same commands you send while teleoperating. Position targets are forgiving: the onboard controllers handle the low-level tracking, and the policy only has to decide where to go, not how to drive each motor.
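To make the absolute versus delta distinction concrete, here is a small sketch that converts between the two. It assumes a delta action is defined as the target joint position minus the joint position measured at the same step; other conventions exist (for example, deltas relative to the previous target), and the function names are hypothetical.

```python
import numpy as np


def absolute_to_delta(target_positions, measured_positions):
    """Delta action = commanded target minus the joint position measured at that step."""
    target = np.asarray(target_positions, dtype=np.float64)
    measured = np.asarray(measured_positions, dtype=np.float64)
    if target.shape != measured.shape:
        raise ValueError("targets and measurements must have the same shape")
    return target - measured


def delta_to_absolute(deltas, measured_positions):
    """Recover the absolute target the controller should track."""
    return np.asarray(measured_positions, dtype=np.float64) + np.asarray(
        deltas, dtype=np.float64
    )


# Two timesteps of a 2-joint arm (radians)
measured = np.array([[0.0, 1.0], [0.1, 1.1]])
targets = np.array([[0.1, 1.1], [0.2, 1.0]])
deltas = absolute_to_delta(targets, measured)
```

Whichever convention you choose, it has to be the same at recording time, training time, and deployment.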
Sample rate and sync are not optional details. Pick a fixed rate (30 Hz is a good default; 50 Hz for fast, contact-rich tasks) and make sure every camera frame and joint reading shares a timestamp. Misaligned images and actions are the most common silent cause of a policy that “almost works” but never quite grasps.
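A quick way to catch misalignment before training is to match every camera frame to its nearest joint reading and flag pairs that are too far apart. The sketch below assumes both timestamp arrays are in seconds and sorted in ascending order, and that there are at least two joint readings; the function name and the tolerance are illustrative.

```python
import numpy as np


def match_nearest(frame_ts, joint_ts, tol):
    """For each frame timestamp, return the index of the nearest joint reading
    and a boolean mask saying whether the gap is within `tol` seconds.

    Assumes joint_ts is sorted ascending and has at least two entries.
    """
    frame_ts = np.asarray(frame_ts, dtype=np.float64)
    joint_ts = np.asarray(joint_ts, dtype=np.float64)

    # Index of the first joint reading at or after each frame
    idx = np.searchsorted(joint_ts, frame_ts)
    idx = np.clip(idx, 1, len(joint_ts) - 1)

    # Pick whichever neighbour (left or right) is closer in time
    left, right = joint_ts[idx - 1], joint_ts[idx]
    idx = np.where(frame_ts - left <= right - frame_ts, idx - 1, idx)

    gap = np.abs(joint_ts[idx] - frame_ts)
    return idx, gap <= tol


# Joint readings at 50 Hz, three camera frames, 10 ms tolerance
joint_ts = [0.00, 0.02, 0.04, 0.06]
frame_ts = [0.001, 0.033, 0.2]
idx, ok = match_nearest(frame_ts, joint_ts, tol=0.01)
bad_frames = np.flatnonzero(~ok)  # frames with no joint reading close enough
```

An episode with more than a handful of flagged frames is usually worth re-recording rather than repairing.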
Setting up teleoperation on Prometheus
Prometheus ships with the teleoperation pipeline out of the box, so there is nothing to build before you can record:
- VR teleop with a Meta Quest 3S — the most intuitive option for full-arm reaching and grasping, and the most natural way to demonstrate manipulation with human-like motion.
- Leader–follower or the manual controller for precise, repeatable trajectories where you want tight control over the path.
Each recorded run becomes one episode. Store episodes in the LeRobot dataset format — the de-facto standard for ACT and most open imitation-learning stacks — so the same data feeds any policy you try later, including a VLA.
# Illustrative — record teleop episodes into a LeRobot dataset
python -m lerobot.record \
--robot.type=prometheus \
--teleop.type=quest3s \
--dataset.repo_id=yourlab/pick_place_cube \
--dataset.num_episodes=50 \
--dataset.fps=30
Designing a dataset that actually trains
This is where most projects succeed or fail, long before any GPU is involved.
- ~50 demos per task is a sensible starting point for ACT. Add more only if evaluation says you need it.
- Vary what should generalize. Deliberately spread object positions, orientations, and lighting across the demos. If every cube starts in the same spot, the policy memorizes the spot, not the task.
- Keep fixed what should not change. Camera framing, wrist angle, table height, and frame rate stay identical across the whole dataset.
- Include recoveries. Let the gripper miss and then correct, on purpose, in a fraction of the demos. A policy that has only seen flawless runs freezes the moment it drifts off the demonstrated trajectory; one that has seen recoveries knows how to get back on track.
- One skill per dataset. Keep tasks separable. “Pick and place a cube” and “open a drawer” are two datasets, not one blurry mixture.
Quality beats quantity, decisively: 50 clean, varied demonstrations train a better ACT policy than 200 sloppy, near-identical ones.
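One way to enforce this variation is to generate a collection plan before recording, so the operator places the object where the plan says instead of where habit dictates. The sketch below is purely illustrative: the workspace ranges, the 20% recovery fraction, and the field names are assumptions to adapt to your own table and task.

```python
import random


def make_collection_plan(num_episodes=50, recovery_fraction=0.2, seed=0):
    """Return one dict per episode with a randomized start pose and a flag
    marking which demos should include a deliberate miss-and-recover."""
    rng = random.Random(seed)

    # Decide up front which episodes contain a recovery, then shuffle
    n_recovery = round(num_episodes * recovery_fraction)
    flags = [True] * n_recovery + [False] * (num_episodes - n_recovery)
    rng.shuffle(flags)

    plan = []
    for episode, include_recovery in enumerate(flags):
        plan.append(
            {
                "episode": episode,
                # Hypothetical workspace bounds, in metres from the robot base
                "cube_x_m": round(rng.uniform(-0.15, 0.15), 3),
                "cube_y_m": round(rng.uniform(0.30, 0.50), 3),
                "cube_yaw_deg": round(rng.uniform(-90.0, 90.0), 1),
                "include_recovery": include_recovery,
            }
        )
    return plan


plan = make_collection_plan()
for row in plan[:3]:
    print(row)
```

Fixing the seed keeps the plan reproducible, which helps when you later need to add demos that cover the same distribution.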
How the dataset is structured
A LeRobot dataset is a set of episodes, each a table of synchronized frames: image tensors per camera, a state vector (joint positions), an action vector (target joints), plus timestamps and episode boundaries. Alongside it, the format stores normalization statistics: per-dimension mean and standard deviation for states and actions. ACT relies on these stats to standardize its inputs and outputs; if they are wrong (for example, computed on a different robot or for a different action space), training silently underperforms. Recompute them whenever your action space changes.
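The role of the normalization statistics can be shown with plain NumPy. This is a sketch of the idea, not the LeRobot implementation: the function names and the `eps` floor are assumptions, and the stats are computed over all frames stacked into one (num_frames, dim) array.

```python
import numpy as np


def compute_stats(x, eps=1e-8):
    """Per-dimension mean and std over all frames; std is floored at eps so
    joints that never move do not cause a division by zero."""
    x = np.asarray(x, dtype=np.float64)
    return {"mean": x.mean(axis=0), "std": np.maximum(x.std(axis=0), eps)}


def normalize(x, stats):
    """Map raw states or actions to zero mean, unit variance per dimension."""
    return (np.asarray(x, dtype=np.float64) - stats["mean"]) / stats["std"]


def unnormalize(z, stats):
    """Map policy outputs back to joint units before sending them to the robot."""
    return np.asarray(z, dtype=np.float64) * stats["std"] + stats["mean"]


# Two frames of a 2-joint action vector
actions = np.array([[0.0, 10.0], [2.0, 30.0]])
stats = compute_stats(actions)
z = normalize(actions, stats)
restored = unnormalize(z, stats)
```

If the stats used at deployment differ from those used in training, `unnormalize` maps the policy's outputs to the wrong joint targets, which is how a stale stats file turns into a systematic offset on the robot.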
Training ACT on one GPU
Once the dataset exists, training is a single command. ACT reads the multi-camera images and the state vector, and learns to output action chunks:
# Illustrative — train ACT from the recorded dataset
python -m lerobot.train \
--policy.type=act \
--dataset.repo_id=yourlab/pick_place_cube \
--policy.chunk_size=100 \
--policy.n_action_steps=100 \
--batch_size=8 \
--steps=100000 \
--output_dir=outputs/act_pick_place
The hyperparameters worth understanding:
- Chunk size (k). Longer chunks give smoother, more committed motion but react more slowly to surprises. 100 at 30 Hz is a strong default; shorten it for tasks that need fast corrections.
- KL weight. Balances the CVAE latent against reconstruction. Too high and the policy ignores the latent and averages demos; too low and it overfits demonstrator quirks.
- Image resolution and batch size. The two main VRAM knobs. On a 24 GB card the settings above fit comfortably; on 12 GB, drop resolution and batch size and train a bit longer.
- Steps. ~100k is typical for a single task; watch the validation loss rather than fixing a number.
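To see what the KL weight actually trades off, here is a NumPy sketch of an ACT-style training objective: an L1 reconstruction term on the action chunk plus a weighted KL term that pulls the CVAE posterior toward a standard normal prior. The function name, the array shapes, and the value `kl_weight=10.0` are assumptions for illustration; check your training config for the value it really uses.

```python
import numpy as np


def act_style_loss(pred_actions, target_actions, mu, logvar, kl_weight=10.0):
    """L1 reconstruction + kl_weight * KL(q(z | demo) || N(0, I)).

    pred_actions, target_actions: (batch, chunk, action_dim)
    mu, logvar:                   (batch, latent_dim) posterior parameters
    """
    pred = np.asarray(pred_actions, dtype=np.float64)
    target = np.asarray(target_actions, dtype=np.float64)
    mu = np.asarray(mu, dtype=np.float64)
    logvar = np.asarray(logvar, dtype=np.float64)

    # Mean absolute error over every action in every chunk
    l1 = np.abs(pred - target).mean()

    # Closed-form KL of a diagonal Gaussian against N(0, I),
    # summed over latent dimensions, averaged over the batch
    kl = (-0.5 * (1.0 + logvar - mu**2 - np.exp(logvar)).sum(axis=-1)).mean()

    return l1 + kl_weight * kl, l1, kl


rng = np.random.default_rng(0)
pred = rng.normal(size=(8, 100, 19))
target = rng.normal(size=(8, 100, 19))
mu = 0.1 * rng.normal(size=(8, 32))
logvar = np.zeros((8, 32))
total, l1, kl = act_style_loss(pred, target, mu, logvar)
```

When the posterior already matches the prior (`mu = 0`, `logvar = 0`) the KL term vanishes and only reconstruction remains. That is the regime a very high KL weight pushes the model toward.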
On a 24 GB consumer card this finishes in a few hours. No multi-node setup, no rented A100s — which is the whole point.
Evaluate in simulation first
Before touching hardware, load the policy against the bundled simulator and the included URDF. Run dozens of rollouts with randomized object placement and measure a real success rate, not just training loss — a low loss with a 10% success rate usually means a data problem (insufficient variation, misaligned timestamps, or wrong normalization). This is also where you tune temporal ensembling: more aggressive averaging for smoothness, less for reactivity.
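A few dozen rollouts give a noisy estimate, so it helps to report the success rate with a confidence interval rather than as a bare percentage. The sketch below uses the Wilson score interval, a standard choice for small binomial samples; the function name and the rollout counts are illustrative.

```python
import math


def wilson_interval(successes, trials, z=1.96):
    """Wilson score interval for a binomial success rate (z=1.96 is ~95%)."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    if not 0 <= successes <= trials:
        raise ValueError("successes must be between 0 and trials")

    p = successes / trials
    denom = 1.0 + z * z / trials
    center = (p + z * z / (2.0 * trials)) / denom
    half = z * math.sqrt(p * (1.0 - p) / trials + z * z / (4.0 * trials**2)) / denom
    return max(0.0, center - half), min(1.0, center + half)


# Hypothetical result: 40 successful rollouts out of 50
low, high = wilson_interval(40, 50)
print(f"success rate 80% (95% CI {low:.0%} to {high:.0%})")
```

If two checkpoints have overlapping intervals, run more rollouts before concluding that one is better than the other.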
Deploying the policy on the robot
Inference is light. ACT runs in real time on the onboard compute (Raspberry Pi 5 / NVIDIA Jetson) or a tethered laptop, streaming action chunks to the robot through the Prometheus SDK and its simple REST API. Because you validated against the simulator and URDF first, the sim-to-real gap stays small and the first hardware rollout is rarely a surprise.
Common failure modes — and the fix
- The policy freezes or idles. Almost always missing recovery data, or a distribution shift from a changed camera angle. Add recoveries; restore the exact framing.
- Jittery motion. Average more aggressively in temporal ensembling, or check that the control rate at inference matches the rate the data was recorded at.
- Works in sim, fails on hardware. Check lighting and camera exposure differences and confirm the control rate is identical to data collection.
- Grasps the right spot but at the wrong height. Usually stale or mismatched normalization statistics — recompute them.
When to graduate to a VLA
ACT is excellent at a specific task it has been shown. The moment you need a policy that follows language instructions, or generalizes across objects and scenes it never saw demonstrated, you want a pretrained vision-language-action model that you fine-tune on this same teleoperation data. The dataset you collected here transfers directly — same robot, same format, a far more capable result.
That is the subject of the companion guide on fine-tuning π0.7.
Run this on a real humanoid
Prometheus ships with the teleoperation pipeline, stereo + wrist cameras, URDF, simulator, and SDK you need to start collecting data on day one.