
Collecting Teleoperation Data for Imitation Learning on a Consumer GPU with ACT

June 9, 2026 · Prometheus Robotics

Imitation learning turns human demonstrations into robot policies. You teleoperate the robot through a task a few dozen times, record what it saw and what it did, and train a model to reproduce the behaviour. The appealing part for most labs: with the right policy, the whole loop — collection and training — fits on a single consumer GPU, no cluster required.

ACT (Action Chunking with Transformers) is the policy that made this practical. It is small, sample-efficient, and trains comfortably on a 24 GB card. This guide is a complete walkthrough: what to record, how to collect clean teleoperation data on the Prometheus humanoid, how to structure the dataset, the hyperparameters that matter, how to evaluate, and how to deploy — all on a single workstation.

Why ACT is the right consumer-GPU baseline

ACT was introduced with the ALOHA project (Zhao et al., 2023) to learn fine-grained bimanual manipulation from low-cost hardware. Three design choices make it both stable and cheap to train:

The model is a compact Transformer encoder–decoder on the order of tens of millions of parameters. In practice you can train one task on a single RTX 3090/4090 in a few hours from roughly 50 demonstrations. That sample efficiency is exactly why ACT remains the first thing to try before reaching for a large vision-language-action model.

What you actually record

An imitation-learning episode is a synchronized stream of observations and actions sampled at a fixed rate. On Prometheus that maps cleanly onto the onboard sensing and control interface.

Observation space

Action space

The cleanest, most transferable choice is target joint positions (absolute or delta) at the control rate — the same commands you send while teleoperating. Position targets are forgiving: the onboard controllers handle the low-level tracking, and the policy only has to decide where to go, not how to drive each motor.
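
To make the absolute-versus-delta distinction concrete, here is a minimal sketch in plain Python. It assumes a delta means the change from the previous commanded target (some stacks instead define it relative to the measured joint position). The names `to_delta_actions` and `to_absolute_actions` are illustrative and are not part of the Prometheus SDK or LeRobot.

```python
from typing import List, Sequence


def to_delta_actions(
    targets: Sequence[Sequence[float]], start: Sequence[float]
) -> List[List[float]]:
    """Convert absolute joint targets to per-step deltas.

    Assumption: each delta is measured from the previous commanded target,
    with `start` as the joint configuration before the first command.
    """
    deltas = []
    prev = list(start)
    for target in targets:
        deltas.append([t - p for t, p in zip(target, prev)])
        prev = list(target)
    return deltas


def to_absolute_actions(
    deltas: Sequence[Sequence[float]], start: Sequence[float]
) -> List[List[float]]:
    """Invert to_delta_actions by accumulating deltas from `start`."""
    absolute = []
    prev = list(start)
    for delta in deltas:
        prev = [p + d for p, d in zip(prev, delta)]
        absolute.append(prev)
    return absolute
```

Whichever convention you pick, record it alongside the dataset: normalization statistics computed for absolute targets are not valid for deltas, and vice versa.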

Sample rate and sync are not optional details. Pick a fixed rate (30 Hz is a good default; 50 Hz for fast, contact-rich tasks) and make sure every camera frame and joint reading shares a timestamp. Misaligned images and actions are the most common silent cause of a policy that “almost works” but never quite grasps.
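
One way to catch misalignment before training is to check, for each camera frame, how far away the nearest joint reading is. The sketch below assumes you can export two lists of timestamps in seconds on a shared clock. The function names, the half-frame-period tolerance, and the sample values are illustrative assumptions, not part of any recorder.

```python
from bisect import bisect_left
from typing import List, Sequence


def nearest_offsets(
    frame_ts: Sequence[float], joint_ts: Sequence[float]
) -> List[float]:
    """Absolute gap (seconds) from each frame to its nearest joint reading.

    Assumes joint_ts is sorted ascending and non-empty.
    """
    offsets = []
    for t in frame_ts:
        i = bisect_left(joint_ts, t)
        # Only the readings just before and just after t can be nearest.
        candidates = joint_ts[max(i - 1, 0): i + 1]
        offsets.append(min(abs(t - c) for c in candidates))
    return offsets


def check_sync(
    frame_ts: Sequence[float],
    joint_ts: Sequence[float],
    fps: float = 30.0,
    tolerance_fraction: float = 0.5,
) -> List[int]:
    """Return indices of frames whose nearest joint reading is too far away.

    The tolerance is a fraction of one frame period (hypothetical default:
    half a period, about 16.7 ms at 30 Hz).
    """
    tolerance = tolerance_fraction / fps
    gaps = nearest_offsets(frame_ts, joint_ts)
    return [i for i, gap in enumerate(gaps) if gap > tolerance]


# Example: the third frame has no joint reading within half a frame period.
frames = [0.0, 1 / 30, 2 / 30]
joints = [0.001, 0.034, 0.1]
bad_frames = check_sync(frames, joints)
```

Running a check like this on every episode right after recording is cheaper than discovering the problem through a policy that hovers next to the object.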

Setting up teleoperation on Prometheus

Prometheus ships with the teleoperation pipeline out of the box, so there is nothing to build before you can record:

Each recorded run becomes one episode. Store episodes in the LeRobot dataset format, which is widely used across open imitation-learning stacks and is what LeRobot's ACT implementation reads, so the same data feeds any policy you try later, including a VLA.

# Illustrative — record teleop episodes into a LeRobot dataset
python -m lerobot.record \
    --robot.type=prometheus \
    --teleop.type=quest3s \
    --dataset.repo_id=yourlab/pick_place_cube \
    --dataset.num_episodes=50 \
    --dataset.fps=30

Designing a dataset that actually trains

This is where most projects succeed or fail, long before any GPU is involved.

Quality beats quantity, decisively: 50 clean, varied demonstrations train a better ACT policy than 200 sloppy, near-identical ones.

How the dataset is structured

A LeRobot dataset is a set of episodes, each a table of synchronized frames: image tensors per camera, a state vector (joint positions), an action vector (target joints), plus timestamps and episode boundaries. Alongside it, the format stores normalization statistics: per-dimension mean and standard deviation for states and actions. ACT relies on these statistics to standardize inputs and outputs; if they are wrong (for example, computed over a different robot or action convention), the policy trains on badly scaled values and underperforms without raising any error. Recompute them whenever your action space changes.
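
The role of the statistics can be sketched in a few lines of plain Python. This is not LeRobot's implementation: `compute_stats`, `normalize`, and `unnormalize` are hypothetical helpers, and the small `eps` is an assumed guard for dimensions that never move.

```python
import math
from typing import Dict, List, Sequence


def compute_stats(rows: Sequence[Sequence[float]]) -> Dict[str, List[float]]:
    """Per-dimension mean and population std over all frames in `rows`."""
    n = len(rows)
    dims = len(rows[0])
    mean = [sum(r[d] for r in rows) / n for d in range(dims)]
    std = [
        math.sqrt(sum((r[d] - mean[d]) ** 2 for r in rows) / n)
        for d in range(dims)
    ]
    return {"mean": mean, "std": std}


def normalize(
    x: Sequence[float], stats: Dict[str, List[float]], eps: float = 1e-8
) -> List[float]:
    """Map a raw state or action vector to zero-mean, unit-scale values."""
    return [(v - m) / (s + eps) for v, m, s in zip(x, stats["mean"], stats["std"])]


def unnormalize(
    z: Sequence[float], stats: Dict[str, List[float]], eps: float = 1e-8
) -> List[float]:
    """Map a normalized policy output back to joint units."""
    return [v * (s + eps) + m for v, m, s in zip(z, stats["mean"], stats["std"])]
```

The policy only ever sees normalized values, so statistics taken from a different robot or action convention shift and rescale every target it learns, which is why the mismatch shows up as poor behaviour and not as an error.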

Training ACT on one GPU

Once the dataset exists, training is a single command. ACT reads the multi-camera images and the state vector, and learns to output action chunks:

# Illustrative — train ACT from the recorded dataset
python -m lerobot.train \
    --policy.type=act \
    --dataset.repo_id=yourlab/pick_place_cube \
    --policy.chunk_size=100 \
    --policy.n_action_steps=100 \
    --batch_size=8 \
    --steps=100000 \
    --output_dir=outputs/act_pick_place

The hyperparameters worth understanding:

On a 24 GB consumer card this finishes in a few hours. No multi-node setup, no rented A100s — which is the whole point.
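
A little arithmetic shows what the chunk flags in the training command imply at run time. The sketch assumes the usual ACT semantics, in which the policy predicts `chunk_size` future actions and executes `n_action_steps` of them before predicting again. `chunk_schedule` is a hypothetical helper, not a LeRobot function.

```python
import math
from typing import Dict


def chunk_schedule(
    chunk_size: int, n_action_steps: int, fps: float, episode_seconds: float
) -> Dict[str, float]:
    """Summarize how often the policy is queried during one episode."""
    if not 1 <= n_action_steps <= chunk_size:
        raise ValueError("n_action_steps must be between 1 and chunk_size")
    total_steps = math.ceil(episode_seconds * fps)
    return {
        # How far into the future one predicted chunk reaches.
        "chunk_horizon_s": chunk_size / fps,
        # How long the robot runs open-loop between predictions.
        "replan_interval_s": n_action_steps / fps,
        # Number of forward passes needed to cover the episode.
        "policy_calls": math.ceil(total_steps / n_action_steps),
    }


# Values from the command above, over a 20-second episode at 30 Hz.
schedule = chunk_schedule(chunk_size=100, n_action_steps=100, fps=30, episode_seconds=20)
```

At 30 Hz a chunk of 100 covers about 3.3 seconds, so with `n_action_steps=100` the robot executes that long between predictions. Lowering `n_action_steps` makes the policy more reactive at the cost of more forward passes.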

Evaluate in simulation first

Before touching hardware, load the policy against the bundled simulator and the included URDF. Run dozens of rollouts with randomized object placement and measure a real success rate, not just training loss — a low loss with a 10% success rate usually means a data problem (insufficient variation, misaligned timestamps, or wrong normalization). This is also where you tune temporal ensembling: more aggressive averaging for smoothness, less for reactivity.
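
Temporal ensembling, as described in the ACT paper, averages the overlapping predictions that several past chunks made for the current timestep, using weights w_i = exp(-m * i) where i = 0 is the oldest prediction. A minimal sketch follows; `ensemble_action` and its default `m` are illustrative choices, not a library API.

```python
import math
from typing import List, Sequence


def ensemble_action(
    predictions: Sequence[Sequence[float]], m: float = 0.01
) -> List[float]:
    """Blend several predictions for the same timestep into one action.

    `predictions` must be ordered oldest first: predictions[0] comes from
    the earliest chunk that still covers the current timestep.
    """
    weights = [math.exp(-m * i) for i in range(len(predictions))]
    total = sum(weights)
    dims = len(predictions[0])
    return [
        sum(w * p[d] for w, p in zip(weights, predictions)) / total
        for d in range(dims)
    ]
```

With `m = 0` every overlapping prediction counts equally; increasing `m` shifts weight toward older predictions, which smooths motion but slows the response to new observations. If you use LeRobot's ACT implementation, check its configuration before tuning: in the versions we are aware of, ensembling is set through `temporal_ensemble_coeff` and requires `n_action_steps=1`, so verify this against your installed version.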

Deploying the policy on the robot

Inference is light. ACT runs in real time on the onboard compute (Raspberry Pi 5 / NVIDIA Jetson) or a tethered laptop, streaming action chunks to the robot through the Prometheus SDK and its simple REST API. Because you validated against the simulator and URDF first, gross failures surface before any hardware is at risk, and the first rollout on the robot is rarely a surprise.

Common failure modes — and the fix

When to graduate to a VLA

ACT is excellent at a specific task it has been shown. The moment you need a policy that follows language instructions, or generalizes across objects and scenes it never saw demonstrated, you want a pretrained vision-language-action model that you fine-tune on this same teleoperation data. The dataset you collected here transfers directly — same robot, same format, a far more capable result.

That is the subject of the companion guide on fine-tuning π0.7.

Run this on a real humanoid

Prometheus ships with the teleoperation pipeline, stereo + wrist cameras, URDF, simulator, and SDK you need to start collecting data on day one.