Collecting Teleoperation Data for Imitation Learning on a Consumer GPU with π0.7
Teleoperation data is the fuel of modern robot learning — but what you burn it in matters. Train a policy from scratch and it only ever knows the exact task you demonstrated. Fine-tune a pretrained vision-language-action (VLA) model on the same data and you inherit a generalist that already understands objects, scenes, and language — then specialize it to your robot in an afternoon.
π0.7 is the latest in the π (“pi”) series of VLA models from Physical Intelligence, and it is built for exactly this workflow: collect demonstrations, fine-tune, deploy. This guide covers the full pipeline — collecting language-annotated data on the Prometheus humanoid, fine-tuning π0.7 on a single consumer GPU, evaluating it, and deploying it back on hardware.
New to imitation learning? Start with the companion guide, collecting teleoperation data for ACT. It covers the data-collection fundamentals — observation/action spaces, sampling rate, dataset design — and the dataset you record there feeds π0.7 directly.
What π0.7 is
The π series pairs a vision-language model (VLM) backbone with an action expert that generates continuous, high-frequency action chunks via flow matching. In plain terms: the VLM reads the camera images and a natural-language instruction and forms a rich understanding of the scene; the action expert turns that understanding into smooth motor commands. Flow matching — rather than discretizing actions into tokens — is what lets the policy output precise, continuous trajectories at control rate.
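To make the flow-matching step concrete, here is a toy sketch of how an action chunk can be sampled: start from Gaussian noise and integrate a velocity field with a few Euler steps. The `toy_velocity` function is a hypothetical stand-in for the learned action expert (it cheats by knowing the target chunk), and the chunk size and step count are illustrative assumptions, not π0.7's actual settings.

```python
import random

def toy_velocity(x, t, target):
    """Hypothetical stand-in for the learned action expert.

    For the straight-line path x_t = (1 - t) * noise + t * target, the
    velocity that carries x_t to the target is (target - x_t) / (1 - t).
    A real policy predicts the velocity from images, state, and language.
    """
    return [(g - xi) / (1.0 - t) for xi, g in zip(x, target)]

def sample_chunk(target, num_steps=10, seed=0):
    """Integrate from noise (t=0) towards an action chunk (t=1) with Euler steps."""
    rng = random.Random(seed)
    x = [rng.gauss(0.0, 1.0) for _ in target]  # start from pure noise
    dt = 1.0 / num_steps
    for k in range(num_steps):
        t = k * dt  # never reaches 1.0, so the division above is safe
        v = toy_velocity(x, t, target)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x

# A flattened "chunk" of 4 timesteps x 2 joints (made-up values).
target_chunk = [0.10, -0.20, 0.15, -0.25, 0.20, -0.30, 0.25, -0.35]
chunk = sample_chunk(target_chunk)
```

The point of the sketch is the shape of the loop: the output is a vector of continuous values refined over a handful of integration steps, rather than a sequence of discrete tokens decoded one at a time.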
Crucially, the model is pretrained on a large, diverse corpus of robot interaction data spanning many embodiments and tasks. Out of the box it already maps “pick up the cup” to plausible reaching and grasping. Each release sharpened that recipe: π0 established the flow-matching VLA, π0.5 pushed open-world generalization to environments it had never seen, and π0.7 is the current checkpoint in that lineage. For your purposes the defining property is constant throughout the series: you do not train it from zero — you fine-tune a capable generalist on a few hours of your own demonstrations.
Why fine-tune a VLA instead of training ACT?
- Generalization. Because π0.7 starts from broad pretraining, it can transfer to new object positions, lighting conditions, and even objects that never appeared in your demos, which a from-scratch policy trained on 50 demonstrations rarely manages.
- Language conditioning. You can tell it what to do — “put the blue cube in the bin”, “wipe the table” — and a single checkpoint can cover a family of related tasks instead of one frozen behaviour.
- Fewer demos per skill. Pretraining does the heavy lifting, so each new task needs fewer demonstrations to reach a useful success rate.
- A path to dexterity. Five-finger hands (Prometheus Type X manipulators) have a much richer action space, which is hard to learn from scratch on a small dataset. A pretrained generalist gives you a far better starting point.
The trade-off is size: π0.7 is a large model, so fine-tuning costs more than ACT. The practical answer is parameter-efficient fine-tuning, covered below — which is what keeps this on a single consumer GPU.
Collecting the data on Prometheus
The collection loop is the same one described in the ACT guide. Prometheus ships with the teleoperation pipeline ready, so you record without integration work:
- VR teleop with a Meta Quest 3S, or the leader–follower / manual controller for precise trajectories.
- Observations: the head-mounted stereo pair plus the wrist cameras. The adjustable wrist-camera angle was designed with VLA models like the π series in mind — set it once, keep it fixed across the dataset.
- Proprioception and actions: joint states and the target commands you teleoperate, recorded at a fixed rate in the LeRobot dataset format.
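As a rough picture of what one recorded timestep contains, the sketch below models a frame and checks that frames arrive at a fixed rate. The `Frame` class, its field names, and the camera names are hypothetical and only illustrate the idea; they are not the actual LeRobot schema or the Prometheus SDK.

```python
from dataclasses import dataclass

@dataclass
class Frame:
    """One recorded timestep (hypothetical layout for illustration)."""
    timestamp: float   # seconds since the start of the episode
    images: dict       # e.g. {"head_left": ..., "wrist_right": ...}
    state: list        # measured joint positions (proprioception)
    action: list       # teleoperated target command for this step

def check_fixed_rate(frames, fps, tolerance=0.25):
    """Return indices whose gap to the previous frame deviates from 1/fps
    by more than `tolerance`, expressed as a fraction of the nominal period."""
    period = 1.0 / fps
    bad = []
    for i in range(1, len(frames)):
        gap = frames[i].timestamp - frames[i - 1].timestamp
        if abs(gap - period) > tolerance * period:
            bad.append(i)
    return bad

# Four frames at a nominal 30 fps with one frame dropped after the second.
frames = [
    Frame(timestamp=k / 30, images={}, state=[0.0], action=[0.0])
    for k in (0, 1, 3, 4)
]
dropped = check_fixed_rate(frames, fps=30)
```

A check like this is cheap to run after every session, and it catches dropped frames before they end up in a training set.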
The one thing ACT doesn’t need: language
Annotate every episode with an instruction. A VLA learns to condition on language, so each demonstration needs the natural-language task it shows (“pick up the cup and place it on the saucer”). Write instructions the way a user actually would. Keep the wording for a given skill consistent, with only slight natural variation, and make the instructions for different skills clearly distinct, so the model learns the mapping from language to behaviour rather than memorizing one string. Good annotations are what unlock π0.7’s instruction-following at run time.
# Illustrative: record language-annotated teleop episodes
python -m lerobot.record \
    --robot.type=prometheus \
    --teleop.type=quest3s \
    --dataset.repo_id=yourlab/kitchen_tasks \
    --dataset.single_task="put the blue cube in the bin" \
    --dataset.num_episodes=40 \
    --dataset.fps=30
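Before training, it is worth auditing the annotations. The sketch below assumes each episode is available as a dictionary with hypothetical `skill` and `instruction` keys (how you load them depends on your dataset tooling). It reports episodes with no instruction and lists the distinct phrasings used per skill.

```python
from collections import defaultdict

def audit_instructions(episodes):
    """episodes: list of dicts with hypothetical keys 'skill' and 'instruction'.

    Returns (missing, phrasings): indices of episodes without an instruction,
    and the set of distinct phrasings used for each skill.
    """
    missing = []
    phrasings = defaultdict(set)
    for i, ep in enumerate(episodes):
        text = (ep.get("instruction") or "").strip()
        if not text:
            missing.append(i)
            continue
        # Normalize case and whitespace so trivial differences do not count.
        phrasings[ep["skill"]].add(" ".join(text.lower().split()))
    return missing, dict(phrasings)

episodes = [
    {"skill": "cube_to_bin", "instruction": "put the blue cube in the bin"},
    {"skill": "cube_to_bin", "instruction": "Put the blue cube in the bin "},
    {"skill": "cube_to_bin", "instruction": ""},
    {"skill": "wipe", "instruction": "wipe the table"},
]
missing, phrasings = audit_instructions(episodes)
```

An unexpectedly long list of phrasings for one skill usually points to typos or annotator drift rather than deliberate variation.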
How much data, and how to mix it
Because of pretraining, you often need fewer demos per task than ACT does, but coverage still matters. For a multi-task policy, record each skill as its own labelled set (30–50 episodes is a reasonable start) and fine-tune on the mixture. Spread object positions and lighting deliberately; that variation is what the pretrained backbone amplifies into real generalization.
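One simple way to spread positions and lighting deliberately is to plan the session up front. The sketch below is a hypothetical planning helper, not part of any recording tool. It cycles through every position and lighting combination before repeating any, so the episodes cover the grid evenly. The position and lighting labels are placeholders.

```python
import itertools
import random
from collections import Counter

def coverage_schedule(positions, lightings, num_episodes, seed=0):
    """Assign each episode a (position, lighting) pair, cycling through every
    combination before repeating so that coverage stays even."""
    combos = list(itertools.product(positions, lightings))
    rng = random.Random(seed)
    schedule = []
    while len(schedule) < num_episodes:
        rng.shuffle(combos)        # fresh order on every pass over the grid
        schedule.extend(combos)
    return schedule[:num_episodes]

positions = ["left", "center", "right", "far"]   # placeholder workspace cells
lightings = ["overhead", "window"]               # placeholder conditions
plan = coverage_schedule(positions, lightings, num_episodes=40)
counts = Counter(plan)                           # episodes per combination
```

With 40 episodes over 8 combinations, each combination is recorded five times. Shuffling within each pass keeps the operator from drifting into a fixed routine.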
Fine-tuning π0.7 on a single GPU
With language-annotated demonstrations in hand, fine-tuning runs against the released π0.7 checkpoint. The key decision is full fine-tuning versus LoRA (low-rank adaptation):
- LoRA freezes the base weights and trains small adapter matrices plus the action expert. It cuts trainable parameters and optimizer state dramatically, bringing the memory footprint within reach of a single 24 GB consumer GPU (an RTX 4090, say). This is the path for most labs validating a task.
- Full fine-tuning updates all weights — higher ceiling, but the optimizer state pushes you onto a rented A100/H100. Reach for it only when LoRA plateaus.
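# Illustrative: LoRA fine-tune of pi0.7 on your dataset.
# The entry point, config name, and flags are placeholders; check the openpi docs for the current training script.
python -m openpi.train pi07_lora \
    --dataset.repo_id=yourlab/kitchen_tasks \
    --lora=true \
    --batch_size=4 \
    --num_steps=30000 \
    --output_dir=outputs/pi07_kitchen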
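To see why LoRA shrinks the optimizer state so much, it helps to count parameters for a single dense layer. The layer size and rank below are illustrative assumptions, not π0.7's real dimensions, and the memory figure covers only Adam's two moment buffers, not weights, gradients, or activations.

```python
def lora_param_counts(d_in, d_out, rank):
    """Parameters in a dense layer versus its LoRA adapter.

    The full weight W has d_out * d_in entries. LoRA keeps W frozen and
    learns W + B @ A, with A of shape (rank, d_in) and B of shape (d_out, rank).
    """
    full = d_out * d_in
    adapter = rank * (d_in + d_out)
    return full, adapter

def adam_state_bytes(trainable_params, bytes_per_value=4):
    """Adam keeps two moment estimates per trainable parameter."""
    return 2 * trainable_params * bytes_per_value

full, adapter = lora_param_counts(d_in=2048, d_out=2048, rank=16)
ratio = adapter / full
full_state = adam_state_bytes(full)
adapter_state = adam_state_bytes(adapter)
print(f"full: {full:,}  adapter: {adapter:,}  ratio: {ratio:.4f}")
print(f"Adam state: {full_state:,} B vs {adapter_state:,} B")
```

For this toy layer the adapter trains 65,536 parameters instead of 4,194,304, about 1.6% of the original. The optimizer state shrinks by the same factor, which is where most of the memory saving comes from.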
A few practical notes:
- Watch VRAM, not just parameter count. Activations and image encoders dominate; if you are tight on a 24 GB card, reduce the batch size and use gradient accumulation, or trim image resolution, before giving up on LoRA.
- Guard against forgetting. Aggressive fine-tuning on a narrow dataset can erode the very generalization you chose a pretrained model for. Co-training with a slice of broader data, or simply keeping the run short with LoRA, helps preserve the pretrained prior.
- Normalization. As with any policy, the action/state statistics must match your robot. Recompute them for the Prometheus action space rather than reusing another embodiment’s.
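The normalization step can be sketched in a few lines. This example assumes plain per-dimension z-score normalization over your recorded actions; the actual training code may use a different scheme (quantile-based scaling, for instance), so treat the function names and the approach as illustrative.

```python
import statistics

def compute_stats(vectors):
    """Per-dimension mean and standard deviation over a list of vectors."""
    dims = list(zip(*vectors))
    mean = [statistics.fmean(d) for d in dims]
    # Floor the std so constant dimensions do not cause division by zero.
    std = [max(statistics.pstdev(d), 1e-6) for d in dims]
    return mean, std

def normalize(x, mean, std):
    """Map a raw vector into normalized space for training."""
    return [(xi - m) / s for xi, m, s in zip(x, mean, std)]

def unnormalize(z, mean, std):
    """Map a normalized model output back to robot units."""
    return [zi * s + m for zi, m, s in zip(z, mean, std)]

# Made-up two-joint actions; the second joint never moves in this data.
actions = [[0.0, 1.0], [0.5, 1.0], [1.0, 1.0]]
mean, std = compute_stats(actions)
z = normalize(actions[0], mean, std)
restored = unnormalize(z, mean, std)
```

The same statistics must be used at training time and at deployment: the policy outputs normalized actions, and `unnormalize` with mismatched stats would send wrong joint targets to the robot.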
Evaluating π0.7
Validate in the bundled simulator against the URDF first, then on hardware. Beyond raw success rate, test the things a VLA is supposed to buy you:
- Spatial generalization: object positions you never demonstrated.
- Object generalization: a mug you never showed it, given the same instruction.
- Language generalization: paraphrased instructions (“drop the cube in the container”).
If spatial generalization is weak, you usually need more position variation in the data; if language generalization is weak, diversify your instruction phrasing.
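A small tally script makes these comparisons explicit. The sketch below assumes you log each evaluation trial as an (axis, succeeded) pair. The trial data is made up for illustration, and the 0.2 drop threshold is an arbitrary choice rather than a recommended value.

```python
from collections import defaultdict

def success_by_axis(trials):
    """trials: list of (axis, succeeded) pairs, e.g. ("spatial", True)."""
    totals = defaultdict(lambda: [0, 0])   # axis -> [successes, attempts]
    for axis, ok in trials:
        totals[axis][0] += int(ok)
        totals[axis][1] += 1
    return {axis: s / n for axis, (s, n) in totals.items()}

def weak_axes(rates, baseline="in_distribution", max_drop=0.2):
    """Axes whose success rate falls more than `max_drop` below the baseline."""
    base = rates[baseline]
    return sorted(a for a, r in rates.items() if base - r > max_drop)

# Made-up results: 10 trials per axis.
trials = (
    [("in_distribution", True)] * 9 + [("in_distribution", False)] * 1
    + [("spatial", True)] * 5 + [("spatial", False)] * 5
    + [("object", True)] * 8 + [("object", False)] * 2
    + [("language", True)] * 8 + [("language", False)] * 2
)
rates = success_by_axis(trials)
flagged = weak_axes(rates)
```

In this made-up run only the spatial axis is flagged, which would point to adding more position variation to the dataset.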
Deploying π0.7 on the robot
At inference the policy consumes the live camera streams, the robot state, and a text instruction, and emits action chunks via flow matching. Two deployment shapes are common on Prometheus:
- Tethered GPU. Run the policy on a workstation GPU and stream commands to the robot over the SDK’s REST API. Simplest, and fine for development and many production cells.
- Onboard. Distil or quantize for the NVIDIA Jetson when you need the robot untethered.
Because the model predicts chunks, you get smooth motion without re-querying every step, and you can overlap inference with execution to hide latency — the same real-time-chunking idea that keeps action-chunked policies responsive.
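The scheduling side of this can be illustrated with a small step-based simulation. It is a simplified sketch under stated assumptions: fixed inference latency, a fixed chunk length, and new chunks simply appended to the queue. It does not model how a real system blends a fresh chunk with actions already committed, and the numbers are illustrative.

```python
import math
from collections import deque

def simulate(chunk_len, fps, latency_s, total_steps, overlap=True):
    """Count control steps on which no action was available.

    With overlap=True the next chunk is requested while enough queued actions
    remain to cover the inference latency. With overlap=False the request is
    only made once the queue is empty.
    """
    latency_steps = math.ceil(latency_s * fps)
    trigger = latency_steps if overlap else 0
    queue = deque(range(chunk_len))   # first chunk is ready at start
    pending = None                    # steps until the requested chunk arrives
    stalls = 0
    for _ in range(total_steps):
        if pending is not None:
            pending -= 1
            if pending <= 0:
                queue.extend(range(chunk_len))   # new chunk arrives
                pending = None
        if pending is None and len(queue) <= trigger:
            pending = latency_steps   # request the next chunk in the background
        if queue:
            queue.popleft()           # execute one action this control step
        else:
            stalls += 1               # robot has nothing to execute
    return stalls

stalls_overlapped = simulate(chunk_len=50, fps=30, latency_s=0.25, total_steps=300)
stalls_blocking = simulate(chunk_len=50, fps=30, latency_s=0.25, total_steps=300,
                           overlap=False)
```

Under these assumptions the overlapped schedule never leaves the robot waiting, while the blocking schedule idles for the full inference latency at every chunk boundary.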
ACT or π0.7 — which to use
- Reach for ACT when you have one well-defined task, want the cheapest possible training, and don’t need language or generalization. It is the fastest way to a working policy.
- Reach for π0.7 when you need instruction-following, generalization to new objects/positions, or one model covering several related skills — and you can afford a LoRA fine-tune on a 24 GB GPU.
The best part: you don’t have to decide upfront. Collect demonstrations once, in LeRobot format, on Prometheus, and the same dataset trains ACT today and fine-tunes π0.7 tomorrow.
The practical recipe
- Record 30–50 language-annotated teleop demos per task on Prometheus (VR or leader–follower).
- Store them in LeRobot format with correct normalization stats.
- LoRA fine-tune π0.7 on a single 24 GB GPU; keep runs short to preserve the pretrained prior.
- Evaluate spatial / object / language generalization in sim, then deploy via the SDK.
Model names, capabilities, and commands follow Physical Intelligence’s π-series and the openpi project; check the current openpi documentation for exact checkpoints, training scripts, and hardware requirements before you rely on specific figures.
Run this on a real humanoid
Prometheus ships with the teleoperation pipeline, stereo + wrist cameras, URDF, simulator, and SDK you need to start collecting data on day one.