
Collecting Teleoperation Data for Imitation Learning on a Consumer GPU with π0.7

June 9, 2026 · Prometheus Robotics

Teleoperation data is the fuel of modern robot learning — but what you burn it in matters. Train a policy from scratch and it only ever knows the exact task you demonstrated. Fine-tune a pretrained vision-language-action (VLA) model on the same data and you inherit a generalist that already understands objects, scenes, and language — then specialize it to your robot in an afternoon.

π0.7 is the latest in the π (“pi”) series of VLA models from Physical Intelligence, and it is built for exactly this workflow: collect demonstrations, fine-tune, deploy. This guide covers the full pipeline — collecting language-annotated data on the Prometheus humanoid, fine-tuning π0.7 on a single consumer GPU, evaluating it, and deploying it back on hardware.

New to imitation learning? Start with the companion guide, collecting teleoperation data for ACT. It covers the data-collection fundamentals — observation/action spaces, sampling rate, dataset design — and the dataset you record there feeds π0.7 directly.

What π0.7 is

The π series pairs a vision-language model (VLM) backbone with an action expert that generates continuous, high-frequency action chunks via flow matching. In plain terms: the VLM reads the camera images and a natural-language instruction and forms a rich understanding of the scene; the action expert turns that understanding into smooth motor commands. Flow matching — rather than discretizing actions into tokens — is what lets the policy output precise, continuous trajectories at control rate.
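
As a rough illustration of the sampling side of flow matching, the sketch below starts from Gaussian noise and integrates a velocity field with Euler steps until it lands on an action chunk. It is not the π-series implementation: `velocity_field` is a hypothetical stand-in that is handed the target chunk directly, whereas a trained action expert would predict the velocity from images, robot state, and the instruction. The chunk length, action dimension, and step count are arbitrary choices for the example.

```python
import numpy as np

def velocity_field(actions, t, target):
    """Toy stand-in for the learned action expert (hypothetical).

    Returns the velocity of the straight-line path that carries `actions`
    to `target` by t = 1. A real model would predict this velocity from
    its inputs instead of being given the target.
    """
    return (target - actions) / (1.0 - t)

def sample_action_chunk(target, num_steps=10, seed=0):
    """Integrate da/dt = v(a, t) from t = 0 (noise) towards t = 1."""
    rng = np.random.default_rng(seed)
    actions = rng.standard_normal(target.shape)  # start from pure noise
    dt = 1.0 / num_steps
    for k in range(num_steps):
        t = k * dt  # t stays strictly below 1, so no division by zero
        actions = actions + dt * velocity_field(actions, t, target)
    return actions

# A chunk of 50 timesteps for a 7-DoF arm (both sizes are illustrative).
target_chunk = np.linspace(0.0, 1.0, 50)[:, None] * np.ones((1, 7))
chunk = sample_action_chunk(target_chunk)
print(chunk.shape, float(np.abs(chunk - target_chunk).max()))
```

The result is a real-valued array of shape (timesteps, action dimensions), so no quantization into action tokens is involved.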

Crucially, the model is pretrained on a large, diverse corpus of robot interaction data spanning many embodiments and tasks. Out of the box it already maps “pick up the cup” to plausible reaching and grasping. Each release sharpened that recipe: π0 established the flow-matching VLA, π0.5 pushed open-world generalization to environments it had never seen, and π0.7 is the current checkpoint in that lineage. For your purposes the defining property is constant throughout the series: you do not train it from zero — you fine-tune a capable generalist on a few hours of your own demonstrations.

Why fine-tune a VLA instead of training ACT?

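A fine-tuned VLA starts from what pretraining already taught it: an understanding of objects, scenes, and language. That is what lets it follow instructions and, often, learn a task from fewer demonstrations than ACT needs. The trade-off is size: π0.7 is a large model, so fine-tuning costs more than ACT. The practical answer is parameter-efficient fine-tuning, covered below, which is what keeps this on a single consumer GPU.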

Collecting the data on Prometheus

The collection loop is the same one described in the ACT guide. Prometheus ships with the teleoperation pipeline ready, so you record without integration work:

The one thing ACT doesn’t need: language

Annotate every episode with an instruction. A VLA learns to condition on language, so each demonstration needs the natural-language task it shows (“pick up the cup and place it on the saucer”). Write instructions the way a user actually would and keep the wording recognizably consistent for each skill, but allow a few natural paraphrases across that skill’s episodes so the model learns the mapping rather than memorizing one string. Good annotations are what unlock π0.7’s instruction-following at run time.

# Illustrative — record language-annotated teleop episodes
python -m lerobot.record \
    --robot.type=prometheus \
    --teleop.type=quest3s \
    --dataset.repo_id=yourlab/kitchen_tasks \
    --dataset.single_task="put the blue cube in the bin" \
    --dataset.num_episodes=40 \
    --dataset.fps=30
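
Before training, it can help to audit the annotations for the two failure modes described above: episodes with no instruction, and skills whose wording has drifted into too many variants. The sketch below is one way to do that. The dictionary keys (`id`, `skill`, `instruction`) and the limit of three phrasings per skill are assumptions for the example, not part of any dataset format.

```python
from collections import defaultdict

def audit_annotations(episodes, max_phrasings_per_skill=3):
    """Return a list of human-readable problems found in episode annotations.

    `episodes` is a list of dicts with hypothetical keys "id", "skill",
    and "instruction"; adapt the keys to however your dataset stores them.
    """
    problems = []
    phrasings = defaultdict(set)
    for ep in episodes:
        text = (ep.get("instruction") or "").strip()
        if not text:
            problems.append(f"episode {ep['id']}: missing instruction")
            continue
        # Compare case-insensitively so "Put the" and "put the" count once.
        phrasings[ep["skill"]].add(text.lower())
    for skill, variants in sorted(phrasings.items()):
        if len(variants) > max_phrasings_per_skill:
            problems.append(
                f"skill {skill!r}: {len(variants)} distinct phrasings "
                f"(limit {max_phrasings_per_skill})"
            )
    return problems

episodes = [
    {"id": 0, "skill": "cube_to_bin", "instruction": "put the blue cube in the bin"},
    {"id": 1, "skill": "cube_to_bin", "instruction": "Put the blue cube in the bin"},
    {"id": 2, "skill": "cube_to_bin", "instruction": ""},
]
for problem in audit_annotations(episodes):
    print(problem)
```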

How much data, and how to mix it

Because of pretraining, you often need fewer demos per task than ACT — but coverage still matters. For a multi-task policy, record each skill as its own labelled set (30–50 episodes is a reasonable start) and fine-tune on the mixture. Spread object positions and lighting deliberately; that variation is what the pretrained backbone amplifies into real generalization.
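
One way to check whether object positions are actually spread out is to bin the starting positions of your episodes on a grid over the workspace and see how many cells are empty. The sketch below assumes you log an (x, y) object position per episode; the workspace bounds and the 4 x 4 grid are illustrative choices.

```python
import numpy as np

def position_coverage(start_positions, workspace, bins=4):
    """Fraction of workspace grid cells containing at least one demo.

    start_positions: iterable of (x, y) object positions in metres, one
        per episode (how you log these is up to your setup).
    workspace: ((x_min, x_max), (y_min, y_max)) bounds of the table area.
    """
    points = np.asarray(start_positions, dtype=float)
    counts, _, _ = np.histogram2d(
        points[:, 0], points[:, 1], bins=bins,
        range=[list(workspace[0]), list(workspace[1])],
    )
    return float((counts > 0).mean()), counts

# Three demos clustered in one corner of a 0.4 m x 0.4 m workspace.
clustered = [(0.05, 0.05), (0.06, 0.04), (0.07, 0.08)]
coverage, counts = position_coverage(clustered, ((0.0, 0.4), (0.0, 0.4)))
print(f"{coverage:.0%} of grid cells covered")
```

A low fraction suggests the demonstrations are clustered and the next recording session should target the empty cells.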

Fine-tuning π0.7 on a single GPU

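With language-annotated demonstrations in hand, fine-tuning runs against the released π0.7 checkpoint. The key decision is full fine-tuning versus LoRA (low-rank adaptation). Full fine-tuning updates every weight in the model; LoRA freezes the pretrained weights and trains only small low-rank adapter matrices, which is the parameter-efficient route that keeps the job on a single consumer GPU: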

# Illustrative — LoRA fine-tune of pi0.7 on your dataset
python -m openpi.train pi07_lora \
    --dataset.repo_id=yourlab/kitchen_tasks \
    --lora=true \
    --batch_size=4 \
    --num_steps=30000 \
    --output_dir=outputs/pi07_kitchen
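
To see why LoRA shrinks the training footprint, consider a single linear layer. LoRA keeps the pretrained matrix W frozen and learns a low-rank update, so the effective weight is W + (alpha / r) * B A, where only A and B are trained. The NumPy sketch below is a generic illustration of that idea, not openpi code; the layer size, rank, and alpha are arbitrary values for the example.

```python
import numpy as np

class LoRALinear:
    """A frozen linear layer plus a trainable low-rank update (sketch)."""

    def __init__(self, weight, rank=16, alpha=32.0, seed=0):
        rng = np.random.default_rng(seed)
        d_out, d_in = weight.shape
        self.weight = weight                               # frozen, pretrained
        self.A = rng.standard_normal((rank, d_in)) * 0.01  # trainable
        self.B = np.zeros((d_out, rank))                   # trainable, starts at 0
        self.scale = alpha / rank

    def __call__(self, x):
        # Base output plus the low-rank correction x A^T B^T.
        return x @ self.weight.T + self.scale * (x @ self.A.T) @ self.B.T

    def trainable_params(self):
        return self.A.size + self.B.size

rng = np.random.default_rng(1)
W = rng.standard_normal((1024, 1024))
layer = LoRALinear(W, rank=16)
x = rng.standard_normal((4, 1024))

print("full fine-tuning params:", W.size)
print("LoRA trainable params:  ", layer.trainable_params())
print("unchanged at init:", np.allclose(layer(x), x @ W.T))
```

Because B starts at zero, the adapted layer reproduces the pretrained layer exactly at the start of training, and optimizer state is only needed for A and B.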

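Two practical notes: keep runs short to preserve the pretrained prior, and make sure the dataset carries correct normalization stats.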

Evaluating π0.7

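Validate in the bundled simulator against the URDF first, then on hardware. Beyond raw success rate, test the things a VLA is supposed to buy you: spatial generalization (object positions that differ from the demonstrations), object generalization (objects the demonstrations did not include), and language generalization (instructions phrased differently from your annotations).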

If spatial generalization is weak, you usually need more position variation in the data; if language generalization is weak, diversify your instruction phrasing.
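
To tell which kind of generalization is weak, it helps to log each evaluation trial with the condition it tests and report success rates per condition rather than one overall number. A minimal sketch follows; the condition names and the trial outcomes are placeholders made up for the example, not measurements.

```python
from collections import defaultdict

def success_by_condition(trials):
    """Map each condition to its success rate.

    `trials` is an iterable of (condition, succeeded) pairs.
    """
    totals = defaultdict(lambda: [0, 0])  # condition -> [successes, attempts]
    for condition, succeeded in trials:
        totals[condition][0] += int(succeeded)
        totals[condition][1] += 1
    return {c: wins / attempts for c, (wins, attempts) in totals.items()}

# Placeholder outcomes purely to show the shape of the log.
trials = [
    ("in_distribution", True), ("in_distribution", True),
    ("spatial", True), ("spatial", False),
    ("language", False), ("language", False),
]
for condition, rate in success_by_condition(trials).items():
    print(f"{condition:>16}: {rate:.0%}")
```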

Deploying π0.7 on the robot

At inference the policy consumes the live camera streams, the robot state, and a text instruction, and emits action chunks via flow matching. Two deployment shapes are common on Prometheus:

Because the model predicts chunks, you get smooth motion without re-querying every step, and you can overlap inference with execution to hide latency — the same real-time-chunking idea that keeps action-chunked policies responsive.
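
The overlap of inference and execution can be sketched with a background worker: while the robot plays out the tail of the current chunk, the next chunk is already being computed. This is a simplified illustration, not the real-time-chunking method itself, which also has to keep consecutive chunks consistent with each other. `dummy_policy`, the chunk length, and the prefetch threshold are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor
import numpy as np

CHUNK_LEN = 10    # actions per chunk (illustrative)
PREFETCH_AT = 3   # request the next chunk when this many actions remain

def dummy_policy(observation):
    """Hypothetical stand-in for model inference: returns CHUNK_LEN actions."""
    return np.full((CHUNK_LEN, 2), float(observation))

def run_control_loop(num_steps):
    """Execute one action per step, prefetching the next chunk in the background."""
    executed = []
    with ThreadPoolExecutor(max_workers=1) as pool:
        chunk, index, pending = dummy_policy(0), 0, None
        for step in range(num_steps):
            remaining = CHUNK_LEN - index
            if pending is None and remaining <= PREFETCH_AT:
                # Start inference now, using the observation at this step.
                pending = pool.submit(dummy_policy, step)
            if index == CHUNK_LEN:
                # Current chunk is used up: switch to the prefetched one.
                chunk, index, pending = pending.result(), 0, None
            executed.append(chunk[index])  # on a robot: send to the controller
            index += 1
    return np.array(executed)

trajectory = run_control_loop(25)
print(trajectory[:, 0])
```

Note that each new chunk is computed from an observation that is a few steps old by the time it starts executing; the prefetch threshold trades that staleness against the risk of stalling while inference finishes.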

ACT or π0.7 — which to use

The best part is that you don’t have to decide upfront. Collect demonstrations once, in LeRobot format, on Prometheus, and the same dataset trains ACT today and fine-tunes π0.7 tomorrow.

The practical recipe

  1. Record 30–50 language-annotated teleop demos per task on Prometheus (VR or leader–follower).
  2. Store them in LeRobot format with correct normalization stats.
  3. LoRA fine-tune π0.7 on a single 24 GB GPU; keep runs short to preserve the pretrained prior.
  4. Evaluate spatial / object / language generalization in sim, then deploy via the SDK.
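
As a rough illustration of what the normalization stats in step 2 are: in the simplest case they are per-dimension means and standard deviations of the recorded states and actions, used to scale data before training and to undo the scaling at inference. Training frameworks typically compute and store these for you, and may use a different scheme; the sketch below assumes plain mean/std normalization and uses synthetic data.

```python
import numpy as np

def compute_norm_stats(actions):
    """Per-dimension mean and standard deviation over all recorded steps."""
    return {"mean": actions.mean(axis=0), "std": actions.std(axis=0)}

def normalize(actions, stats, eps=1e-6):
    # eps guards against division by zero for dimensions that never move.
    return (actions - stats["mean"]) / (stats["std"] + eps)

def unnormalize(normalized, stats, eps=1e-6):
    return normalized * (stats["std"] + eps) + stats["mean"]

# Synthetic "actions": 1000 steps, 3 dimensions with very different scales.
rng = np.random.default_rng(0)
actions = rng.standard_normal((1000, 3)) * np.array([0.01, 1.0, 100.0]) + 5.0

stats = compute_norm_stats(actions)
scaled = normalize(actions, stats)
print(scaled.mean(axis=0).round(3), scaled.std(axis=0).round(3))
```

Stats computed on one dataset and applied to another are a common source of silently wrong actions, which is why the recipe calls them out.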

Model names, capabilities, and commands follow Physical Intelligence’s π-series and the openpi project; check the current openpi documentation for exact checkpoints, training scripts, and hardware requirements before you rely on specific figures.

Run this on a real humanoid

Prometheus ships with the teleoperation pipeline, stereo + wrist cameras, URDF, simulator, and SDK you need to start collecting data on day one.