Choosing Wrist Cameras and the Mounting Angle for VLA Policies
Ask anyone who has trained a manipulation policy what mattered most, and surprisingly often the answer is not the model — it is the camera. For vision-language-action (VLA) policies in particular, the wrist camera and the angle it is mounted at quietly decide whether the policy ever learns to grasp reliably. It is one of the cheapest things to get right and one of the most expensive things to get wrong, because a bad angle poisons every demonstration you collect.
This guide explains why the wrist view is so important for VLAs, how to choose the mounting angle in practice, and the lens, calibration, and consistency details that make the difference — using the Prometheus humanoid, whose wrist-camera angle is adjustable precisely for this reason.
Why VLAs lean on the wrist camera
Manipulation policies use two complementary viewpoints:
- Eye-to-hand — a fixed, scene-level view (Prometheus’ head-mounted stereo pair). It gives context: where objects are, how the scene is laid out, what the instruction refers to.
- Eye-in-hand — the wrist camera that travels with the gripper. It gives the close-in detail that decides a grasp: the exact gap between fingertips and object, contact, alignment.
The reason the wrist view matters so much for VLAs is the last few centimetres. As the hand approaches an object, a scene camera’s view of the contact point gets blocked by the arm and gripper. The wrist camera, moving with the hand, keeps the target in frame exactly when precision matters most. Policies trained with a good eye-in-hand view are markedly more robust to object position, because they can servo to what they see rather than memorizing absolute coordinates from a fixed camera.
The head camera tells the policy what and where; the wrist camera tells it how to close the last gap. VLAs use both — but it’s the wrist view that usually separates a 40% success rate from a 90% one on contact-rich tasks.
Choosing the mounting angle
The mounting angle is the tilt of the wrist camera relative to the gripper’s approach axis. There is no single correct number — it depends on your tasks — but the trade-off is consistent:
- Too shallow (camera looking straight ahead, parallel to the fingers): you see far down range but lose sight of the fingertips and the grasp point as you close in. The policy is blind at the moment of contact.
- Too steep (camera looking sharply down at the gripper): you see the fingertips beautifully but lose the target until the hand is almost on top of it, so the approach is under-informed.
- The sweet spot keeps both the fingertips and the target object in frame through the entire approach-and-grasp, with the contact point near the centre of the image at the moment of grasp.
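To make the trade-off concrete, here is a minimal side-view sketch of the geometry. It assumes an idealized pinhole camera and a made-up layout: camera 6 cm above the approach axis, fingertips 10 cm ahead of it, and a 60° vertical field of view. None of these numbers come from Prometheus, and the function names are illustrative only. The sketch sweeps the tilt and reports which angles keep both the fingertips and the target in frame for the whole approach.

```python
import math

# Hypothetical layout (side view, gripper frame): z points along the approach
# axis, y points up. The camera sits at z = 0, CAM_HEIGHT_M above the axis.
CAM_HEIGHT_M = 0.06
FINGERTIP_Z_M = 0.10
VFOV_DEG = 60.0


def angle_off_axis_deg(point_z_m, tilt_deg, cam_height_m=CAM_HEIGHT_M):
    """Signed angle between the optical axis and the ray to a point on the approach axis.

    tilt_deg is how far the camera is pitched down from 'parallel to the fingers'.
    A positive result means the point appears below the image centre.
    """
    below_forward = math.degrees(math.atan2(cam_height_m, point_z_m))
    return below_forward - tilt_deg


def in_frame(point_z_m, tilt_deg, vfov_deg=VFOV_DEG):
    """True if the point falls inside the vertical field of view."""
    return abs(angle_off_axis_deg(point_z_m, tilt_deg)) <= vfov_deg / 2


def usable_tilts(start_gap_m=0.30, steps=31):
    """Whole-degree tilts that keep fingertips and target visible from start to contact."""
    # Target position along z as the fingertip-to-target gap closes to zero.
    target_zs = [FINGERTIP_Z_M + start_gap_m * i / (steps - 1) for i in range(steps)]
    return [
        tilt
        for tilt in range(0, 91)
        if in_frame(FINGERTIP_Z_M, tilt) and all(in_frame(z, tilt) for z in target_zs)
    ]


if __name__ == "__main__":
    tilts = usable_tilts()
    print(f"usable tilt range: {tilts[0]} to {tilts[-1]} degrees")
    offset = angle_off_axis_deg(FINGERTIP_Z_M, 31)
    print(f"contact point offset from image centre at 31 degrees: {offset:.2f} degrees")
```

With these assumed numbers, a camera parallel to the fingers (0°) cannot see the fingertips, and a 60° tilt cannot see a target 30 cm ahead of the fingertips. A tilt near 31° puts the contact point close to the image centre at grasp. A calculation like this only narrows the range; the live teleoperation check remains the real test.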
A practical method to pick it
- Pick the hardest grasp in your task set (small object, tight clearance).
- Teleoperate that grasp slowly while watching the live wrist feed. Adjust the Prometheus wrist angle until the fingertips and the object stay visible from the start of the approach all the way to contact.
- Check your other tasks at that angle — a tabletop pick and a shelf reach frame very differently. Find the angle that serves the whole task set, not just one grasp.
- Lock it, write it down, and don’t touch it again.
Consistency beats perfection. A slightly suboptimal angle used for every demonstration trains a fine policy. A great angle that drifts between demos trains nothing — to the model, a changed wrist angle is a different camera, and the dataset becomes inconsistent. This is the single most common way teams quietly ruin a dataset. Fix the angle before episode one and keep it identical through collection, evaluation, and deployment.
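One way to enforce this is to store the locked camera settings with every episode and check them before training. The sketch below is a hypothetical layout, not a Prometheus SDK feature: the `WristCamConfig` fields and the episode dictionary are assumptions chosen for illustration.

```python
import hashlib
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class WristCamConfig:
    """Hypothetical record of the settings that must not change between demos."""
    tilt_deg: float
    width: int
    height: int
    fps: int


def fingerprint(cfg: WristCamConfig) -> str:
    """Short, stable hash of the config, suitable for storing in episode metadata."""
    payload = json.dumps(asdict(cfg), sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:12]


def find_drifted_episodes(locked: WristCamConfig, episodes: dict) -> list:
    """Return the ids of episodes whose recorded config differs from the locked one."""
    want = fingerprint(locked)
    return [ep_id for ep_id, cfg in episodes.items() if fingerprint(cfg) != want]


if __name__ == "__main__":
    locked = WristCamConfig(tilt_deg=30.0, width=640, height=480, fps=60)
    episodes = {
        "ep_000": WristCamConfig(30.0, 640, 480, 60),
        "ep_001": WristCamConfig(35.0, 640, 480, 60),  # someone bumped the mount
    }
    print(find_drifted_episodes(locked, episodes))
```

Running the same check at evaluation and deployment time catches a mismatched setup before it shows up as an unexplained drop in success rate. It only detects changes that were recorded, so a physically bumped mount still needs a visual check against a reference frame.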
Lens, field of view, and image quality
- Field of view. A moderately wide FOV is forgiving — it keeps the target in frame even when your approach isn’t perfectly aligned. Too wide and you add distortion and shrink the object in pixels; too narrow and small motions throw the target out of frame.
- Resolution. Higher isn’t automatically better — most VLAs downsample to a fixed input size, and bigger images cost VRAM during fine-tuning. Match what your policy expects rather than maxing out the sensor.
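- Exposure and motion blur. The wrist camera moves fast, so prioritize a short exposure time (a fast shutter), with enough light to support it, and a frame rate high enough to capture the approach. Blur is set by exposure time, so raising the frame rate alone does not remove it. Blurry contact frames are exactly the frames the policy needs sharp.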
- Latency and sync. The wrist image must be timestamped and aligned with the joint states and head camera. Lag between “what the wrist saw” and “what the arm did” corrupts the action labels — the same synchronization discipline covered in the data-collection guides.
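The field-of-view and blur trade-offs above can be estimated with pinhole-camera arithmetic before buying a lens. The sketch below is a rough approximation under stated assumptions: an ideal pinhole model with no distortion, an object facing the camera, and motion perpendicular to the optical axis. The function names and example numbers are illustrative, not taken from any specific camera.

```python
import math


def focal_length_px(image_width_px, hfov_deg):
    """Focal length in pixels for an ideal pinhole camera with the given horizontal FOV."""
    return (image_width_px / 2) / math.tan(math.radians(hfov_deg) / 2)


def object_width_px(object_width_m, distance_m, image_width_px, hfov_deg):
    """Approximate width in pixels of an object facing the camera at a given distance."""
    return focal_length_px(image_width_px, hfov_deg) * object_width_m / distance_m


def motion_blur_px(speed_m_s, exposure_s, distance_m, image_width_px, hfov_deg):
    """Approximate blur streak length for relative motion perpendicular to the optical axis."""
    travel_m = speed_m_s * exposure_s  # how far the scene moves while the shutter is open
    return focal_length_px(image_width_px, hfov_deg) * travel_m / distance_m


if __name__ == "__main__":
    # A 3 cm object seen from 20 cm in a 640-pixel-wide image.
    for hfov in (90, 120):
        px = object_width_px(0.03, 0.20, 640, hfov)
        print(f"hfov {hfov} deg: object spans about {px:.1f} px")

    # Hand moving sideways at 0.5 m/s relative to a surface 20 cm away.
    for exposure_ms in (10, 2):
        blur = motion_blur_px(0.5, exposure_ms / 1000, 0.20, 640, 90)
        print(f"exposure {exposure_ms} ms: blur of about {blur:.1f} px")
```

Under these assumptions, widening the lens from 90° to 120° shrinks the object from about 48 to about 28 pixels, and cutting exposure from 10 ms to 2 ms reduces the blur from about 8 pixels to under 2. Compare these figures against the input size your policy downsamples to, since that is the resolution the model actually sees.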
Calibration and what the policy sees
For most imitation-learning setups you feed the wrist image directly and let the policy learn the geometry implicitly — you don’t need perfect extrinsics for a VLA to work. But it pays to know the camera’s pose relative to the gripper, for two reasons: it lets you reproduce the exact framing if hardware is swapped, and it lets you replay or augment data correctly. Record the mounting angle and extrinsics alongside the dataset so the setup is reproducible months later.
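A setup record can be as simple as a JSON file saved next to the dataset. The sketch below is one possible layout, not a Prometheus format. It assumes a gripper frame with z along the approach axis and models the mount as a pure rotation about the gripper's x axis plus a translation. The sign of the tilt depends on your own axis conventions, and a real mount may also need roll and yaw terms from a proper hand-eye calibration.

```python
import json
import math


def gripper_T_camera(tilt_deg, offset_xyz_m):
    """4x4 homogeneous transform for a camera tilted about the gripper x axis.

    Assumed convention: rotation R_x(tilt) followed by a translation to the
    camera's mounting position, both expressed in the gripper frame.
    """
    c = math.cos(math.radians(tilt_deg))
    s = math.sin(math.radians(tilt_deg))
    x, y, z = offset_xyz_m
    return [
        [1.0, 0.0, 0.0, float(x)],
        [0.0, c, -s, float(y)],
        [0.0, s, c, float(z)],
        [0.0, 0.0, 0.0, 1.0],
    ]


def setup_record(tilt_deg, offset_xyz_m, notes=""):
    """Bundle the mounting angle and extrinsics into a JSON-serializable dict."""
    return {
        "wrist_camera": {
            "tilt_deg": float(tilt_deg),
            "offset_xyz_m": [float(v) for v in offset_xyz_m],
            "gripper_T_camera": gripper_T_camera(tilt_deg, offset_xyz_m),
            "notes": notes,
        }
    }


if __name__ == "__main__":
    record = setup_record(30.0, (0.0, 0.06, -0.10), notes="locked before episode 0")
    # Write this string to a file stored alongside the dataset.
    print(json.dumps(record, indent=2))
```

Keeping the human-readable angle and the full transform in the same record means the framing can be reproduced by hand after a hardware swap and consumed by replay or augmentation code without re-deriving anything.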
One wrist camera or two?
A single wrist camera on the working hand covers most single-arm manipulation. For bimanual tasks — handovers, two-handed assembly — give each arm its own wrist camera so both grasps are observed; a VLA trained on both views coordinates the hands far better than one fed a single viewpoint. On Prometheus this maps onto the Type X manipulator options, which support wrist cameras (and five-finger hands) per arm.
How Prometheus is set up for this
The platform was built with this exact problem in mind:
- Adjustable wrist-camera angle, designed so you can dial in the eye-in-hand view that VLA models like π0.5 and π0.7 rely on, then lock it for the whole dataset.
- Head-mounted stereo for the scene-level context that complements the wrist view.
- Synchronized capture through the SDK, so wrist frames, head frames, and joint states share timestamps out of the box.
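If you want to audit alignment on logged data, or you are merging streams recorded outside the SDK, nearest-timestamp matching with a skew limit is a common approach. The sketch below is generic Python and does not use or represent the Prometheus SDK API. The function names and the 10 ms default limit are assumptions to tune for your own rates.

```python
from bisect import bisect_left


def nearest_index(sorted_ts, t):
    """Index of the timestamp in sorted_ts closest to t (sorted_ts must be non-empty)."""
    i = bisect_left(sorted_ts, t)
    if i == 0:
        return 0
    if i == len(sorted_ts):
        return len(sorted_ts) - 1
    before, after = sorted_ts[i - 1], sorted_ts[i]
    return i if (after - t) < (t - before) else i - 1


def align(wrist_ts, joint_ts, max_skew_s=0.010):
    """Pair each wrist frame with its nearest joint sample, dropping pairs that are too far apart.

    Both inputs are timestamps in seconds on a shared clock; joint_ts must be sorted.
    Returns a list of (wrist_index, joint_index) tuples.
    """
    if not joint_ts:
        return []
    pairs = []
    for wi, t in enumerate(wrist_ts):
        ji = nearest_index(joint_ts, t)
        if abs(joint_ts[ji] - t) <= max_skew_s:
            pairs.append((wi, ji))
    return pairs


if __name__ == "__main__":
    wrist = [0.000, 0.033, 0.066, 0.200]  # last frame arrives after joint logging stopped
    joints = [0.000, 0.010, 0.020, 0.030, 0.040, 0.050, 0.060, 0.070]
    print(align(wrist, joints))
```

Frames that fail the skew limit are dropped rather than paired with a stale joint state, since a wrong action label is worse than a missing sample. Counting the dropped frames per episode is a quick health check on the capture pipeline.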
Common mistakes
- Changing the angle mid-dataset — the number-one silent killer. Lock it first.
- An angle that loses the fingertips at contact — the policy goes blind exactly when it needs to see.
- Motion blur on the approach: shorten the exposure (use a faster shutter) and raise the frame rate if needed.
- Mismatched framing between training and deployment — deploy with the identical wrist setup you trained on.
Where this fits
Camera placement is upstream of everything else: it shapes the data you collect, which shapes whatever policy you train on it. Once your wrist view is dialed in and locked, you’re ready to collect demonstrations and train — whether that’s ACT from scratch or a fine-tuned π0.7. Get the angle right first, and both work far better.
Run this on a real humanoid
Prometheus ships with the teleoperation pipeline, stereo + wrist cameras, URDF, simulator, and SDK you need to start collecting data on day one.