Wh0

50k WM-H Episodes · 4.7× Zero-Shot Improvement · 18 Real-World Tasks · 38.9% Success Rate (Wh0)

Abstract

Scaling dexterous manipulation requires generalization across objects, scenes, and tasks, yet existing data sources face a trade-off between scale and scene/embodiment alignment: teleoperation data is well aligned with robot deployment but expensive to collect; simulation is scalable but limited by the sim-to-real gap; and real egocentric videos scale effectively but remain misaligned with robot deployment. We propose Wh0, a framework that uses generative video world models as scalable and controllable sources of egocentric human-hand manipulation data to unlock the manipulation capabilities of pretrained dexterous VLA models. Conditioned on language, objects, and scenes, Wh0 uses a generative world model to produce WM-H, a 50k-episode dataset of egocentric human-object interaction videos. Wh0 then converts the generated videos into robot-trainable supervision through hand motion reconstruction and visual editing. Co-trained with a limited amount of real robot data, WM-H adapts pretrained VLA models to dexterous manipulation deployment. Across 18 real-world dexterous manipulation tasks, compared with a model post-trained only on robot data, Wh0 improves zero-shot success on unseen tasks from 8.3% to 38.9%. Ablation studies further show that scalable generation and scene/embodiment alignment are key drivers of performance gains.

Overview of Wh0. Top: WM-H provides world-model-generated egocentric manipulation videos with diverse objects, layouts, and hand-object interactions. Middle: WM-H combines scale with a small scene and embodiment gap to deployment; Wh0 converts the generated videos into robot-trainable supervision and co-trains on them together with limited robot data, on top of a human-video-pretrained VLA. Bottom: The resulting policy generalizes zero-shot to unseen tasks, environments, and instructions in real-world manipulation.

WM-H Dataset Construction

[Figure: WM-H data synthesis pipeline. Panel labels, in page order:]

Instruction generation: Agent 1 Vocabulary Discovery (example words: box, hard, clock, white); Agent 2 Balanced Sampling; Instruction.

Video generation: Capture; Qwen-Image-Edit (Insert Objects); Wan-I2V · 4-step LightX2V; Qwen3-VL augmented_text.

Embodiment alignment: Human Hand; Qwen-Image-Edit; Robot Hand.

Hand motion reconstruction: HaWoR → MANO 3D Pose.

Compute-driven scaling, zero human labor: 5.44 GPU-hrs / 1k videos, × 50 to reach 50k episodes.

Properties: Scene alignment · Embodiment alignment · No labor · Scale with GPUs
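The compute figures quoted in the pipeline figure (5.44 GPU-hrs per 1k videos, scaled × 50) can be turned into a rough total. The sketch below is only back-of-the-envelope arithmetic; it assumes cost grows linearly with the number of videos, which the page suggests but does not state outright, and the function name is made up for illustration.

```python
# Back-of-the-envelope compute estimate from the figures quoted on the page.
# Assumption: generation cost scales linearly with the number of videos.

GPU_HOURS_PER_1K_VIDEOS = 5.44   # quoted on the page
TARGET_EPISODES = 50_000         # size of WM-H


def estimate_gpu_hours(num_videos: int, hours_per_1k: float = GPU_HOURS_PER_1K_VIDEOS) -> float:
    """Hypothetical helper: linear extrapolation of generation cost."""
    return hours_per_1k * (num_videos / 1_000)


total_gpu_hours = estimate_gpu_hours(TARGET_EPISODES)
gpu_seconds_per_video = GPU_HOURS_PER_1K_VIDEOS / 1_000 * 3_600

print(f"Estimated total: {total_gpu_hours:.0f} GPU-hrs")
print(f"Per video: {gpu_seconds_per_video:.1f} GPU-seconds")
```

Under this linear assumption, the full 50k-episode dataset comes to about 272 GPU-hours, or roughly 20 GPU-seconds per video.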

Instruction Generation

A dual-agent LLM system discovers object nouns & adjectives, then preferentially samples under-represented words to compose balanced manipulation instructions.
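One way to picture the balanced-sampling step is inverse-frequency sampling over a discovered vocabulary. The sketch below is not the authors' implementation: the LLM agents are replaced by a fixed word list (taken from the example words on the page), the instruction template is invented, and the weighting rule `1 / (1 + count)` is an assumption chosen for simplicity.

```python
import random
from collections import Counter

# Stand-in for Agent 1's output; the real system discovers these with an LLM.
NOUNS = ["box", "clock"]
ADJECTIVES = ["hard", "white"]


def sample_underrepresented(words, usage, rng):
    """Pick one word, favouring words that have been used less often so far."""
    # Assumed rule: a word used n times gets weight 1 / (1 + n).
    weights = [1.0 / (1 + usage[w]) for w in words]
    word = rng.choices(words, weights=weights, k=1)[0]
    usage[word] += 1
    return word


def compose_instruction(usage, rng):
    """Hypothetical template; the page does not give the real prompt format."""
    adjective = sample_underrepresented(ADJECTIVES, usage, rng)
    noun = sample_underrepresented(NOUNS, usage, rng)
    return f"pick up the {adjective} {noun}"


rng = random.Random(0)
usage = Counter()
instructions = [compose_instruction(usage, rng) for _ in range(200)]
print(instructions[:3])
print(dict(usage))
```

Because a word's weight shrinks each time it is used, the counts for all words tend to stay close to each other, which is the balancing effect the paragraph above describes.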

WM-H data synthesis pipeline: how raw prompts become 50k robot-trainable egocentric manipulation episodes.

Policy Learning with Human-Robot Alignment

Policy architecture and data composition. A VITRA-style policy denoises actions in the unified MANO space, conditioned on PaliGemma cognition features, FoV, and current hand state. Wh0 co-trains 50k WM-H samples with 400 teleoperated robot demonstrations (28% teleop, 68% WM-H, 4% WM-H w/ Embodiment Alignment).
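The data composition above can be read as per-sample mixing weights for co-training. The following sketch assumes exactly that interpretation (the page does not say whether the percentages are per batch, per step, or per epoch), and the source names are hypothetical labels rather than identifiers from the Wh0 codebase.

```python
import random

# Mixing weights quoted on the page, interpreted here as sampling probabilities.
MIXTURE = {
    "teleop": 0.28,
    "wm_h": 0.68,
    "wm_h_embodiment_aligned": 0.04,
}


def sample_sources(batch_size, rng):
    """Draw the data source for each element of a training batch."""
    names = list(MIXTURE)
    weights = [MIXTURE[name] for name in names]
    return rng.choices(names, weights=weights, k=batch_size)


rng = random.Random(0)
draws = sample_sources(10_000, rng)
fractions = {name: draws.count(name) / len(draws) for name in MIXTURE}
print(fractions)
```

In a real training loop, each drawn source name would select which dataset the next sample is read from; here the empirical fractions are only printed to check that they track the quoted percentages.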

Real-World Results

Method	Training Setup: Pretraining	Training Setup: Adaptation Data	Training Setup: Strategy	Success Rate (%) ↑
π_0.5	Robot	Teleop	FT	7.78 ± 15.6
VITRA	Human	Teleop	FT	8.3 ± 8.6
VITRA Real Version	Human	Teleop + Real Ego	Co-FT	21.4 ± 23.4
Wh0	Human	Teleop + WM-H	Co-FT	38.9 ± 19.8

Real-world evaluation and dexterous manipulation performance. Unitree G1 with Inspire hands and a head-mounted egocentric camera (teleop via Vision Pro); evaluation spans unseen objects and one seen plus three unseen backgrounds. We compare different pretraining sources and adaptation data under the same real-robot evaluation protocol. FT denotes fine-tuning on a single adaptation source, while Co-FT denotes joint fine-tuning on multiple data sources.
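The headline 4.7× figure can be checked directly against the table: it matches the ratio of the Wh0 success rate to that of VITRA fine-tuned on teleoperation data only. The short calculation below uses only the mean success rates listed in the table; the dictionary keys and the helper name are illustrative.

```python
# Mean success rates (%) copied from the results table.
SUCCESS_RATE = {
    "pi_0.5": 7.78,
    "VITRA": 8.3,
    "VITRA Real Version": 21.4,
    "Wh0": 38.9,
}


def relative_improvement(method: str, baseline: str) -> float:
    """Ratio of mean success rates between two methods."""
    return SUCCESS_RATE[method] / SUCCESS_RATE[baseline]


ratio_vs_vitra = relative_improvement("Wh0", "VITRA")
gain_vs_vitra = SUCCESS_RATE["Wh0"] - SUCCESS_RATE["VITRA"]

print(f"Wh0 vs VITRA: {ratio_vs_vitra:.1f}x, +{gain_vs_vitra:.1f} points")
```

The ratio rounds to 4.7, and the absolute gain is 30.6 percentage points. The reported standard deviations are large relative to the means, so the ratio is best read as a summary of the means rather than a tight estimate.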

BibTeX

@misc{wh0_2026,
  title={Wh0: Generative World Models as Scalable Sources of Egocentric Human Hand Manipulation Data},
  author={Yangtao Chen and Zixuan Chen and Peiyang Wang and Yong-Lu Li and Jing Huo and Jieqi Shi and Yang Gao},
  note={Under review},
  year={2026}
}

Wh0: Generative World Models as Scalable Sources of Egocentric Human Hand Manipulation Data

Under Review

Zero-Shot Real-World Dexterous Manipulation
