Wh0: Generative World Models as Scalable Sources of Egocentric Human Hand Manipulation Data

Under Review

Yangtao Chen*1,2, Zixuan Chen*2, Peiyang Wang*2, Yong-Lu Li†1,3, Jing Huo†2, Jieqi Shi2, Yang Gao2
1 Shanghai Innovation Institute 2 Nanjing University 3 Shanghai Jiao Tong University
* These authors contributed equally to this work. † Corresponding author.

Wh0 uses generative video world models to synthesize WM-H, a 50k-episode dataset of egocentric human-hand manipulation videos, and co-trains it with limited robot data to unlock dexterous manipulation capabilities in pretrained VLA models — improving zero-shot success from 8.3% to 38.9% across 18 real-world tasks.

50k WM-H Episodes · 4.7× Zero-Shot Improvement · 18 Real-World Tasks · 38.9% Success Rate (Wh0)

Zero-Shot Real-World Dexterous Manipulation

Abstract

Scaling dexterous manipulation requires generalization across objects, scenes, and tasks, yet existing data sources face a trade-off between scale and scene/embodiment alignment: teleoperation data is well aligned with robot deployment but expensive to collect; simulation is scalable but limited by the sim-to-real gap; and real egocentric videos scale effectively but remain misaligned with robot deployment. We propose Wh0, a framework that uses generative video world models as scalable and controllable sources of egocentric human-hand manipulation data to unlock the manipulation capabilities of pretrained dexterous VLA models. Conditioned on language, objects, and scenes, Wh0 uses a generative world model to produce WM-H, a 50k-episode dataset of egocentric human-object interaction videos. Wh0 then converts the generated videos into robot-trainable supervision through hand motion reconstruction and visual editing. Co-trained with a limited amount of real robot data, WM-H adapts pretrained VLA models to dexterous manipulation deployment. Across 18 real-world dexterous manipulation tasks, compared with a model post-trained only on robot data, Wh0 improves zero-shot success on unseen tasks from 8.3% to 38.9%. Ablation studies further show that scalable generation and scene/embodiment alignment are key drivers of performance gains.

Wh0 overview

Overview of Wh0. Top: WM-H provides world-model-generated egocentric manipulation videos with diverse objects, layouts, and hand-object interactions. Middle: WM-H uniquely combines scale with low scene & embodiment gap to deployment; Wh0 converts them to robot-trainable supervision and co-trains with limited robot data atop a human-video-pretrained VLA. Bottom: The resulting policy zero-shot generalizes to unseen tasks, environments, and instructions in real-world manipulation.

WM-H Dataset Construction

Instruction generation: Agent 1 (Vocabulary Discovery; e.g., "box", "hard", "clock", "white") → Agent 2 (Balanced Sampling) → instruction.
Scene construction: capture the robot workspace background, then insert objects with Qwen-Image-Edit.
Video generation: Wan-I2V with 4-step LightX2V; Qwen3-VL augmented_text.
Embodiment alignment: human hand → robot hand via Qwen-Image-Edit.
Hand motion reconstruction: HaWoR → MANO 3D pose.
Compute-driven scaling with zero human labor: 5.44 GPU-hrs per 1k videos, × 50 to scale from 1k to 50k episodes.
Properties: scene alignment, embodiment alignment, no labor, scales with GPUs.

Instruction Generation

A dual-agent LLM system discovers object nouns & adjectives, then preferentially samples under-represented words to compose balanced manipulation instructions.
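The balanced-sampling idea can be illustrated with a small sketch. This is not the authors' implementation: the inverse-frequency weighting, the function names, and the "pick up the ..." template are all assumptions made for illustration, and the vocabulary is just the four example words shown on this page.

```python
import random
from collections import Counter


def sample_underrepresented(vocab, usage, rng):
    """Pick one word, favoring words that have been used less often so far."""
    # Assumed weighting: a word used n times gets weight 1 / (1 + n).
    weights = [1.0 / (1 + usage[word]) for word in vocab]
    word = rng.choices(vocab, weights=weights, k=1)[0]
    usage[word] += 1
    return word


def compose_instruction(nouns, adjectives, usage, rng):
    """Compose one instruction from a sampled adjective and noun."""
    adjective = sample_underrepresented(adjectives, usage, rng)
    noun = sample_underrepresented(nouns, usage, rng)
    # Hypothetical template; the real system composes instructions with an LLM.
    return f"pick up the {adjective} {noun}"


rng = random.Random(0)
usage = Counter()  # running count of how often each word has been sampled
nouns = ["box", "clock"]
adjectives = ["hard", "white"]
instructions = [
    compose_instruction(nouns, adjectives, usage, rng) for _ in range(200)
]
```

Because each draw lowers the weight of the word just used, word counts stay close to uniform over many instructions, which is the property the second agent is described as enforcing.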

WM-H data synthesis pipeline: how raw prompts become 50k robot-trainable egocentric manipulation episodes.
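The compute figure quoted for the pipeline (5.44 GPU-hrs per 1k videos) can be turned into a rough total for the full dataset. The sketch below assumes one generated video per episode and cost that scales linearly with episode count; both are assumptions, and the GPU count used for the wall-clock estimate is purely hypothetical.

```python
GPU_HOURS_PER_1K_VIDEOS = 5.44  # figure quoted on this page
TARGET_EPISODES = 50_000        # size of WM-H


def total_gpu_hours(episodes, gpu_hours_per_1k=GPU_HOURS_PER_1K_VIDEOS):
    """Estimate total GPU-hours, assuming cost is linear in episode count."""
    return episodes / 1000 * gpu_hours_per_1k


def wall_clock_hours(gpu_hours, num_gpus):
    """Estimate wall-clock time, assuming perfect parallelism across GPUs."""
    return gpu_hours / num_gpus


total = total_gpu_hours(TARGET_EPISODES)
hypothetical_gpus = 8  # illustrative only; the page does not state a GPU count
print(f"{total:.0f} GPU-hrs in total")
print(f"{wall_clock_hours(total, hypothetical_gpus):.0f} hours on {hypothetical_gpus} GPUs")
```

Under these assumptions the full 50k episodes come to roughly 272 GPU-hours.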

Policy Learning with Human-Robot Alignment

Policy architecture

Policy architecture and data composition. A VITRA-style policy denoises actions in the unified MANO space, conditioned on PaliGemma cognition features, FoV, and current hand state. Wh0 co-trains 50k WM-H samples with 400 teleoperated robot demonstrations (28% teleop, 68% WM-H, 4% WM-H w/ Embodiment Alignment).
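One way to realize the stated data composition is to draw the source of each training sample according to the mixture ratios. The sketch below is a minimal illustration, not the authors' data loader: it assumes the percentages are per-sample sampling probabilities (400 demonstrations are far fewer than 50k WM-H samples, so the 28% teleop share presumably reflects upsampling), and the source names are hypothetical.

```python
import random

# Mixture ratios from the caption; the key names are illustrative.
MIXTURE = {
    "teleop": 0.28,
    "wm_h": 0.68,
    "wm_h_embodiment_aligned": 0.04,
}


def sample_batch_sources(batch_size, rng, mixture=MIXTURE):
    """Return the data source for each sample in a batch, drawn by mixture weight."""
    names = list(mixture)
    weights = [mixture[name] for name in names]
    return rng.choices(names, weights=weights, k=batch_size)


rng = random.Random(0)
sources = sample_batch_sources(10_000, rng)
# Empirical share of each source across the sampled batch.
fractions = {name: sources.count(name) / len(sources) for name in MIXTURE}
```

In a real training loop, each sampled source name would index into the corresponding dataset; here only the source selection is shown.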

Real-World Results

Real-world evaluation setup
| Method | Pretraining | Adaptation Data | Strategy | Success Rate (%) ↑ |
|---|---|---|---|---|
| π0.5 | Robot | Teleop | FT | 7.78±15.6 |
| VITRA | Human | Teleop | FT | 8.3±8.6 |
| VITRA Real Version | Human | Teleop + Real Ego | Co-FT | 21.4±23.4 |
| Wh0 | Human | Teleop + WM-H | Co-FT | 38.9±19.8 |

Real-world evaluation and dexterous manipulation performance. Unitree G1 with Inspire hands and a head-mounted egocentric camera (teleop via Vision Pro); evaluation spans unseen objects and one seen plus three unseen backgrounds. We compare different pretraining sources and adaptation data under the same real-robot evaluation protocol. FT denotes fine-tuning on a single adaptation source, while Co-FT denotes joint fine-tuning on multiple data sources.
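The headline 4.7× figure follows directly from the success rates reported above. The short check below uses only those reported means; the variable names are illustrative.

```python
# Mean zero-shot success rates (%) as reported in the results.
success_rate = {
    "VITRA (Teleop FT)": 8.3,
    "VITRA Real Version (Teleop + Real Ego Co-FT)": 21.4,
    "Wh0 (Teleop + WM-H Co-FT)": 38.9,
}

baseline = success_rate["VITRA (Teleop FT)"]
wh0 = success_rate["Wh0 (Teleop + WM-H Co-FT)"]

relative_gain = wh0 / baseline  # ratio of Wh0 to the robot-data-only baseline
absolute_gain = wh0 - baseline  # difference in percentage points

print(f"{relative_gain:.1f}x relative improvement")
print(f"{absolute_gain:.1f} percentage points absolute improvement")
```

The relative figure rounds to 4.7×, and the absolute gain is 30.6 percentage points. Given the large reported standard deviations, these comparisons of means are best read as indicative rather than precise.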

BibTeX

@misc{wh0_2026,
  title={Wh0: Generative World Models as Scalable Sources of Egocentric Human Hand Manipulation Data},
  author={Yangtao Chen and Zixuan Chen and Peiyang Wang and Yong-Lu Li and Jing Huo and Jieqi Shi and Yang Gao},
  note={Under review},
  year={2026}
}