RoboHiMan: A Hierarchical Evaluation Paradigm for Compositional Generalization in Long-Horizon Manipulation

Under Review

Yangtao Chen*1,4,5, Zixuan Chen*1,3, Nga Teng Chan2, Junting Chen3, Junhui Yin1, Jieqi Shi1, Yang Gao1, Yong-Lu Li†4,5, Jing Huo†1
1Nanjing University, 2The Hong Kong University of Science and Technology, 3National University of Singapore, 4Shanghai Jiao Tong University, 5Shanghai Innovation Institute
*Equal contribution, †Corresponding author


Real-world long-horizon manipulation faces compositional generalization challenges under perturbations.


RoboHiMan designs atomic tasks with diverse perturbation factors for both training and evaluation of models.


RoboHiMan designs long-horizon compositional skills with diverse perturbation factors that can be accomplished through the composition of atomic skills.

Abstract

Enabling robots to flexibly schedule and compose learned skills for novel long-horizon manipulation under diverse perturbations remains a core challenge. Early explorations with end-to-end VLA models show limited success, as these models struggle to generalize beyond the training distribution. Hierarchical approaches, where high-level planners generate subgoals for low-level policies, bring certain improvements but still suffer under complex perturbations, revealing limited capability in skill composition. However, existing benchmarks primarily emphasize task completion in long-horizon settings, offering little insight into compositional generalization, robustness, and the interplay between planning and execution. To systematically investigate these gaps, we propose RoboHiMan, a hierarchical evaluation paradigm for compositional generalization in long-horizon manipulation. RoboHiMan introduces HiMan-Bench, a benchmark of atomic and compositional tasks under diverse perturbations, supported by a multi-level training dataset for analyzing progressive data scaling, and proposes three evaluation paradigms (vanilla, decoupled, coupled) that probe the necessity of skill composition and reveal bottlenecks in hierarchical architectures. Experiments highlight clear capability gaps across representative models and architectures, pointing to directions for advancing models better suited to real-world long-horizon manipulation tasks.

Overview of RoboHiMan



RoboHiMan Overview. To evaluate compositional generalization, RoboHiMan introduces: (a) HiMan-Bench with four task types: atomic (A), atomic-perturbation (AP), compositional (C), and compositional-perturbation (CP). (b) A hierarchical evaluation paradigm with diverse metrics and progressive training data (L1-L4), where L1 uses minimal atomic data and L4 provides larger datasets. (c) Extensive experiments that highlight critical performance gaps across training datasets and evaluation modes, often overlooked by prior benchmarks. Here, the notation “X → Y” denotes training on Level X and evaluation on task category Y.
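The "X → Y" notation can be read as a grid of training levels against task categories. The snippet below is a minimal sketch of that grid, assuming only the level names (L1-L4) and category names (A, AP, C, CP) given above. The function names are hypothetical, and the page does not state that every pairing is reported.

```python
from itertools import product
from typing import List

# Training levels and task categories named in the overview above.
TRAIN_LEVELS = ["L1", "L2", "L3", "L4"]
TASK_CATEGORIES = ["A", "AP", "C", "CP"]


def setting_label(level: str, category: str) -> str:
    """Format one train/eval pairing in the 'X → Y' notation (hypothetical helper)."""
    if level not in TRAIN_LEVELS:
        raise ValueError(f"unknown training level: {level}")
    if category not in TASK_CATEGORIES:
        raise ValueError(f"unknown task category: {category}")
    return f"{level} → {category}"


def evaluation_grid() -> List[str]:
    """Enumerate every level/category pairing, levels varying slowest."""
    return [setting_label(lvl, cat) for lvl, cat in product(TRAIN_LEVELS, TASK_CATEGORIES)]


if __name__ == "__main__":
    for label in evaluation_grid():
        print(label)
```

For example, "L1 → CP" would mean training on the minimal atomic data and evaluating on compositional tasks under perturbation.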



Overview of Hierarchical Evaluation Protocol



RoboHiMan evaluation paradigm. Three settings are considered: (1) Vanilla: the low-level policy executes tasks directly from instructions without a planner; (2) Decoupled: planner and policy are evaluated separately, using either a rule-based planner (online) or a VLM-based planner (offline); (3) Coupled: a full hierarchical setup where a VLM-based planner generates subgoals online and the low-level policy executes them.
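The three settings can be sketched as different ways of wiring a planner and a policy together. The code below is an illustrative toy, not the benchmark's implementation: the `Policy` and `Planner` interfaces, the per-step offline accuracy metric, the toy task, and the stub planners are all assumptions made for this example.

```python
from typing import Callable, Dict, List, Optional, Sequence

# Hypothetical interfaces (assumptions, not the benchmark's API).
Policy = Callable[[str], bool]                       # run one instruction, report success
Planner = Callable[[str, List[str]], Optional[str]]  # next subgoal, or None when finished


def run_vanilla(policy: Policy, task: str) -> bool:
    """Vanilla: the policy receives the task instruction directly, no planner."""
    return policy(task)


def run_online(policy: Policy, planner: Planner, task: str, max_steps: int = 10) -> bool:
    """Online: the planner proposes subgoals one at a time and the policy executes each."""
    done: List[str] = []
    for _ in range(max_steps):
        subgoal = planner(task, done)
        if subgoal is None:      # planner declares the task complete
            return True
        if not policy(subgoal):  # a low-level failure ends the episode
            return False
        done.append(subgoal)
    return False                 # step budget exhausted


def offline_plan_accuracy(planner: Planner, task: str, reference: Sequence[str]) -> float:
    """Offline: score next-subgoal predictions against a reference plan, without execution."""
    targets: List[Optional[str]] = list(reference) + [None]
    hits = sum(planner(task, list(reference[:i])) == t for i, t in enumerate(targets))
    return hits / len(targets)


def fixed_plan_planner(plans: Dict[str, List[str]]) -> Planner:
    """Build a planner that replays a stored plan (stands in for rule-based or VLM planners)."""
    def planner(task: str, done: List[str]) -> Optional[str]:
        plan = plans[task]
        return plan[len(done)] if len(done) < len(plan) else None
    return planner


# Toy setup: the policy only succeeds on atomic instructions it "knows".
ATOMIC = {"open drawer", "pick block", "place block in drawer"}
TASK = "put block in drawer"
REFERENCE = ["open drawer", "pick block", "place block in drawer"]


def toy_policy(instruction: str) -> bool:
    return instruction in ATOMIC


rule_planner = fixed_plan_planner({TASK: REFERENCE})
# Stub for a VLM planner that gets the second step wrong.
vlm_stub = fixed_plan_planner({TASK: ["open drawer", "pick drawer", "place block in drawer"]})

results = {
    "vanilla": run_vanilla(toy_policy, TASK),
    "decoupled_rule_online": run_online(toy_policy, rule_planner, TASK),
    "decoupled_vlm_offline": offline_plan_accuracy(vlm_stub, TASK, REFERENCE),
    "coupled": run_online(toy_policy, vlm_stub, TASK),
}
```

In this toy, the decoupled settings isolate each component (the policy under a correct plan, the planner without execution), while the coupled run fails as soon as either component makes a mistake.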



Scaling Effects of Training Data

Scaling atomic-skill data improves atomic performance, and since robust atomic execution is a prerequisite, it also benefits compositional tasks.
Increasing both the quantity and diversity of compositional-skill training data further enhances model performance on compositional tasks. However, the overall success rate remains low, even though all compositional tasks can, in principle, be solved by combining the atomic skills the model has learned.

Figure: Performance on atomic tasks and on compositional tasks for training levels L1 to L4.

Compositional Task Completion under Disturbances

For compositional tasks with various perturbations, including perturbed compositional-skill data in training effectively improves the model's robustness in performing compositional tasks. In comparison, adding perturbed atomic-skill data alone provides only limited gains in robustness.
Both data design and architectural inductive biases (e.g., keyframe selection, 3D information integration) contribute to improved generalization and robustness.

Figure: Radar charts of performance by perturbation factor for RVT-2, 3D-Diffuser-Actor, Pi0, and Pi0.5.
Bottlenecks in Hierarchical Architectures

Explicit planning is essential, as it supports robust skill composition and underscores the role of hierarchical reasoning in complex long-horizon tasks.
The bottlenecks of hierarchical architectures stem from three main issues: (i) the high-level planner may generate incorrect plans; (ii) the low-level policy may fail during execution; and (iii) if the hierarchical system is not properly designed, failures at the high or low level are not effectively handled, leading to error accumulation and eventual task failure.
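A back-of-the-envelope model shows why unhandled failures compound. The sketch below assumes, purely for illustration, that each subgoal is planned correctly with a fixed probability, executed successfully with a fixed probability, that steps are independent, and that there is no recovery. None of these assumptions or numbers come from the benchmark.

```python
import math


def chain_success(planner_acc: float, policy_succ: float, num_subgoals: int) -> float:
    """Probability that every subgoal is both planned and executed correctly.

    Assumes independent steps and no recovery (illustrative assumptions only).
    """
    if not (0.0 <= planner_acc <= 1.0 and 0.0 <= policy_succ <= 1.0):
        raise ValueError("probabilities must lie in [0, 1]")
    if num_subgoals < 0:
        raise ValueError("num_subgoals must be non-negative")
    return (planner_acc * policy_succ) ** num_subgoals


if __name__ == "__main__":
    # Made-up numbers: 90% per-step planner accuracy and policy success.
    for n in (1, 2, 4, 8):
        print(n, round(chain_success(0.9, 0.9, n), 3))
```

Under these assumptions, 90% per-step reliability at both levels gives roughly 0.43 over four subgoals, which illustrates why error handling between the two levels matters as much as the quality of either one.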


Figure: Task performance under the Vanilla, Rule-based, and VLM-based settings.

BibTeX

@article{chen2025robohiman,
  title={RoboHiMan: A hierarchical evaluation paradigm for compositional generalization in long-horizon manipulation},
  author={Chen, Yangtao and Chen, Zixuan and Chan, Nga Teng and Chen, Junting and Yin, Junhui and Shi, Jieqi and Gao, Yang and Li, Yong-Lu and Huo, Jing},
  journal={arXiv preprint arXiv:2510.13149},
  year={2025}
}