Open-Vocabulary 3D Instruction Ambiguity Detection

Jiayu Ding, Haoran Tang, Ge Li*
School of Electronic and Computer Engineering, Peking University
{jyding25, hrtang}@stu.pku.edu.cn, geli@pku.edu.cn
*Corresponding Author
Ambiguity Example

This high-stakes scenario highlights a critical safety challenge where an ambiguous instruction ("Pass me the vial") forces a robot to choose between a harmless substance and a lethal one.

Abstract

In safety-critical domains, linguistic ambiguity can have severe consequences; a vague command like "Pass me the vial" in a surgical setting could lead to catastrophic errors. Yet, most embodied AI research overlooks this, assuming instructions are clear and focusing on execution rather than confirmation. To address this critical safety gap, we are the first to define Open-Vocabulary 3D Instruction Ambiguity Detection, a fundamental new task where a model must determine if a command has a single, unambiguous meaning within a given 3D scene.

To support this research, we build Ambi3D, a large-scale benchmark for this task featuring over 700 diverse 3D scenes and around 22k instructions. Our analysis reveals a surprising limitation: state-of-the-art 3D Large Language Models (LLMs) struggle to reliably determine whether an instruction is ambiguous.
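For concreteness, the sketch below shows what a single instance of this task could look like: the model receives a 3D scene and an instruction, and must output a binary ambiguity judgment. The field names and values are purely illustrative assumptions and do not reflect the actual Ambi3D schema.

    # Hypothetical illustration of one task instance (not the actual Ambi3D schema).
    example_record = {
        "scene_id": "scene_0042",            # reference to a 3D scan of the environment
        "instruction": "Pass me the vial",   # open-vocabulary natural-language command
        "label": "ambiguous",                # ground truth: "ambiguous" or "unambiguous"
        "candidate_referents": [             # objects the instruction could plausibly denote
            "vial of saline on the tray",
            "vial of potassium chloride on the shelf",
        ],
    }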

To address this challenge, we propose AmbiVer, a two-stage framework that collects explicit visual evidence from multiple views and uses it to guide a vision-language model (VLM) in judging instruction ambiguity. Extensive experiments demonstrate the difficulty of our task and the effectiveness of AmbiVer, paving the way for safer and more trustworthy embodied AI. Our dataset and code will be made publicly available.

Motivation

The reliability of an agent's interaction with the physical world depends heavily on how precisely it understands human instructions. Current research in embodied intelligence has predominantly centered on "grounding" language in vision and subsequent "execution." While successful, this paradigm assumes instructions are unambiguous, introducing significant latent risks.

For example, in a smart home context, if asked "Was the stove in the kitchen turned off?", a model might check the main stovetop and answer "Yes," overlooking a portable induction cooktop still operating in a corner. This creates a false sense of security. Fundamentally, these models are designed to find the "right answer" to an input assumed to be valid, lacking an intrinsic mechanism to identify when the input itself is a "bad question."

A truly reliable system must possess the ability to recognize such ambiguity and proactively seek clarification, rather than blindly guessing. To systematically address this, we introduce the task of Open-Vocabulary 3D Instruction Ambiguity Detection.

The Framework

We propose AmbiVer (Ambiguity Verifier), a two-stage framework that decouples scene perception from logical reasoning.

  • Perception Stage: Converts the raw scene and instruction into a set of structured evidence, geometrically unifying potential referents across multiple views.
  • Reasoning Stage: Passes this structured evidence to a zero-shot VLM for logical adjudication.
AmbiVer Method

Overview of the AmbiVer two-stage framework.
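To make the two-stage decomposition concrete, the following Python sketch shows how such a pipeline could be wired together. The Detection format, the greedy geometric merging, and the prompt are illustrative assumptions rather than the actual AmbiVer implementation; the VLM is abstracted as a plain text-in, text-out callable.

    from dataclasses import dataclass
    from typing import Callable, List

    @dataclass
    class Detection:
        """One per-view detection of a candidate referent (hypothetical format)."""
        label: str            # open-vocabulary object label, e.g. "vial"
        center: tuple         # (x, y, z) position in a shared world frame
        description: str      # short textual description used as evidence

    def unify_across_views(detections: List[Detection],
                           dist_thresh: float = 0.15) -> List[Detection]:
        """Merge detections of the same physical object seen from different views.

        A simple greedy clustering on 3D centers stands in for the geometric
        unification step; the real association procedure may differ.
        """
        merged: List[Detection] = []
        for det in detections:
            is_duplicate = any(
                sum((a - b) ** 2 for a, b in zip(det.center, m.center)) ** 0.5 < dist_thresh
                for m in merged
            )
            if not is_duplicate:
                merged.append(det)
        return merged

    def judge_ambiguity(instruction: str,
                        detections: List[Detection],
                        vlm: Callable[[str], str]) -> str:
        """Reasoning stage: hand the structured evidence to a zero-shot VLM."""
        referents = unify_across_views(detections)   # perception output -> evidence
        evidence = "\n".join(f"- {r.description}" for r in referents)
        prompt = (
            f"Instruction: {instruction}\n"
            f"Distinct candidate targets found in the scene: {len(referents)}\n"
            f"{evidence}\n"
            "Does the instruction refer to exactly one target? "
            "Answer 'unambiguous' or 'ambiguous'."
        )
        return vlm(prompt)

    # Example: two vials detected across views -> the instruction should be ambiguous.
    detections = [
        Detection("vial", (0.10, 0.25, 0.90), "clear vial labeled saline, on the tray"),
        Detection("vial", (0.85, 0.20, 0.90), "clear vial labeled potassium chloride, on the shelf"),
    ]
    # judge_ambiguity("Pass me the vial", detections, vlm=my_vlm)  # my_vlm: hypothetical VLM callable

Keeping the VLM behind a plain text interface mirrors the zero-shot adjudication idea: the perception stage does all scene-specific work, so the reasoning stage requires no 3D-specific fine-tuning.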

Qualitative Results

We visualize the detection results of AmbiVer across diverse scenarios in the Ambi3D benchmark. Green highlights indicate correctly identified ambiguous targets, while red indicates failure cases or baseline errors.

Qualitative Results

Qualitative results of our AmbiVer framework on the Ambi3D benchmark.

BibTeX

@article{AmbiVer2026,
  title={AmbiVer: Open-Vocabulary 3D Instruction Ambiguity Detection},
  author={Ding, Jiayu and Tang, Haoran and Li, Ge},
  journal={arXiv preprint arXiv:26xx.xxxxx},
  year={2026}
}