Open-Vocabulary 3D Instruction Ambiguity Detection
Abstract
In safety-critical domains, linguistic ambiguity can have severe consequences; a vague command like "Pass me the vial" in a surgical setting could lead to catastrophic errors. Yet most embodied AI research overlooks this, assuming instructions are clear and focusing on execution rather than confirmation. To close this critical safety gap, we are the first to define Open-Vocabulary 3D Instruction Ambiguity Detection, a fundamental new task in which a model must determine whether a command has a single, unambiguous meaning within a given 3D scene.
To support this research, we build Ambi3D, the first large-scale benchmark for this task, featuring over 700 diverse 3D scenes and around 22k instructions. Our analysis reveals a surprising limitation: state-of-the-art 3D Large Language Models (LLMs) struggle to reliably determine whether an instruction is ambiguous.
To address this challenge, we propose AmbiVer, a two-stage framework that collects explicit visual evidence from multiple views and uses it to guide a vision-language model (VLM) in judging instruction ambiguity. Extensive experiments demonstrate the difficulty of our task and the effectiveness of AmbiVer, paving the way for safer and more trustworthy embodied AI. Our dataset and code will be made publicly available.
Motivation
An agent's reliability when interacting with the physical world depends heavily on how precisely it understands human instructions. Current research in embodied intelligence has predominantly centered on "grounding" language in vision and on subsequent "execution." While successful, this paradigm assumes instructions are unambiguous, introducing significant latent risks.
For example, in a smart home context, if asked "Was the stove in the kitchen turned off?", a model might check the main stovetop and answer "Yes," overlooking a portable induction cooktop still operating in a corner. This creates a false sense of security. Fundamentally, these models are designed to find the "right answer" to an input assumed to be valid, lacking an intrinsic mechanism to identify when the input itself is a "bad question."
A truly reliable system must possess the ability to recognize such ambiguity and proactively seek clarification, rather than blindly guessing. To systematically address this, we introduce the task of Open-Vocabulary 3D Instruction Ambiguity Detection.
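For concreteness, the task admits a simple formalization. The notation below is ours, stated only as a sketch of the problem, not taken from the paper: given a 3D scene $S$ and an instruction $I$, let $R(S, I)$ denote the set of objects in $S$ that are valid referents of $I$. The model must predict

```latex
% Sketch of the problem statement; notation ours, for illustration only.
\[
  y(S, I) =
  \begin{cases}
    \text{unambiguous}, & \text{if } \lvert R(S, I) \rvert = 1, \\
    \text{ambiguous},   & \text{otherwise.}
  \end{cases}
\]
```

Under this reading, both zero referents and multiple referents count as ambiguous, matching the requirement of a single, unambiguous meaning.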
The Framework
We propose AmbiVer (Ambiguity Verifier), a two-stage framework that decouples scene perception from logical reasoning.
- Perception Stage: Converts the raw scene and instruction into structured evidence, geometrically unifying potential referents across multiple views.
- Reasoning Stage: Passes this structured evidence to a zero-shot VLM for logical adjudication (a sketch of both stages follows below).
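To make the pipeline concrete, here is a minimal sketch of the two stages in Python. Everything below is illustrative: the `Referent` structure, the centroid-distance merge used for cross-view unification, and the stub VLM call are our assumptions, not the authors' actual implementation or API.

```python
# Minimal sketch of a two-stage ambiguity-detection pipeline.
# All names here (Referent, unify_across_views, judge_ambiguity) are
# hypothetical stand-ins, not AmbiVer's real API.
from dataclasses import dataclass

@dataclass
class Referent:
    label: str       # open-vocabulary category matched to the instruction
    centroid: tuple  # (x, y, z) in the scene's world frame
    view_ids: list   # views in which this candidate was observed

def unify_across_views(candidates, merge_dist=0.25):
    """Perception stage: merge per-view detections whose 3D centroids lie
    within `merge_dist` meters, yielding scene-level referents.
    (A simple greedy merge, assumed here for illustration.)"""
    merged = []
    for c in candidates:
        for m in merged:
            dist = sum((a - b) ** 2 for a, b in zip(c.centroid, m.centroid)) ** 0.5
            if dist < merge_dist:
                m.view_ids.extend(c.view_ids)  # same object seen again
                break
        else:
            merged.append(Referent(c.label, c.centroid, list(c.view_ids)))
    return merged

def judge_ambiguity(instruction, referents, vlm):
    """Reasoning stage: hand the structured evidence to a zero-shot VLM
    and ask whether the instruction picks out exactly one referent."""
    evidence = "\n".join(
        f"- {r.label} at {r.centroid}, seen in views {r.view_ids}"
        for r in referents
    )
    prompt = (f"Instruction: {instruction}\n"
              f"Candidate referents:\n{evidence}\n"
              "Does the instruction refer to exactly one candidate? "
              "Answer 'unambiguous' or 'ambiguous'.")
    return vlm(prompt)

# Toy usage with a stub VLM: two distinct vials -> ambiguous.
if __name__ == "__main__":
    candidates = [
        Referent("vial", (1.00, 0.20, 0.90), [0]),
        Referent("vial", (1.02, 0.21, 0.90), [1]),  # same vial, second view
        Referent("vial", (2.50, 0.20, 0.90), [1]),  # a different vial
    ]
    referents = unify_across_views(candidates)
    stub_vlm = lambda p: "ambiguous" if len(referents) > 1 else "unambiguous"
    print(judge_ambiguity("Pass me the vial", referents, stub_vlm))
```

In the real system the candidates would come from an open-vocabulary detector run per view, and `vlm` would be a zero-shot vision-language model that also sees the cropped views; the stub here only exercises the control flow.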
Figure: Overview of the AmbiVer two-stage framework.
Qualitative Results
We visualize the detection results of AmbiVer across diverse scenarios in the Ambi3D benchmark. Green highlights indicate correctly identified ambiguous targets, while red indicates failure cases or baseline errors.
Figure: Qualitative results of our AmbiVer framework on the Ambi3D benchmark.
BibTeX
@article{AmbiVer2026,
  title={AmbiVer: Open-Vocabulary 3D Instruction Ambiguity Detection},
  author={Ding, Jiayu and Tang, Haoran and Li, Ge},
  journal={arXiv preprint arXiv:26xx.xxxxx},
  year={2026}
}