Environment Question Answering by Structured Scene Descriptions via Vision-Language Integration for Autonomous Systems
conference paper
Effective environment representation is critical for autonomous systems to interpret and interact with the world. Recent advances in vision-language models (VLMs) enable deeper understanding of the environment through verifiable structured output, offering a powerful means to capture both semantic and contextual information about an environment. In this paper, we propose to leverage state-of-the-art VLMs to create comprehensive and interpretable environment representations. Our proposed approach translates visual observations into detailed, structured semantic scene descriptions. These descriptions can be grounded spatially to represent large environments, potentially enabling their use as contextual input for large language models (LLMs) in tasks such as planning, exploration, and environment question answering. We present qualitative examples of the first part of this pipeline, illustrating the transformation of visual information into rich structured information for decision makers and LLM context. Special emphasis is placed on examples from the security domain, including medical emergencies, chemical hazards, and fire incidents. Our preliminary findings underscore the considerable potential of vision-language integration as an essential tool for advanced environmental understanding and decision-making in diverse operational contexts.
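To make the idea of a structured semantic scene description more concrete, the following minimal Python sketch shows one possible form such a description could take and how it might be serialized as LLM context. The schema, field names, and the assemble_llm_context helper are illustrative assumptions, not the representation used in the paper.

# Illustrative sketch only: schema and helper names are assumptions,
# not the paper's actual representation.
from dataclasses import dataclass, field, asdict
from typing import List
import json

@dataclass
class DetectedEntity:
    label: str                 # e.g. "chemical drum"
    attributes: List[str]      # e.g. ["leaking", "corroded"]
    position: List[float]      # hypothetical spatial grounding (x, y, z in metres)

@dataclass
class SceneDescription:
    location: str              # free-text or map reference
    hazards: List[str]         # e.g. ["chemical spill"]
    entities: List[DetectedEntity] = field(default_factory=list)

def assemble_llm_context(scenes: List[SceneDescription]) -> str:
    """Serialize structured scene descriptions as JSON for use as LLM context."""
    return json.dumps([asdict(s) for s in scenes], indent=2)

# Example: a single scene from a hypothetical chemical-hazard scenario.
scene = SceneDescription(
    location="warehouse bay 3",
    hazards=["chemical spill"],
    entities=[DetectedEntity("chemical drum", ["leaking", "corroded"], [4.2, 1.1, 0.0])],
)
print(assemble_llm_context([scene]))

In such a setup, the JSON output would be appended to an LLM prompt so that downstream question answering or planning can reference the detected hazards and their spatial locations.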
TNO Identifier
1020249
Publisher
SPIE
Source title
Autonomous Systems for Security and Defence II, Proceedings of SPIE, 15–16 September 2025, Madrid, Spain
Editor(s)
Kampmeijer, L.
Masini, B.
Milosevicl, Z.
Files
To receive the publication files, please send an e-mail request to TNO Repository.