Coarse-to-Fine Visual Question Answering by Iterative, Conditional Refinement
conference paper
Visual Question Answering (VQA) is the task of answering natural language questions about an image. Recent methods have focused on incorporating knowledge into an improved VQA model, by augmenting the training set, representing scene graphs, or including reasoning. We also leverage knowledge to make VQA more robust, but we take a different route: we use the VQA model as-is and extend it with a novel algorithm, Guided-VQA, that guides the questioning by leveraging knowledge to obtain better answers. This enables knowledge-extended VQA without retraining the VQA model, which is beneficial when computing resources and/or time to adapt to new knowledge are limited. We start from the observation that VQA has difficulty answering compositional and fine-grained questions. We propose to address this with a coarse-to-fine scheme of posing questions. The proposed Guided-VQA algorithm is an iterative, conditional refinement that decomposes a compositional, fine-grained question into a sequence of coarse-to-fine questions by leveraging taxonomic knowledge about the objects involved. On Visual Genome, we show that it improves the answers significantly over standard VQA.
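The coarse-to-fine idea described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the toy taxonomy, the question template, the stopping rule (stop at the first answer other than "yes"), and the names `coarse_to_fine_path`, `guided_vqa`, and `stub_vqa` are all assumptions made for illustration. The VQA model is treated as a black-box callable, in line with the abstract's statement that the model is used as-is.

```python
from typing import Callable, Dict, List, Optional, Set, Tuple

# Hypothetical toy taxonomy (child -> parent); not taken from the paper.
TAXONOMY: Dict[str, Optional[str]] = {"animal": None, "dog": "animal", "puppy": "dog"}


def coarse_to_fine_path(label: str, taxonomy: Dict[str, Optional[str]]) -> List[str]:
    """Return the concepts from the taxonomy root down to `label`."""
    path: List[str] = []
    node: Optional[str] = label
    while node is not None:
        path.append(node)
        node = taxonomy.get(node)  # None once the root (or an unknown label) is reached
    return path[::-1]


def guided_vqa(
    image: object,
    target: str,
    vqa: Callable[[object, str], str],
    taxonomy: Dict[str, Optional[str]],
) -> Tuple[List[Tuple[str, str]], Optional[str]]:
    """Ask about each concept from coarse to fine, refining only while the answer is 'yes'.

    Returns the (question, answer) trace and the finest confirmed concept.
    The question template and stopping rule are assumptions of this sketch.
    """
    trace: List[Tuple[str, str]] = []
    finest: Optional[str] = None
    for concept in coarse_to_fine_path(target, taxonomy):
        question = f"Is there any {concept} in the image?"
        answer = vqa(image, question).strip().lower()
        trace.append((question, answer))
        if answer != "yes":
            break  # conditional refinement: do not ask finer questions
        finest = concept
    return trace, finest


def stub_vqa(image: Set[str], question: str) -> str:
    """Stand-in for a real VQA model: the 'image' is just a set of visible concepts."""
    return "yes" if any(f" {c} " in question for c in image) else "no"


trace, finest = guided_vqa({"animal", "dog"}, "puppy", stub_vqa, TAXONOMY)
print(finest)  # prints "dog"
```

In this toy run the fine-grained question about a puppy is not answered directly. The sketch confirms "animal", then "dog", and stops when "puppy" is rejected, so the finest confirmed concept is "dog". A real deployment would replace `stub_vqa` with an actual VQA model and the toy dictionary with a proper taxonomy.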
TNO Identifier
970942
ISSN
03029743
ISBN
9783031064296
Publisher
Springer Science and Business Media Deutschland GmbH
Source title
Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), 21st International Conference on Image Analysis and Processing, ICIAP 2022, 23 May 2022 through 27 May 2022
Pages
418-428