Robust visual question answering via semantic cross-modal augmentation
Journal contribution, posted on 2023-11-06, 00:30. Authored by A. Mashrur, W. Luo, Nayyar Zaidi, Antonio Robles-Kelly
Recent advances in vision-language models have improved accuracy on visual question answering (VQA) tasks. However, their robustness remains limited when faced with out-of-distribution data containing unanswerable questions. In this study, we first construct a simple randomised VQA dataset, incorporating unanswerable questions from the VQA v2 dataset, to evaluate the robustness of a state-of-the-art VQA model. Our findings reveal that the model struggles to predict the "unknown" answer, and instead provides inaccurate responses with high confidence scores for irrelevant questions. To address this issue without retraining the large backbone models, we propose Cross Modal Augmentation (CMA), a model-agnostic, test-time-only, multi-modal semantic augmentation technique. CMA generates multiple semantically consistent but heterogeneous instances from the visual and textual inputs, which are then fed to the model, and the resulting predictions are combined to achieve a more robust output. We demonstrate that CMA enables the VQA model to provide more reliable answers in scenarios involving unanswerable questions, and show that the approach generalises across different categories of pre-trained vision-language models.
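The abstract describes CMA as generating multiple semantically consistent variants of both inputs and averaging the model's predictions over them. A minimal sketch of that test-time ensembling idea follows; the augmentation functions and the toy VQA model are illustrative assumptions, not the authors' implementation.

```python
# Sketch of test-time cross-modal augmentation ensembling in the
# spirit of CMA. All function names here are hypothetical stand-ins.
import numpy as np

def augment_text(question: str) -> list[str]:
    # Hypothetical semantic-preserving variants of the question.
    return [question, question.lower(), question.rstrip("?") + " ?"]

def augment_image(image: np.ndarray) -> list[np.ndarray]:
    # Hypothetical label-preserving visual variants (e.g. a horizontal flip).
    return [image, np.fliplr(image)]

def toy_vqa_model(image: np.ndarray, question: str, n_answers: int = 4) -> np.ndarray:
    # Stand-in for a pre-trained VQA backbone: returns a probability
    # distribution over a fixed answer vocabulary (last slot = "unknown").
    seed = abs(hash((image.tobytes(), question))) % (2**32)
    logits = np.random.default_rng(seed).normal(size=n_answers)
    p = np.exp(logits - logits.max())
    return p / p.sum()

def cma_predict(image: np.ndarray, question: str) -> np.ndarray:
    # Run the model on every (image variant, question variant) pair and
    # average the predicted distributions for a more robust output.
    preds = [toy_vqa_model(img, q)
             for img in augment_image(image)
             for q in augment_text(question)]
    return np.mean(np.stack(preds), axis=0)

probs = cma_predict(np.zeros((4, 4)), "What color is the cat?")
```

Averaging over heterogeneous yet semantically consistent inputs tends to flatten spurious high-confidence predictions, which is consistent with the paper's goal of more reliable answers for unanswerable questions.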
Journal: Computer Vision and Image Understanding
Location: Amsterdam, The Netherlands