Existing Large Vision-Language Models (LVLMs) excel at matching concepts across multi-modal inputs but struggle with compositional concepts and high-level relationships between entities. This paper introduces Progressive multi-granular Vision-Language alignments (PromViL), a novel framework to enhance LVLMs' ability to perform grounded compositional visual reasoning tasks. Our approach constructs a hierarchical structure of multi-modal alignments, ranging from simple to complex concepts. By progressively aligning textual descriptions with their corresponding visual regions, our model learns to leverage contextual information from lower levels to inform higher-level reasoning. To facilitate this learning process, we introduce a data generation process that creates a novel dataset derived from Visual Genome, providing a wide range of nested compositional vision-language pairs. Experimental results demonstrate that our PromViL framework significantly outperforms baselines on various visual grounding and compositional question answering tasks.
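The progressive alignment idea described above can be sketched concretely. Below is a minimal, hypothetical PyTorch sketch, not the authors' implementation: the function names (align_level, progressive_grounding) and the additive conditioning on lower-level context are illustrative assumptions. It grounds a nested phrase hierarchy from simple to complex, feeding the pooled visual context of each level into the next level's query.

```python
# Illustrative sketch only; hypothetical names, not the paper's code.
import torch
import torch.nn.functional as F

def align_level(text_emb, region_embs, prev_context=None):
    """Align one granularity level of a phrase to candidate regions.

    text_emb:     (d,)   embedding of the phrase at the current level
    region_embs:  (R, d) embeddings of candidate visual regions
    prev_context: (d,)   pooled context from the simpler level below, if any
    """
    if prev_context is not None:
        # The "progressive" step (assumed form): lower-level grounded
        # context conditions the current level's query.
        text_emb = text_emb + prev_context
    scores = F.softmax(region_embs @ text_emb, dim=0)  # (R,) region weights
    context = scores @ region_embs                     # (d,) grounded context
    return scores, context

def progressive_grounding(phrase_embs, region_embs):
    """Ground a nested phrase hierarchy ordered simple -> complex,
    e.g. ["dog", "dog on a skateboard", "man watching a dog on a skateboard"]."""
    context = None
    for text_emb in phrase_embs:
        scores, context = align_level(text_emb, region_embs, context)
    return scores  # alignment of the most complex phrase
```

The design point mirrored here is that each level's alignment is conditioned on the grounded context of the level beneath it, so complex compositions inherit evidence from their simpler constituents.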
Volume: 39
Pagination: 4473-4481
Location: Philadelphia, PA
Open access: No
Start date: 2025-02-25
End date: 2025-03-04
ISSN: 2159-5399
eISSN: 2374-3468
Language: eng
Publication classification: E1 Full written paper - refereed
Title of proceedings: AAAI-25: Proceedings of the 39th AAAI Conference on Artificial Intelligence 2025