Deakin University

Progressive Multi-granular Alignments for Grounded Reasoning in Large Vision-Language Models

conference contribution
posted on 2025-05-22, 00:46 authored by QH Le, LH Dang, NH Le, Truyen Tran, Thao Minh Le
Existing Large Vision-Language Models (LVLMs) excel at matching concepts across multi-modal inputs but struggle with compositional concepts and high-level relationships between entities. This paper introduces Progressive multi-granular Vision-Language alignments (PromViL), a novel framework to enhance LVLMs' ability to perform grounded compositional visual reasoning tasks. Our approach constructs a hierarchical structure of multi-modal alignments, ranging from simple to complex concepts. By progressively aligning textual descriptions with corresponding visual regions, our model learns to leverage contextual information from lower levels to inform higher-level reasoning. To facilitate this learning process, we introduce a data generation process that creates a novel dataset derived from Visual Genome, providing a wide range of nested compositional vision-language pairs. Experimental results demonstrate that our PromViL framework significantly outperforms baselines on various visual grounding and compositional question answering tasks.

Volume

39

Pagination

4473-4481

Location

Philadelphia, PA

Open access

No

Start date

2025-02-25

End date

2025-03-04

ISSN

2159-5399

eISSN

2374-3468

Language

eng

Publication classification

E1 Full written paper - refereed

Title of proceedings

AAAI-25 : Proceedings of the 39th AAAI Conference on Artificial Intelligence 2025

Event

AAAI Conference on Artificial Intelligence. (39th : 2025 : Philadelphia, PA)

Issue

4

Publisher

Association for the Advancement of Artificial Intelligence (AAAI)

Place of publication

Washington, DC
