Deakin University

Planner-Refiner: Dynamic Space-Time Refinement for Vision-Language Alignment in Videos

Version 2 2025-11-04, 22:22
Version 1 2025-10-28, 03:20
conference contribution
posted on 2025-11-04, 22:22 authored by Tuyen Tran, Thao Minh Le, Quang-Hung Le, Truyen Tran
Vision-language alignment in video must address the complexity of language, evolving and interacting entities, their action chains, and the semantic gaps between language and vision. This work introduces Planner-Refiner, a framework to overcome these challenges. Planner-Refiner bridges the semantic gap by iteratively refining the space-time representation of visual elements, guided by language, until the semantic gaps are minimal. A Planner module schedules language guidance by decomposing complex linguistic prompts into chains of short sentences. The Refiner processes each short sentence (a noun-phrase and verb-phrase pair) to direct the visual tokens’ self-attention across space and then time, achieving efficient single-step refinement. A recurrent system chains these steps, maintaining the refined visual token representations. The final representation feeds into task-specific heads for alignment generation. We demonstrate Planner-Refiner’s effectiveness on two video-language alignment tasks with varying language complexity: Referring Video Object Segmentation and Temporal Grounding. We further introduce a new benchmark, MeViS-X, to assess models’ capability with long queries. Superior performance versus state-of-the-art methods on these benchmarks shows the approach’s potential, especially for complex prompts.
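The control flow described in the abstract can be illustrated with a small, dependency-free Python sketch. It is not the authors' implementation: the function names, the additive way the language vector biases each query, the assignment of the noun phrase to spatial attention and the verb phrase to temporal attention, and the absence of learned weights are all assumptions made for illustration. The Planner's decomposition of the prompt is also not modelled; the plan is assumed to arrive as a list of already-embedded (noun-phrase, verb-phrase) vector pairs.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def guided_attention(tokens, guide):
    """Self-attention over `tokens` with every query shifted by `guide`.

    Hypothetical stand-in for language-guided attention: keys and values
    are the raw tokens, and there are no learned projections.
    """
    scale = math.sqrt(len(guide))
    refined = []
    for query in tokens:
        biased = [q + g for q, g in zip(query, guide)]
        weights = softmax([dot(biased, key) / scale for key in tokens])
        refined.append([
            sum(w * value[d] for w, value in zip(weights, tokens))
            for d in range(len(query))
        ])
    return refined


def refine_step(video, noun_vec, verb_vec):
    """One refinement step: attend across space, then across time.

    `video` is a nested list indexed as [frame][token][feature].
    Assumption: the noun phrase guides the spatial pass and the verb
    phrase guides the temporal pass.
    """
    num_frames, num_tokens = len(video), len(video[0])
    # Spatial pass: tokens within each frame attend to one another.
    spatial = [guided_attention(frame, noun_vec) for frame in video]
    # Temporal pass: each token position attends across frames.
    refined = [[None] * num_tokens for _ in range(num_frames)]
    for n in range(num_tokens):
        track = guided_attention([spatial[t][n] for t in range(num_frames)], verb_vec)
        for t in range(num_frames):
            refined[t][n] = track[t]
    return refined


def planner_refiner(video, plan):
    """Recurrently apply one refinement step per planned short sentence."""
    state = video
    for noun_vec, verb_vec in plan:
        state = refine_step(state, noun_vec, verb_vec)
    return state


# Toy input: 2 frames, 2 tokens per frame, 2 features per token.
video = [[[1.0, 0.0], [0.0, 1.0]], [[0.5, 0.5], [1.0, 1.0]]]
plan = [([1.0, 0.0], [0.0, 1.0]), ([0.0, 1.0], [1.0, 0.0])]
refined = planner_refiner(video, plan)
```

In this sketch the refined tokens keep the input's frame, token, and feature dimensions, so the output of one step can be fed directly into the next. A task-specific head (segmentation or temporal grounding) would consume the final state; that part is omitted here.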

Location

Bologna, Italy

Open access

  • Yes

Language

eng

Volume

413

Pagination

517-524

Start date

2025-10-25

End date

2025-10-30

ISSN

0922-6389

eISSN

1879-8314

ISBN-13

978-1-64368-631-8

Title of proceedings

ECAI 2025 : Proceedings of the 28th European Conference on Artificial Intelligence Including 14th Conference on Prestigious Applications of Intelligent Systems 2025

Event

Artificial Intelligence. Conference (28th : 2025 : Bologna, Italy)

Publisher

IOS Press

Series

Frontiers in Artificial Intelligence and Applications
