Deakin University
File(s) under permanent embargo

Video captioning with temporal and region graph convolution network

conference contribution
posted on 2020-01-01, 00:00 authored by Xinlong Xiao, Yuejie Zhang, Rui Feng, Tao Zhang, Shang Gao, Weiguo Fan
Video captioning aims to generate a natural language description for a given video clip, drawing on both spatial and temporal information. To better exploit the spatial-temporal information attached to videos, we propose a novel video captioning framework with a Temporal Graph Network (TGN) and a Region Graph Network (RGN). TGN focuses on the sequential information of frames, which most existing methods ignore, while RGN is designed to explore the relationships among salient objects. Unlike previous work, we introduce a Graph Convolution Network (GCN) to encode frames together with their sequential information, and we build a region graph to exploit object information. We also adopt a stacked GRU decoder with a coarse-to-fine structure for caption generation. Promising experimental results on two benchmark datasets (MSVD and MSR-VTT) demonstrate the effectiveness of our model.
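The core idea of encoding frames with a GCN over a temporal graph can be illustrated with a minimal numpy sketch. This is not the paper's implementation: the adjacency construction (a simple chain linking consecutive frames), the feature dimensions, and the function names below are all illustrative assumptions; the standard symmetric-normalized propagation rule H' = ReLU(D^(-1/2)(A+I)D^(-1/2) H W) is used as the GCN layer.

```python
import numpy as np

def temporal_adjacency(n_frames):
    """Illustrative temporal graph: a chain linking each frame to its
    immediate neighbours, capturing sequential order (an assumption,
    not necessarily the paper's exact graph construction)."""
    A = np.zeros((n_frames, n_frames))
    idx = np.arange(n_frames - 1)
    A[idx, idx + 1] = 1.0  # edge frame t -> t+1
    A[idx + 1, idx] = 1.0  # symmetric edge t+1 -> t
    return A

def gcn_layer(H, A, W):
    """One graph-convolution step:
    H' = ReLU(D^(-1/2) (A + I) D^(-1/2) H W)."""
    A_hat = A + np.eye(A.shape[0])            # add self-loops
    d = A_hat.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))    # symmetric normalization
    return np.maximum(0.0, D_inv_sqrt @ A_hat @ D_inv_sqrt @ H @ W)

# Toy example: 8 frame features of dimension 16 propagated over the
# temporal graph (dimensions are arbitrary choices for illustration).
rng = np.random.default_rng(0)
H = rng.standard_normal((8, 16))   # per-frame features
W = rng.standard_normal((16, 16))  # learnable layer weights
A = temporal_adjacency(8)
H_out = gcn_layer(H, A, W)         # frame features enriched with context
```

The same propagation rule applies to the region graph: there, nodes would be salient-object region features and edges would encode object relationships instead of temporal order.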

History

Pagination

1-6

Location

Online/London, Eng.

Start date

2020-07-06

End date

2020-07-10

ISBN-13

978-1-7281-1331-9

Language

eng

Publication classification

E1 Full written paper - refereed

Editor/Contributor(s)

[Unknown]

Title of proceedings

ICME 2020 : Proceedings of the 2020 IEEE International Conference on Multimedia and Expo

Event

IEEE Computer Society. International Conference (2020 : Online/London, Eng.)

Publisher

Institute of Electrical and Electronics Engineers

Place of publication

Piscataway, N.J.

Series

IEEE Computer Society International Conference
