Video captioning with temporal and region graph convolution network
conference contribution
Posted on 2020-01-01. Authored by Xinlong Xiao, Yuejie Zhang, Rui Feng, Tao Zhang, Shang Gao, Weiguo Fan
Video captioning aims to generate a natural language description for a given video clip, which carries not only spatial but also temporal information. To better exploit the spatio-temporal information in videos, we propose a novel video captioning framework with a Temporal Graph Network (TGN) and a Region Graph Network (RGN). TGN focuses on exploiting the sequential information of frames, which most existing methods ignore. RGN is designed to explore the relationships among salient objects. Unlike previous work, we introduce a Graph Convolutional Network (GCN) to encode frames together with their sequential information, and build a region graph to exploit object information. We also adopt a stacked GRU decoder with a coarse-to-fine structure for caption generation. Experimental results on two benchmark datasets (MSVD and MSR-VTT) demonstrate the effectiveness of our model.
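Since the abstract only names the components, the following minimal PyTorch sketch illustrates the two core ideas in isolation: a graph convolution over temporally linked frame nodes and a stacked, coarse-to-fine GRU decoder. All class names, dimensions, and the chain-adjacency construction are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch (not the paper's code) of a temporal graph convolution over
# frame features and a stacked coarse-to-fine GRU decoder. All shapes and the
# adjacency construction are assumptions made for illustration.
import torch
import torch.nn as nn
import torch.nn.functional as F


class TemporalGraphConv(nn.Module):
    """One GCN layer over frame nodes; edges link temporally adjacent frames."""

    def __init__(self, in_dim: int, out_dim: int):
        super().__init__()
        self.linear = nn.Linear(in_dim, out_dim)

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        # frames: (batch, num_frames, in_dim)
        n = frames.size(1)
        # Chain adjacency with self-loops: frame t connects to t-1, t, t+1.
        adj = (torch.eye(n)
               + torch.diag(torch.ones(n - 1), 1)
               + torch.diag(torch.ones(n - 1), -1))
        adj = adj / adj.sum(dim=1, keepdim=True)  # row-normalize
        adj = adj.to(frames.device)
        return F.relu(self.linear(adj @ frames))


class CoarseToFineDecoder(nn.Module):
    """Two stacked GRUs: the first drafts a coarse hidden state per step,
    the second refines it before word prediction."""

    def __init__(self, feat_dim: int, hidden: int, vocab_size: int):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden)
        self.coarse = nn.GRU(hidden + feat_dim, hidden, batch_first=True)
        self.fine = nn.GRU(hidden, hidden, batch_first=True)
        self.out = nn.Linear(hidden, vocab_size)

    def forward(self, tokens: torch.Tensor, video_feat: torch.Tensor) -> torch.Tensor:
        # tokens: (batch, seq_len); video_feat: (batch, feat_dim)
        emb = self.embed(tokens)
        ctx = video_feat.unsqueeze(1).expand(-1, emb.size(1), -1)
        h1, _ = self.coarse(torch.cat([emb, ctx], dim=-1))  # coarse pass
        h2, _ = self.fine(h1)                               # refining pass
        return self.out(h2)  # (batch, seq_len, vocab_size)


# Usage with dummy shapes: 2 clips, 16 frames of 512-d CNN features.
frames = torch.randn(2, 16, 512)
gcn = TemporalGraphConv(512, 256)
video_feat = gcn(frames).mean(dim=1)  # pool frame nodes into one clip vector
decoder = CoarseToFineDecoder(256, 256, vocab_size=1000)
logits = decoder(torch.randint(0, 1000, (2, 10)), video_feat)
```

The region graph over detected objects would follow the same GCN pattern with object features as nodes and a learned or relation-based adjacency; it is omitted here for brevity.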