File(s) under permanent embargo
Sparse Dense Transformer Network for Video Action Recognition
conference contribution
posted on 2023-02-23, 00:40, authored by X Qu, Z Zhang, W Xiao, J Ran, G Wang, Zili Zhang

The action recognition backbone has continued to advance. The two-stream method based on Convolutional Neural Networks (CNNs) usually attends more to a video's local features and, because of the limited receptive field of convolution kernels, ignores global information. Attention-based Transformers are adopted to capture global information, but they are inferior to CNNs at extracting local features. Richer features can improve video representations. Therefore, a novel two-stream Transformer model, the Sparse Dense Transformer Network (SDTN), is proposed, which involves (i) a Sparse pathway, operating at a low frame rate, to capture spatial semantics and local features; and (ii) a Dense pathway, running at a high frame rate, to abstract motion information. A new patch-based cropping approach is presented to make the model focus on the patches at the center of the frame. Furthermore, frame alignment, a method that compares the input frames of the two pathways, reduces the computational cost. Experiments show that SDTN extracts deeper spatiotemporal features through an input policy of varied temporal resolutions, reaching 82.4% accuracy on Kinetics-400 and outperforming the previous method by more than 1.9% in accuracy.
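The two-pathway input policy and the frame-alignment idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' code: the function name `sample_pathways`, the frame counts, and the frame-rate ratio `alpha` are all assumptions; the sketch only shows how a low-frame-rate Sparse pathway can reuse a subset of the Dense pathway's frames, so that aligned frames need to be decoded only once.

```python
# Hypothetical sketch of SDTN-style two-pathway frame sampling
# (illustrative only; names and parameters are assumptions).

def sample_pathways(num_frames, t_sparse=8, alpha=4):
    """Return frame indices for the Sparse and Dense pathways.

    t_sparse: number of frames for the low-frame-rate Sparse pathway (assumed).
    alpha:    frame-rate ratio between the two pathways (assumed).
    """
    t_dense = t_sparse * alpha
    stride = max(num_frames // t_dense, 1)
    # Dense pathway: high frame rate, uniformly strided indices.
    dense = [i * stride for i in range(t_dense)]
    # Frame alignment: the Sparse pathway takes every alpha-th Dense
    # frame, so its inputs are a subset of the Dense pathway's inputs
    # and the shared frames need not be decoded or cropped twice.
    sparse = dense[::alpha]
    return sparse, dense

sparse, dense = sample_pathways(256)
assert set(sparse) <= set(dense)  # aligned: Sparse frames are shared
```

Under this sampling scheme, a 256-frame clip yields 32 Dense-pathway frames and 8 Sparse-pathway frames, with every Sparse frame aligned to a Dense one.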
History

Volume: 13369 LNAI
Pagination: 43-56
Publisher DOI:
ISSN: 0302-9743
eISSN: 1611-3349
ISBN-13: 9783031109850
Publication classification: E1.1 Full written paper - refereed
Title of proceedings: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Publisher: Springer International Publishing
Series: Lecture Notes in Computer Science