Abstract

Video data exhibits strong dynamics in both the spatial and temporal domains. In the literature, 3D Convolutional Neural Networks (3D CNNs) have proven successful at learning spatio-temporal features jointly. However, because 3D convolutions are computationally expensive, their kernels are usually kept small, which largely limits the networks' learning capability. To address this issue, in this paper we enhance the capability of 3D CNNs to extract dynamic features. We capture long-distance information by modeling the temporal and spatial features as graphs, and then learn the dynamic graph structure from the feature maps of the 3D CNN. This corresponds to Graph Convolutional Networks (GCNs) whose adjacency matrix is determined dynamically from the feature maps. With the learned dynamic graph, we introduce and fuse a frame-wise GCN and a channel-wise GCN to enhance the temporal and spatial feature learning of 3D CNNs. Our proposed spatio-temporal graph convolutional network (STGCN) is a general module that can be embedded into popular 3D CNN architectures (e.g., ResNeXt, P3D). Extensive experiments on two video datasets for action recognition (UCF-101 and HMDB-51) demonstrate that state-of-the-art models equipped with our STGCN module achieve significant performance improvements.
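
The abstract describes the mechanism only at a high level: graphs are built dynamically from 3D-CNN feature maps, refined by a frame-wise GCN and a channel-wise GCN, and fused back into the backbone. The sketch below illustrates one plausible reading of that idea; it is not the authors' released code. All names (DynamicGCN, STGCNBlock), the softmax-normalized dot-product adjacency, and the residual gating fusion are illustrative assumptions, not the paper's exact formulation.

import torch
import torch.nn as nn
import torch.nn.functional as F


class DynamicGCN(nn.Module):
    """One graph-convolution layer whose adjacency matrix is computed
    on the fly from the input node features (dot-product similarity)."""

    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.proj = nn.Linear(in_dim, out_dim)

    def forward(self, x):                      # x: (N, nodes, in_dim)
        # Dynamic adjacency: pairwise similarity, row-normalized with softmax.
        adj = F.softmax(torch.bmm(x, x.transpose(1, 2)), dim=-1)  # (N, nodes, nodes)
        return F.relu(self.proj(torch.bmm(adj, x)))               # A X W


class STGCNBlock(nn.Module):
    """Refines a 3D-CNN feature map (N, C, T, H, W) with a frame-wise GCN
    (one node per frame) and a channel-wise GCN (one node per channel),
    then fuses both signals back into the input as a residual."""

    def __init__(self, channels, frames):
        super().__init__()
        self.frame_gcn = DynamicGCN(channels, channels)  # node feature = channel vector
        self.chan_gcn = DynamicGCN(frames, frames)       # node feature = temporal profile

    def forward(self, x):                      # x: (N, C, T, H, W)
        n, c, t, h, w = x.shape
        desc = x.mean(dim=(3, 4))              # spatially pooled descriptor, (N, C, T)

        frame_nodes = desc.transpose(1, 2)     # (N, T, C): T frame nodes
        frame_out = self.frame_gcn(frame_nodes).transpose(1, 2)  # back to (N, C, T)

        chan_out = self.chan_gcn(desc)         # (N, C, T): C channel nodes

        # Fuse the two graph-refined signals and broadcast over space.
        fused = (frame_out + chan_out).view(n, c, t, 1, 1)
        return x + x * torch.sigmoid(fused)    # residual re-weighting of the feature map


if __name__ == "__main__":
    feat = torch.randn(2, 64, 8, 14, 14)       # e.g., one stage of a 3D backbone
    block = STGCNBlock(channels=64, frames=8)
    print(block(feat).shape)                   # torch.Size([2, 64, 8, 14, 14])

Because the block preserves the (N, C, T, H, W) shape and acts as a residual, it can in principle be dropped between stages of backbones such as ResNeXt-3D or P3D, which matches the abstract's claim that STGCN works as a general embeddable module.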
