Abstract

Two-stream convolutional networks have shown strong performance on video action recognition tasks due to their ability to capture spatial and temporal features simultaneously. However, computing optical flow is time-consuming, which prevents real-time video processing. To address this problem, this paper proposes a new end-to-end architecture called SpatioTemporal Relation Networks (STRN), which extracts spatial and temporal information simultaneously from video using RGB input alone. STRN consists of two branches, called the appearance stream and the motion stream. The appearance stream retains the structure of the original spatial stream in the two-stream architecture, but takes consecutive frames instead of a single frame as input. The motion stream, which takes as input the relation information between adjacent features in the appearance stream, effectively complements the appearance stream. A relation block serves as the extractor of this relation information from the appearance stream. Because STRN learns spatiotemporal information from RGB input alone, it avoids the computation of optical flow. We validate STRN on UCF-101 and HMDB-51 and achieve better performance.
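The abstract describes the relation block only as an extractor of relation information between adjacent features in the appearance stream. As a minimal sketch of that idea, one simple hypothesis is a temporal difference of consecutive feature maps; the function name, shapes, and the difference operation below are assumptions for illustration, not the paper's actual operation.

```python
import numpy as np

def relation_block(features: np.ndarray) -> np.ndarray:
    """Hypothetical relation extractor over appearance-stream features.

    features: array of shape (T, C, H, W), one feature map per frame.
    Returns (T-1, C, H, W): a relation map between each pair of
    adjacent time steps, here taken as a simple temporal difference.
    """
    return features[1:] - features[:-1]

# Toy appearance-stream features: 8 frames, 16 channels, 7x7 spatial grid.
feats = np.random.randn(8, 16, 7, 7).astype(np.float32)
relations = relation_block(feats)
print(relations.shape)  # (7, 16, 7, 7)
```

In this reading, the motion stream would consume these relation maps in place of optical flow, so the whole network runs on RGB frames only.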
