Abstract
This paper presents a convolutional neural network (CNN)-based approach for first-person action recognition that combines temporal pooling with the Hilbert–Huang transform (HHT). The approach first adaptively localizes temporal sub-actions, treats each channel of the extracted trajectory-pooled CNN features as a time series, and summarizes the temporal dynamics within each sub-action by temporal pooling. The temporal evolution across sub-actions is then modeled by rank pooling. Thereafter, to account for the highly dynamic scene changes in first-person videos, the HHT decomposes the rank-pooled features into a finite, and often small, number of data-dependent functions, called intrinsic mode functions (IMFs), through empirical mode decomposition. Hilbert spectral analysis is then applied to each IMF component, and four salient descriptors are extracted and aggregated into the final video descriptor. This framework not only captures both long- and short-term tendencies precisely, but also copes with the significant camera motion in first-person videos, yielding better accuracy. Furthermore, it works well for complex actions with limited training samples. Experiments on four publicly available first-person video datasets show that the proposed approach outperforms the main state-of-the-art methods.
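To make the HHT step concrete, the sketch below shows how one channel of a rank-pooled feature sequence could be decomposed into IMFs via empirical mode decomposition and summarized with Hilbert spectral statistics. This is a minimal illustration, not the paper's implementation: the function name hht_descriptors, the sampling-rate parameter fs, and the four statistics chosen (mean and standard deviation of instantaneous amplitude and frequency) are assumptions, since the abstract does not name the four descriptors.

    # Minimal sketch of an HHT pipeline, assuming the PyEMD package
    # (pip install EMD-signal) and SciPy. Illustrative only.
    import numpy as np
    from scipy.signal import hilbert
    from PyEMD import EMD

    def hht_descriptors(signal, fs=1.0):
        """Decompose a 1-D signal into IMFs and summarize each IMF with
        four Hilbert-spectrum statistics (assumed descriptors)."""
        imfs = EMD()(signal)  # data-dependent IMFs via empirical mode decomposition
        features = []
        for imf in imfs:
            analytic = hilbert(imf)            # analytic signal of the IMF
            amplitude = np.abs(analytic)       # instantaneous amplitude
            phase = np.unwrap(np.angle(analytic))
            inst_freq = np.diff(phase) * fs / (2 * np.pi)  # instantaneous frequency
            features.append([amplitude.mean(), amplitude.std(),
                             inst_freq.mean(), inst_freq.std()])
        return np.concatenate(features)        # per-channel descriptor block

    # Usage: one synthetic channel standing in for a rank-pooled CNN feature series
    rng = np.random.default_rng(0)
    channel = np.sin(np.linspace(0, 8 * np.pi, 256)) + 0.3 * rng.standard_normal(256)
    print(hht_descriptors(channel).shape)

In the full method, such per-IMF statistics would be computed for every feature channel and aggregated into the final video descriptor; the number of IMFs, and hence the descriptor length, is data-dependent.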