Abstract

Activity recognition is a fundamental task in computer vision. Impressive results have been achieved on high-resolution videos, but on extreme low-resolution videos, which capture action information at a distance and are vital for preserving privacy, the performance of activity recognition algorithms remains far from satisfactory. The reason is that extreme low-resolution images (e.g., 12 × 16 pixels) lack the scene and appearance information needed for effective recognition. To address this problem, we propose a super-resolution-driven generative adversarial network for activity recognition. To take full advantage of the latent information in low-resolution images, a network module first super-resolves the extreme low-resolution images with a large scale factor. A general activity recognition network then analyzes the super-resolved video clips. Extensive experiments on two public benchmarks evaluate the effectiveness of the proposed method. The results demonstrate that it outperforms several state-of-the-art low-resolution activity recognition approaches.
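To give a sense of how little information a 12 × 16 frame carries, the following is a minimal sketch rather than the method described above. It shrinks a synthetic grayscale frame to 12 × 16 by block averaging and then enlarges it again with nearest-neighbor repetition. The 96 × 128 source size, the scale factor of 8, and the helper names `average_pool` and `nearest_upsample` are assumptions made for illustration; the nearest-neighbor step is only a naive stand-in for a learned super-resolution network and recovers no detail.

```python
def average_pool(frame, out_h, out_w):
    """Downsample a 2D grayscale frame (list of rows) by block averaging."""
    in_h, in_w = len(frame), len(frame[0])
    if in_h % out_h or in_w % out_w:
        raise ValueError("input size must be a multiple of the output size")
    block_h, block_w = in_h // out_h, in_w // out_w
    pooled = []
    for r in range(out_h):
        row = []
        for c in range(out_w):
            # Collect every pixel in the block that maps to output pixel (r, c).
            block = [
                frame[r * block_h + i][c * block_w + j]
                for i in range(block_h)
                for j in range(block_w)
            ]
            row.append(sum(block) / len(block))
        pooled.append(row)
    return pooled


def nearest_upsample(frame, scale):
    """Naive upscaling: repeat each pixel `scale` times along both axes."""
    return [
        [value for value in row for _ in range(scale)]
        for row in frame
        for _ in range(scale)
    ]


# Hypothetical 96 x 128 source frame filled with a simple diagonal gradient.
HIGH_H, HIGH_W, SCALE = 96, 128, 8
high_res = [[(x + y) % 256 for x in range(HIGH_W)] for y in range(HIGH_H)]

low_res = average_pool(high_res, HIGH_H // SCALE, HIGH_W // SCALE)  # 12 x 16
restored = nearest_upsample(low_res, SCALE)                         # 96 x 128

print(len(low_res), len(low_res[0]))      # size of the extreme low-resolution frame
print(len(restored), len(restored[0]))    # size after naive upscaling
```

Each pixel of the 12 × 16 frame summarizes an 8 × 8 block of the source, so the naive upscaling yields a blocky image. A super-resolution stage is meant to do better than this baseline before the frames reach the recognition network.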

Highlights

  • It is crucial to develop intelligent video understanding algorithms for various tasks, such as video recommendation and human activity recognition.

  • We propose an extreme low-resolution activity recognition approach aided by a super-resolution generative adversarial network

  • It is true that we can use more advanced variants of generative adversarial network (GAN) to obtain better super-resolution performance, but we restrict our choice to SDSR based on two factors: (1) the basic idea of this manuscript is to propose a framework for extreme low-resolution activity recognition, not a new SR


Summary

Introduction

It is crucial to develop intelligent video understanding algorithms for various tasks, such as video recommendation and human activity recognition. Many efforts have been made in the field of activity recognition. Typical methods include the two-stream convolutional network [1] and C3D [2]. These approaches assume that the provided videos are of high quality and that the video regions containing human activities are large enough to model spatiotemporal information. In certain situations, such as far-field video surveillance, where a person is usually very far away from the camera, this assumption is invalid: only low-resolution videos are acquired, since the regions of interest (ROIs) can be extremely tiny in the video frames.

