Abstract

Classification of human actions is an ongoing research problem in computer vision. This review aims to survey the current literature on data fusion and action recognition techniques and to identify gaps and future research directions. Success in producing cost-effective and portable vision-based sensors has dramatically increased the number and size of datasets. The increase in the number of action recognition datasets intersects with advances in deep learning architectures and computational support, both of which offer significant research opportunities. Each action-data modality, such as RGB, depth, skeleton, and infrared (IR), has distinct characteristics; therefore, it is important to exploit the value of each modality for better action recognition. In this paper, we focus solely on data fusion and recognition techniques in the context of vision with an RGB-D perspective. We conclude by discussing research challenges, emerging trends, and possible future research directions.

Highlights

  • Human action recognition (HAR) has recently gained increasing attention from computer vision researchers with applications in robot vision, multimedia content search, video surveillance, and motion tracking systems

  • The following subsections discuss the fundamental variants of neural networks; we then present some modern deep learning-based approaches applied to RGB-D data

  • Because performance depends on high-end hardware, support for multiple graphics processing units (GPUs) is a must when experimenting with big data-related problems

Summary

Introduction

Human action recognition (HAR) has recently gained increasing attention from computer vision researchers, with applications in robot vision, multimedia content search, video surveillance, and motion tracking systems. The development of low-cost sensors such as Microsoft Kinect [1], Intel RealSense [2], and Orbbec [3] has sparked further research into action recognition. These sensors collect data in various modalities such as RGB video, depth, skeleton, and IR. Each modality has its own characteristics that can help address challenges related to action data and give computer vision researchers opportunities to examine vision data from different perspectives. RGB-D data acquisition and the sensors most popular with consumers are discussed in the following subsections.

RGB-D Data Acquisition
RGB-D Sensors
Classical Machine Learning-Based Techniques
Depth Data-Based Techniques
Skeleton Sequence-Based Techniques
RGB-D Data-Based Techniques
Deep Learning
Neural Networks Variants
Deep Learning-Based Techniques Using RGB-D Data
Single Stream
Two Stream
Hybrid Deep Learning-Based Techniques for HAR
Data Fusion Techniques
Early Fusion
Slow Fusion
Late Fusion
Multi-Resolution
Content-Based Video Summarization
Education and Learning
Healthcare Systems
Entertainment Systems
Safety and Surveillance Systems
Sports
Challenges in RGB-D Data Fusion
Combination of Classical Machine Learning and Deep Learning-Based Methods
Assessment in Practical Scenarios
Self-Learning
Interpretation of Online Human Actions
Multimodal Fusion
Conclusions
