Abstract

Human activity recognition (HAR) has become an active research topic due to its wide range of applications. Fusion-based methods can complement approaches that rely on a single sensing modality. This paper presents the simultaneous use of skeleton data and inertial signals, captured at the same time by a Kinect depth camera and ten wearable inertial sensors, within a fusion framework in order to achieve more robust human action recognition than when each sensing modality is used individually. Skeleton data captured by the Kinect depth camera are transformed into a weighted front-view skeleton motion map (WF-SMM), a weighted multi-view skeleton motion map (WM-SMM), and a 3D weighted skeleton motion map (3DW-SMM), which are then fed into a convolutional neural network. Meanwhile, the inertial data are transformed into 2D inertial images and fed into a 2D dilated convolutional neural network. Two types of fusion are considered: decision-level fusion and feature-level fusion. Experiments were conducted on the publicly available Changzhou University Multimodal Human Action Dataset (CZU-MHAD), in which synchronized skeleton sequences and inertial signals were captured for a total of 22 actions. The results indicate that both the decision-level and feature-level fusion approaches achieve higher recognition accuracies than approaches using each sensing modality individually. The highest accuracy, 98.90%, was obtained with the decision-level fusion approach using 3DW-SMM. In addition, experiments were conducted on continuous action streams generated from the CZU-MHAD with different score thresholds; the highest F1 score of 82.05% was obtained with a threshold of 0.4.
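
To make the fusion framework described above more concrete, the following is a minimal sketch in PyTorch; the layer sizes, input shapes, and helper names are hypothetical illustrations, not the authors' implementation. The skeleton branch is an ordinary CNN over a skeleton motion map, the inertial branch is a 2D dilated CNN over an inertial image, and the two branches are combined either by concatenating their feature vectors (feature-level fusion) or by averaging their per-modality class scores (decision-level fusion).

```python
# Minimal two-branch fusion sketch (hypothetical shapes and layer sizes).
import torch
import torch.nn as nn

NUM_CLASSES = 22  # CZU-MHAD contains 22 actions


class SkeletonBranch(nn.Module):
    """Plain CNN over a skeleton motion map (e.g. a WF-SMM rendered as a 3-channel image)."""
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(), nn.AdaptiveAvgPool2d(1),
        )

    def forward(self, x):
        return self.features(x).flatten(1)  # (batch, 64) feature vector


class InertialBranch(nn.Module):
    """2D dilated CNN over an inertial image built from the wearable-sensor signals."""
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, padding=2, dilation=2), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=3, padding=4, dilation=4), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )

    def forward(self, x):
        return self.features(x).flatten(1)  # (batch, 64) feature vector


class FusionHAR(nn.Module):
    def __init__(self):
        super().__init__()
        self.skel = SkeletonBranch()
        self.inert = InertialBranch()
        # Feature-level fusion: classify the concatenated feature vector.
        self.joint_head = nn.Linear(64 + 64, NUM_CLASSES)
        # Decision-level fusion: one classifier per modality, scores combined afterwards.
        self.skel_head = nn.Linear(64, NUM_CLASSES)
        self.inert_head = nn.Linear(64, NUM_CLASSES)

    def forward(self, skeleton_map, inertial_image, fusion="decision"):
        fs, fi = self.skel(skeleton_map), self.inert(inertial_image)
        if fusion == "feature":
            return self.joint_head(torch.cat([fs, fi], dim=1))
        # Decision-level fusion: average the per-modality class probabilities.
        ps = torch.softmax(self.skel_head(fs), dim=1)
        pi = torch.softmax(self.inert_head(fi), dim=1)
        return (ps + pi) / 2


model = FusionHAR()
scores = model(torch.randn(4, 3, 64, 64), torch.randn(4, 1, 64, 64), fusion="decision")
print(scores.argmax(dim=1))  # predicted action labels for a batch of 4 samples
```

For the continuous-stream experiments mentioned in the abstract, one would additionally accept a window's prediction only when its top class score exceeds a threshold (0.4 gives the best F1 score reported above); that post-processing step is not shown in this sketch.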
