Abstract

Unlike other human behaviors, sign language is characterized by limited local motion of the upper limbs and fine-grained hand actions. Some sign language gestures are ambiguous in RGB video because of lighting and background color, which degrades recognition accuracy. We propose a multimodal deep learning architecture for sign language recognition that effectively combines RGB-D input with a two-stream spatiotemporal network. Depth video, as an effective complement to RGB input, supplies additional distance information about the signer's hands. A novel sampling method, ARSS (Aligned Random Sampling in Segments), is proposed to select and align optimal RGB-D video frames, which improves the utilization of the multimodal data and reduces redundancy. We obtain the hand ROI from the joint information of the RGB data for local focus in the spatial stream. D-shift Net is proposed for depth motion feature extraction in the temporal stream, fully utilizing the three-dimensional motion information of sign language. The two streams are fused by a convolutional fusion layer to obtain complementary features. Our approach exploits multimodal information and enhances recognition precision, achieving state-of-the-art performance on the CSL (96.7%) and IsoGD (63.78%) datasets.
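The abstract's ARSS idea can be sketched as follows: split the video into equal segments, draw one random frame index per segment, and reuse the same indices for both modalities so RGB and depth stay aligned. This is a minimal illustrative sketch; the function name, signature, and segment-boundary handling are our assumptions, not the paper's code.

```python
import random

def arss_indices(num_frames, num_segments, rng=None):
    """Aligned Random Sampling in Segments (ARSS) sketch.

    Splits a video of `num_frames` frames into `num_segments` equal
    segments and draws one random frame index from each. Applying the
    same index list to the RGB and depth streams keeps the two
    modalities temporally aligned. (Illustrative assumption, not the
    authors' implementation.)
    """
    rng = rng or random.Random()
    seg_len = num_frames / num_segments
    indices = []
    for k in range(num_segments):
        lo = int(k * seg_len)
        hi = max(lo, int((k + 1) * seg_len) - 1)
        indices.append(rng.randint(lo, hi))
    return indices

# One index list samples both modalities, so RGB frame i and depth
# frame i always come from the same time step.
idx = arss_indices(num_frames=64, num_segments=8, rng=random.Random(0))
```

Seeding the generator makes the sampling reproducible per clip while still randomizing frame choice within each segment across training epochs.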

Highlights

  • With the development of computer vision, research on single-person behavior recognition has made significant progress

  • This paper studies how to use recent deep learning methods with RGB-D multimodal input to overcome the ambiguity of sign language gestures in RGB video

  • We propose a sign language recognition method based on a multimodal two-stream neural network, as illustrated in Fig. 1. The main contributions are as follows: (1) We propose a sampling method named Aligned Random Sampling in Segments (ARSS), which samples RGB data for spatial feature extraction and aligned depth data for temporal feature extraction


Summary

INTRODUCTION

With the development of computer vision, research on single-person behavior recognition has made significant progress. By combining the spatiotemporal two-stream network method with multimodal data input, the effectiveness of feature extraction can be enhanced; the performance of the sign language recognition system is then greatly improved, and the intelligence of the system further strengthened, which is of great significance for both research and application. The optical flow method calculates motion features of objects between sequential frames. It requires continuous video input, and its huge computational cost significantly reduces the speed of the whole network model.
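The high cost of dense optical flow is what motivates depth-based motion cues in the temporal stream. As a minimal numpy sketch of the underlying idea (a simple signed temporal depth difference, which is an illustration only and not the paper's learned D-shift Net):

```python
import numpy as np

def depth_motion_maps(depth_frames):
    """Cheap 3-D motion cue from consecutive depth maps.

    depth_frames: array of shape (T, H, W) holding T depth maps.
    Returns a (T-1, H, W) array of signed frame-to-frame differences:
    nonzero only where depth changed, with the sign encoding motion
    toward or away from the camera. Unlike dense optical flow, this
    costs one subtraction per pixel. (Illustrative sketch; the paper's
    D-shift Net learns such depth motion features end to end.)
    """
    depth_frames = np.asarray(depth_frames, dtype=np.float32)
    return depth_frames[1:] - depth_frames[:-1]
```

The contrast with optical flow is the point: flow solves a correspondence problem between frames, while a depth difference reads the third dimension of motion directly from the sensor.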

RELATED WORK
LOCAL FOCUS
TWO-STREAM NEURAL NETWORK FUSION
EXPERIMENTAL RESULTS AND ANALYSIS
DATASETS
CONCLUSION
