Abstract

Continuous sign language recognition is challenging because the ordered words (glosses) of a sentence have no exact temporal locations in the video. To address this problem, we propose a method based on pseudo-supervised learning. First, we use a 3D residual convolutional network (3D-ResNet) pre-trained on the UCF101 dataset to extract visual features. Second, we employ a sequence model trained with connectionist temporal classification (CTC) loss to learn the mapping between the visual features and sentence-level labels; this sequence model is also used to generate clip-level pseudo-labels. Since the CTC objective has only a limited effect on the visual features produced by the initial 3D-ResNet, we fine-tune the 3D-ResNet on the video clips and their clip-level pseudo-labels to obtain better feature representations. The feature extractor and the sequence model are optimized alternately with the CTC loss. The effectiveness of the proposed method is verified on the large-scale RWTH-PHOENIX-Weather 2014 dataset.
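
The following is a minimal sketch, not the authors' code, of the sequence-model stage described above: a bidirectional LSTM over pre-extracted clip-level 3D-ResNet features, trained with CTC loss against sentence-level gloss labels. It assumes PyTorch; the class name, feature dimension, and vocabulary size are illustrative placeholders.

```python
# Hypothetical sketch of the CTC-trained sequence model (PyTorch assumed).
import torch
import torch.nn as nn

class SequenceModel(nn.Module):  # illustrative name, not from the paper
    def __init__(self, feat_dim=512, hidden=256, vocab_size=1000):
        super().__init__()
        # BiLSTM over the sequence of clip features from the 3D-ResNet
        self.rnn = nn.LSTM(feat_dim, hidden, num_layers=2,
                           bidirectional=True, batch_first=True)
        self.fc = nn.Linear(2 * hidden, vocab_size + 1)  # +1 for the CTC blank

    def forward(self, feats):            # feats: (B, T, feat_dim)
        out, _ = self.rnn(feats)
        return self.fc(out)              # (B, T, vocab_size + 1)

model = SequenceModel()
ctc = nn.CTCLoss(blank=0, zero_infinity=True)

feats = torch.randn(2, 80, 512)              # clip features from the 3D-ResNet
targets = torch.randint(1, 1000, (2, 12))    # sentence-level gloss label sequences
input_lens = torch.full((2,), 80, dtype=torch.long)
target_lens = torch.full((2,), 12, dtype=torch.long)

# CTCLoss expects (T, B, C) log-probabilities
log_probs = model(feats).log_softmax(-1).transpose(0, 1)
loss = ctc(log_probs, targets, input_lens, target_lens)
loss.backward()
```

In the alternating scheme described in the abstract, the per-clip alignments decoded from such a model would serve as pseudo-labels for fine-tuning the 3D-ResNet, after which the sequence model is retrained on the improved features.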
