Abstract

Continuous sign language recognition is challenging because the ordered words (glosses) of a sentence have no exact temporal locations in the video. To address this problem, we propose a method based on pseudo-supervised learning. First, we use a 3D residual convolutional network (3D-ResNet) pre-trained on the UCF101 dataset to extract visual features. Second, we employ a sequence model trained with connectionist temporal classification (CTC) loss to learn the mapping between the visual features and sentence-level labels; this sequence model is also used to generate clip-level pseudo-labels. Since the CTC objective has only a limited effect on the visual features produced by the initial 3D-ResNet, we fine-tune the 3D-ResNet on the video clips and their clip-level pseudo-labels to obtain better feature representations. The feature extractor and the sequence model are optimized alternately with the CTC loss. The effectiveness of the proposed method is verified on the large-scale RWTH-PHOENIX-Weather 2014 dataset.
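
The following is a minimal sketch, not the authors' code, of the sequence-model stage described above: a bidirectional LSTM over pre-extracted clip-level 3D-ResNet features, trained with CTC loss against sentence-level gloss labels. It assumes PyTorch; the class name, feature dimension, and vocabulary size are illustrative placeholders.

```python
# Hypothetical sketch of the CTC-trained sequence model (PyTorch assumed).
import torch
import torch.nn as nn

class SequenceModel(nn.Module):  # illustrative name, not from the paper
    def __init__(self, feat_dim=512, hidden=256, vocab_size=1000):
        super().__init__()
        # BiLSTM over the sequence of clip features from the 3D-ResNet
        self.rnn = nn.LSTM(feat_dim, hidden, num_layers=2,
                           bidirectional=True, batch_first=True)
        self.fc = nn.Linear(2 * hidden, vocab_size + 1)  # +1 for the CTC blank

    def forward(self, feats):            # feats: (B, T, feat_dim)
        out, _ = self.rnn(feats)
        return self.fc(out)              # (B, T, vocab_size + 1)

model = SequenceModel()
ctc = nn.CTCLoss(blank=0, zero_infinity=True)

feats = torch.randn(2, 80, 512)              # clip features from the 3D-ResNet
targets = torch.randint(1, 1000, (2, 12))    # sentence-level gloss label sequences
input_lens = torch.full((2,), 80, dtype=torch.long)
target_lens = torch.full((2,), 12, dtype=torch.long)

# CTCLoss expects (T, B, C) log-probabilities
log_probs = model(feats).log_softmax(-1).transpose(0, 1)
loss = ctc(log_probs, targets, input_lens, target_lens)
loss.backward()
```

In the alternating scheme described in the abstract, the per-clip alignments decoded from such a model would serve as pseudo-labels for fine-tuning the 3D-ResNet, after which the sequence model is retrained on the improved features.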
