Abstract

In deep learning based continuous sign language recognition (CSLR), hierarchical models are favored over non-hierarchical models for their simple structure, fewer parameters, and clear hierarchy. In hierarchical models, spatial feature extraction and temporal feature extraction are separated and performed in sequence. This paper proposes an end-to-end fully 2D convolutional network (F2DCNet) for CSLR to explore a novel spatial-temporal feature extraction method within hierarchical models. The network consists of 2D CNNs only. After frame-level features are extracted from a sign language video, they are concatenated along the temporal dimension to form a new 2D feature map, which is then fed into a custom 2D-CNN for spatial-temporal feature extraction; finally, the network is trained with the multi-level connectionist temporal classification (CTC) loss proposed in our previous study. We conduct experiments on two large-scale publicly available continuous sign language datasets, and the results demonstrate the effectiveness of F2DCNet, which achieves highly competitive performance against other state-of-the-art methods. Moreover, the proposed feature extraction scheme can be applied to other video feature extraction tasks to obtain spatial-temporal features.
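To make the pipeline concrete, below is a minimal PyTorch sketch of the idea as described in the abstract: a 2D CNN produces one feature vector per frame, the vectors are stacked along time into a single 2D feature map, a second 2D CNN extracts spatial-temporal features from that map, and a CTC loss is applied to the per-timestep gloss predictions. All module names, layer sizes, and the use of plain CTC (rather than the paper's multi-level CTC loss) are illustrative assumptions, not the authors' actual architecture.

```python
import torch
import torch.nn as nn

class FrameEncoder(nn.Module):
    """Per-frame 2D CNN producing one feat_dim-dimensional vector per frame."""
    def __init__(self, feat_dim=512):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(3, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),            # -> (B*T, 128, 1, 1)
        )
        self.proj = nn.Linear(128, feat_dim)

    def forward(self, x):                       # x: (B*T, 3, H, W)
        h = self.conv(x).flatten(1)             # (B*T, 128)
        return self.proj(h)                     # (B*T, feat_dim)

class F2DSketch(nn.Module):
    """Hypothetical fully 2D pipeline: frame features -> 2D map -> 2D CNN."""
    def __init__(self, feat_dim=512, num_classes=100):
        super().__init__()
        self.frame_enc = FrameEncoder(feat_dim)
        # The stacked frame features form a 1-channel 2D "image" of shape
        # (T, feat_dim); a second 2D CNN extracts spatial-temporal features.
        self.st_conv = nn.Sequential(
            nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(),
        )
        self.classifier = nn.Linear(64 * feat_dim, num_classes)

    def forward(self, video):                   # video: (B, T, 3, H, W)
        B, T = video.shape[:2]
        frames = video.flatten(0, 1)            # (B*T, 3, H, W)
        feats = self.frame_enc(frames).view(B, T, -1)  # (B, T, feat_dim)
        fmap = feats.unsqueeze(1)               # (B, 1, T, feat_dim): 2D map
        st = self.st_conv(fmap)                 # (B, 64, T, feat_dim)
        st = st.permute(0, 2, 1, 3).flatten(2)  # (B, T, 64*feat_dim)
        return self.classifier(st).log_softmax(-1)  # per-timestep gloss scores

# Training step with a plain CTC loss over toy gloss sequences.
model = F2DSketch()
video = torch.randn(2, 16, 3, 64, 64)           # (B, T, C, H, W) toy input
log_probs = model(video).transpose(0, 1)        # CTC expects (T, B, classes)
targets = torch.randint(1, 100, (2, 5))         # class 0 reserved for blank
ctc = nn.CTCLoss(blank=0)
loss = ctc(log_probs, targets,
           input_lengths=torch.full((2,), 16),
           target_lengths=torch.full((2,), 5))
loss.backward()
```

The key design point this sketch illustrates is that no 3D convolution or recurrent module is needed: temporal context comes from the second 2D CNN's receptive field over the (time x feature) map.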
