The main challenge of continuous sign language translation (CSLT) lies in extracting both discriminative spatial features and temporal features. In this paper, a spatial-temporal feature extraction network (STFE-Net) is proposed for CSLT, which optimally fuses the spatial and temporal features extracted by the spatial feature extraction network (SFE-Net) and the temporal feature extraction network (TFE-Net), respectively. SFE-Net performs pose estimation on the signers in sign-language videos. Based on COCO-WholeBody, the 133 key points are reduced to 53 key points according to the characteristics of sign language. High-resolution pose estimation is performed on the hands, in addition to whole-body pose estimation, to obtain finer-grained hand features. The spatial features extracted by SFE-Net and the sign language words are then fed into TFE-Net, a Transformer with relative position encoding. A dataset for Chinese continuous sign language was created and used for evaluation; on it, STFE-Net achieves Bilingual Evaluation Understudy (BLEU-1, BLEU-2, BLEU-3, BLEU-4) scores of 77.59, 75.62, 74.25 and 72.14, respectively. The proposed STFE-Net was also evaluated on two public datasets, RWTH-Phoenix-Weather 2014T and CLS. On the former, it achieves BLEU-1, BLEU-2, BLEU-3 and BLEU-4 scores of 48.22, 33.59, 26.41 and 22.45, respectively; on the latter, the corresponding scores are 61.54, 58.76, 57.93 and 57.52. Experimental results show that the model achieves promising performance. Readers who need the code or dataset may email lunfee@whut.edu.cn.
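The abstract does not specify which relative position encoding TFE-Net uses, so the following is only a minimal sketch of the general idea: single-head attention in which a bias term, indexed by the relative offset between query and key positions, is added to the attention logits. All names and shapes here are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def relative_attention(q, k, v, rel_bias):
    """Single-head attention with a relative-position bias (sketch).

    q, k, v: (T, d) query/key/value matrices for one sequence.
    rel_bias: (2T-1,) learned bias indexed by relative offset j - i,
              shifted so that offset -(T-1) maps to index 0.
    """
    T, d = q.shape
    logits = q @ k.T / np.sqrt(d)                            # (T, T) content term
    offsets = np.arange(T)[None, :] - np.arange(T)[:, None]  # offset j - i
    logits = logits + rel_bias[offsets + (T - 1)]            # add position term
    return softmax(logits, axis=-1) @ v

# Toy usage with random features standing in for SFE-Net outputs.
rng = np.random.default_rng(0)
T, d = 5, 8
q, k, v = (rng.normal(size=(T, d)) for _ in range(3))
out = relative_attention(q, k, v, rel_bias=rng.normal(size=2 * T - 1))
```

Because the bias depends only on the distance j - i, the model can weight nearby frames differently from distant ones regardless of absolute position, which suits variable-length sign-language clips.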