Abstract

Neural Sign Language Translation (SLT), an important cross-modal task that bridges the communication gap between deaf and hearing people, has attracted great attention in the fields of artificial intelligence, computer vision, and multimedia. Although great progress has been achieved recently, current neural SLT models still suffer from translation errors caused by the under-consideration of non-manual features such as facial expressions, which can carry critical information in communication among deaf people. This paper aims to enhance traditional neural SLT models by highlighting facial expression information in the CNN-based sign video representation stage. We propose two novel schemes. The first is based on a multi-stream architecture, which extracts and represents facial expression information in an additional stream and aggregates it with the information from the main stream. The second is a pre-training scheme based on Regions of Interest (RoIs), which first trains a multi-region detection module to recognize face and body features and then transfers the pre-trained parameters to the corresponding module in the SLT model. To validate the proposed models, we conducted experiments on the publicly available SLT benchmark dataset RWTH-PHOENIX-Weather-2014T. Experimental results show that both schemes improve the performance of SLT models. In particular, the RoI-based scheme achieves an improvement of over 1.6 BLEU-4 points, while the multi-stream scheme, through its flexible components, quantitatively analyzes the importance of the face and thereby provides a solid basis for the RoI-based scheme.
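As a rough illustration of the first scheme's aggregation step, the sketch below (assuming PyTorch; the module names, layer sizes, and the concatenation-based fusion are illustrative assumptions, not the authors' implementation) shows how an auxiliary face stream could be encoded alongside the main full-frame stream and fused into a single per-frame representation before the translation model.

```python
# Minimal sketch of a two-stream frame encoder: a main CNN over the full
# sign-video frame plus an auxiliary CNN over a cropped face region, with
# the two feature vectors aggregated by a learned linear fusion layer.
# All names and dimensions here are hypothetical.
import torch
import torch.nn as nn

class TwoStreamFrameEncoder(nn.Module):
    def __init__(self, feat_dim=512):
        super().__init__()
        # Main stream: CNN over the full frame.
        self.main_cnn = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, feat_dim),
        )
        # Additional stream: CNN over the face crop (non-manual features).
        self.face_cnn = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(32, feat_dim),
        )
        # Aggregation: concatenate both streams and project back down.
        self.fuse = nn.Linear(2 * feat_dim, feat_dim)

    def forward(self, frame, face_crop):
        h_main = self.main_cnn(frame)       # (B, feat_dim)
        h_face = self.face_cnn(face_crop)   # (B, feat_dim)
        return self.fuse(torch.cat([h_main, h_face], dim=-1))

# Usage: the fused per-frame features would feed the downstream
# sequence-to-sequence translation model.
enc = TwoStreamFrameEncoder()
frame = torch.randn(2, 3, 224, 224)    # batch of full frames
face = torch.randn(2, 3, 64, 64)       # matching face crops
feats = enc(frame, face)               # (2, 512)
```

Under the second scheme, the face/body detection backbone would be trained first and its parameters copied into the corresponding encoder module of the SLT model before end-to-end training.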
