Abstract
Sign Language Production (SLP) refers to the task of translating textual forms of spoken language into corresponding sign language expressions. Sign languages convey meaning by means of multiple asynchronous articulators, including manual and non‐manual information channels. Recent deep learning‐based SLP models directly generate the full‐articulatory sign sequence from the text input in an end‐to‐end manner. However, these models largely down‐weight the importance of subtle differences in the manual articulation due to the effect of regression to the mean. To explore these neglected aspects, an efficient cascade dual‐decoder Transformer (CasDual‐Transformer) for SLP is proposed to learn, successively, two mappings, SLP_hand: Text → Hand pose and SLP_sign: Text → Sign pose, utilising an attention‐based alignment module that fuses the hand and sign features from previous time steps to predict a more expressive sign pose at the current time step. In addition, to provide more efficacious guidance, a novel spatio‐temporal loss is introduced that penalises shape dissimilarity and temporal distortions of the produced sequences. Experimental studies are performed on two benchmark sign language datasets from distinct cultures to verify the performance of the proposed model. Both quantitative and qualitative results show that the authors’ model demonstrates competitive performance compared to state‐of‐the‐art models and, in some cases, achieves considerable improvements over them.