Abstract

Sign Language Production (SLP) refers to the task of translating textual forms of spoken language into corresponding sign language expressions. Sign languages convey meaning through multiple asynchronous articulators, including manual and non-manual information channels. Recent deep learning-based SLP models generate the full articulatory sign sequence directly from the text input in an end-to-end manner. However, owing to the effect of regression to the mean, these models largely underweight subtle differences in manual articulation. To address these neglected aspects, an efficient cascade dual-decoder Transformer (CasDual-Transformer) for SLP is proposed to learn, successively, two mappings, SLP_hand: Text → Hand pose and SLP_sign: Text → Sign pose, utilising an attention-based alignment module that fuses the hand and sign features from previous time steps to predict a more expressive sign pose at the current time step. In addition, to provide more efficacious guidance, a novel spatio-temporal loss is introduced to penalise shape dissimilarity and temporal distortions in the produced sequences. Experimental studies are performed on two benchmark sign language datasets from distinct cultures to verify the performance of the proposed model. Both quantitative and qualitative results show that the authors' model achieves competitive performance compared to state-of-the-art models and, in some cases, considerable improvements over them.
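To make the cascade concrete, the following is a minimal PyTorch sketch of how the two mappings could be chained: a shared text encoder feeds a hand-pose decoder, whose features are fused into the sign-pose decoder through a causally masked attention alignment module. The authors' code is not reproduced here, so every module name, dimension (e.g. hand_dim, sign_dim), and design detail below is an illustrative assumption, not the paper's implementation.

```python
import torch
import torch.nn as nn


class CasDualTransformer(nn.Module):
    """Sketch of the cascade dual-decoder idea: a shared text encoder,
    a hand-pose decoder, then a sign-pose decoder whose queries are
    fused with hand features via attention. All names and sizes are
    assumptions; positional encodings are omitted for brevity."""

    def __init__(self, vocab_size, d_model=256, n_heads=4, n_layers=2,
                 hand_dim=63, sign_dim=150):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.text_encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True),
            n_layers)
        self.hand_decoder = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(d_model, n_heads, batch_first=True),
            n_layers)
        self.sign_decoder = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(d_model, n_heads, batch_first=True),
            n_layers)
        # Attention-based alignment: sign queries attend to hand features.
        self.align = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.hand_in = nn.Linear(hand_dim, d_model)
        self.hand_head = nn.Linear(d_model, hand_dim)
        self.sign_in = nn.Linear(sign_dim, d_model)
        self.sign_head = nn.Linear(d_model, sign_dim)

    def forward(self, text_ids, prev_hand, prev_sign):
        # text_ids: (B, S) token ids; prev_hand: (B, T, hand_dim) and
        # prev_sign: (B, T, sign_dim) are shifted targets (teacher forcing).
        memory = self.text_encoder(self.embed(text_ids))
        T = prev_hand.size(1)
        causal = torch.triu(torch.full((T, T), float("-inf")),
                            diagonal=1).to(text_ids.device)

        # Stage 1, SLP_hand: Text -> Hand pose.
        h = self.hand_decoder(self.hand_in(prev_hand), memory,
                              tgt_mask=causal)
        hand_pose = self.hand_head(h)

        # Alignment: fuse hand features from previous time steps into the
        # sign queries (causally masked), so fine-grained manual detail
        # conditions the full-body prediction at the current time step.
        s = self.sign_in(prev_sign)
        fused, _ = self.align(s, h, h, attn_mask=causal)
        s = s + fused

        # Stage 2, SLP_sign: Text -> Sign pose.
        sign_pose = self.sign_head(
            self.sign_decoder(s, memory, tgt_mask=causal))
        return hand_pose, sign_pose


# Toy usage: batch of 2 sentences, 16-frame pose sequences.
model = CasDualTransformer(vocab_size=1000)
text = torch.randint(0, 1000, (2, 8))
hand = torch.randn(2, 16, 63)
sign = torch.randn(2, 16, 150)
hand_pose, sign_pose = model(text, hand, sign)  # (2, 16, 63), (2, 16, 150)
```

The abstract describes the spatio-temporal loss only at a high level, so the sketch omits it; one plausible proxy would combine a per-frame shape term (e.g. joint-position or bone-length error) with a DTW-style alignment term to penalise temporal distortions, though the authors' exact formulation may differ.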
