Continuous 3D Multi-Channel Sign Language Production via Progressive Transformers and Mixture Density Networks

Ben Saunders, Necati Cihan Camgoz, Richard Bowden

doi:10.1007/s11263-021-01457-9

Open Access

https://doi.org/10.1007/s11263-021-01457-9

Journal: International Journal of Computer Vision
Publication Date: May 7, 2021
Citations: 28
License type: open-access

Affiliation: University of Surrey

Abstract

Sign languages are multi-channel visual languages, where signers use a continuous 3D space to communicate. Sign language production (SLP), the automatic translation from spoken to sign languages, must embody both the continuous articulation and full morphology of sign to be truly understandable by the Deaf community. Previous deep learning-based SLP works have produced only a concatenation of isolated signs focusing primarily on the manual features, leading to a robotic and non-expressive production. In this work, we propose a novel Progressive Transformer architecture, the first SLP model to translate from spoken language sentences to continuous 3D multi-channel sign pose sequences in an end-to-end manner. Our transformer network architecture introduces a counter decoding that enables variable length continuous sequence generation by tracking the production progress over time and predicting the end of sequence. We present extensive data augmentation techniques to reduce prediction drift, alongside an adversarial training regime and a mixture density network (MDN) formulation to produce realistic and expressive sign pose sequences. We propose a back translation evaluation mechanism for SLP, presenting benchmark quantitative results on the challenging PHOENIX14T dataset and setting baselines for future research. We further provide a user evaluation of our SLP model, to understand the Deaf reception of our sign pose productions.
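The counter decoding idea in the abstract can be illustrated with a minimal sketch. It assumes the decoder emits, for every frame, a pose vector together with a progress counter in [0, 1], and that generation stops once the counter reaches 1. The names `generate_with_counter`, `toy_step`, and `max_frames` are hypothetical and do not come from the paper; `toy_step` is only a stand-in for a trained decoder.

```python
from typing import Callable, List, Tuple

Pose = List[float]
# A step function maps the frames and counters produced so far
# to the next pose and the next counter value.
StepFn = Callable[[List[Pose], List[float]], Tuple[Pose, float]]


def generate_with_counter(
    step: StepFn,
    first_pose: Pose,
    max_frames: int = 1000,
) -> Tuple[List[Pose], List[float]]:
    """Produce frames autoregressively until the predicted counter reaches 1.0.

    max_frames is a safety cap (an assumption of this sketch) in case the
    counter never reaches 1.0.
    """
    poses: List[Pose] = [first_pose]
    counters: List[float] = [0.0]
    while len(poses) < max_frames:
        pose, counter = step(poses, counters)
        poses.append(pose)
        counters.append(counter)
        if counter >= 1.0:  # counter at 1.0 marks the end of the sequence
            break
    return poses, counters


def toy_step(poses: List[Pose], counters: List[float]) -> Tuple[Pose, float]:
    """Hypothetical stand-in for a trained decoder: fixed counter increments."""
    next_counter = min(1.0, counters[-1] + 0.25)
    next_pose = [value + 0.1 for value in poses[-1]]
    return next_pose, next_counter


if __name__ == "__main__":
    poses, counters = generate_with_counter(toy_step, [0.0, 0.0])
    print(len(poses), counters)
```

Because the stopping rule depends on a continuous counter value, the output length is decided during generation and is not fixed in advance, which is the property the abstract attributes to counter decoding.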

Highlights

Sign languages are visual multi-channel languages and the main medium of communication for the Deaf
We present a Continuous 3D Multi-Channel Sign Language Production model, the first sign language production (SLP) network to translate from spoken language sentences to continuous 3D multi-channel sign language sequences in an end-to-end manner
To overcome the issues of deterministic prediction, we propose the use of a mixture density network (MDN) to model the variation found in sign language
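As a rough illustration of the MDN highlight, the sketch below computes the negative log-likelihood of one target pose under a mixture of Gaussians, which is the usual training objective for a mixture density network. It assumes isotropic components and strictly positive mixture weights; these choices, and the name `mdn_nll`, are assumptions of this sketch and are not stated in the text above.

```python
import math
from typing import Sequence


def mdn_nll(
    weights: Sequence[float],
    means: Sequence[Sequence[float]],
    sigmas: Sequence[float],
    target: Sequence[float],
) -> float:
    """Negative log-likelihood of target under a mixture of isotropic Gaussians.

    weights: mixture coefficients, assumed positive and summing to 1
    means:   one mean vector per component
    sigmas:  one standard deviation per component (isotropic assumption)
    """
    dim = len(target)
    log_terms = []
    for weight, mean, sigma in zip(weights, means, sigmas):
        sq_dist = sum((t - m) ** 2 for t, m in zip(target, mean))
        log_norm = -0.5 * dim * math.log(2.0 * math.pi * sigma ** 2)
        log_terms.append(math.log(weight) + log_norm - 0.5 * sq_dist / sigma ** 2)
    # Log-sum-exp over components, shifted by the maximum for numerical stability.
    top = max(log_terms)
    log_likelihood = top + math.log(sum(math.exp(v - top) for v in log_terms))
    return -log_likelihood


if __name__ == "__main__":
    weights = [0.5, 0.5]
    means = [[0.0, 0.0], [4.0, 4.0]]
    sigmas = [1.0, 1.0]
    # A target near either component mean scores better than one between them.
    print(mdn_nll(weights, means, sigmas, [0.0, 0.0]))
    print(mdn_nll(weights, means, sigmas, [2.0, 2.0]))
```

A single deterministic prediction trained with a squared-error loss would be pulled toward the midpoint of the two modes in this toy example, whereas the mixture assigns high likelihood to each mode separately.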

Summary

Introduction

Sign languages are visual multi-channel languages and the main medium of communication for the Deaf. Around 5% of the world's population experience some form of hearing loss (World Health Organisation 2020). In the UK alone, there are an estimated 9 million people who are Deaf or hard of hearing (British Deaf Association 2020). For the Deaf native signer, a spoken language may be a second language, meaning their spoken language skills can vary immensely (Holt 1993). Sign languages are the preferred form of communication for the Deaf communities. Sign languages possess a different grammatical structure and syntax to spoken languages (Stokoe 1980). The translation between spoken and sign languages therefore requires a change in order and structure.

Communicated by Manuel J.
