Abstract

Sign language recognition is a difficult task that often requires tedious frame-level annotation by sign language experts. End-to-end approaches that bypass frame-level annotations have achieved some success on limited datasets, but high-quality annotations have been shown to improve performance drastically. Recent unsupervised learning methods based on deep neural networks have succeeded at learning feature extraction, yet no technique exists for high-quality frame-level classification using unsupervised methods. In this paper, we assign labels to an isolated Sign Language (SL) dataset using end-to-end neural network architectures that have proven successful in the unsupervised discovery of sub-word acoustic units in speech processing. We observe that key hand-shapes (KHS), the meaningful visual building blocks of signs in an SL dataset, can be detected using unsupervised clustering techniques. Sparse autoencoders can successfully retrieve and cluster the KHSs used in isolated signs. In addition, using corresponding frames in an autoencoder scheme makes it possible to continue the learning process.
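The abstract's pipeline, compressing frames with a sparse autoencoder and then clustering the learned codes into candidate key hand-shapes, can be illustrated with a minimal sketch. This is not the paper's implementation: the network sizes, the L1 sparsity weight, the use of tied weights and k-means, and the toy "frame" vectors below are all illustrative assumptions.

```python
import numpy as np

# Hypothetical sketch: a tied-weight sparse autoencoder on toy frame
# vectors, followed by k-means on the hidden codes to propose KHS
# clusters. All hyperparameters and data are assumptions.

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_sparse_autoencoder(X, hidden=8, l1=1e-3, lr=0.1, epochs=200):
    """Autoencoder with an L1 sparsity penalty on the hidden codes."""
    n, d = X.shape
    W = rng.normal(scale=0.1, size=(d, hidden))
    b1 = np.zeros(hidden)
    b2 = np.zeros(d)
    losses = []
    for _ in range(epochs):
        H = sigmoid(X @ W + b1)           # encode frames
        Xhat = H @ W.T + b2               # decode (tied weights, linear output)
        E = Xhat - X                      # reconstruction error
        losses.append(np.mean(E ** 2) + l1 * np.mean(np.abs(H)))
        # Backprop through MSE + L1 sparsity penalty
        dXhat = 2 * E / E.size
        dH = dXhat @ W + (l1 / H.size) * np.sign(H)
        dZ = dH * H * (1 - H)
        dW = X.T @ dZ + dXhat.T @ H       # tied weights: both paths contribute
        W -= lr * dW
        b1 -= lr * dZ.sum(axis=0)
        b2 -= lr * dXhat.sum(axis=0)
    return W, b1, losses

def kmeans(Z, k=3, iters=50):
    """Plain k-means over the hidden codes."""
    C = Z[rng.choice(len(Z), size=k, replace=False)]
    for _ in range(iters):
        labels = np.argmin(((Z[:, None] - C[None]) ** 2).sum(-1), axis=1)
        for j in range(k):
            if np.any(labels == j):
                C[j] = Z[labels == j].mean(axis=0)
    return labels

# Toy data: three synthetic "hand-shape" prototypes plus noise.
protos = rng.normal(size=(3, 16))
X = np.vstack([p + 0.05 * rng.normal(size=(30, 16)) for p in protos])
W, b1, losses = train_sparse_autoencoder(X)
codes = sigmoid(X @ W + b1)
labels = kmeans(codes, k=3)
```

Under this reading, each cluster label is a candidate KHS assignment for its frame, which could then seed the frame-level labels the abstract describes.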
