Abstract

Traditional Sign Language Recognition (SLR) suffers from the limited scale of SL datasets, which can lead to over-fitting to narrow contexts and applications. In this paper, to address this problem, we propose, for the first time, a Combinational Sign Language Recognition (CombSLR) framework, which serves as an augmentation that extends existing datasets by combining continuous videos (called Templates) with isolated videos (called Entities). The CombSLR framework is trained on combinational SL data (T & E) and applied to continuous SL data. However, because the combination location is unknown and the context of any T-E pair is inconsistent, naively inserting E into T is infeasible. To tackle this issue, we propose a simple yet effective method named EinT, which contains two main modules: (1) Location Candidate Prediction, which produces a reliable insertion location by considering inter-frame relationships and makes the network end-to-end trainable; and (2) Feature Insertion via Context Passing, which eliminates context inconsistency between the T and E features. EinT is readily compatible with existing SLR models and implements data augmentation at the feature level during the training stage. We conduct extensive experiments on multiple publicly available sign language datasets, e.g., CCLS, CSL+DEVISIGN-D, and CSL-Daily+DEVISIGN-D. The experimental results show that CombSLR significantly improves existing SLR methods, e.g., by an average of 15.1% on the CCLS dataset and 6.4% on the CSL dataset in the WER metric, which demonstrates the superiority of the CombSLR framework.
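The core augmentation idea — splicing an isolated-sign (Entity) feature sequence into a continuous (Template) feature sequence at a predicted location — can be illustrated with a minimal NumPy sketch. All names here are hypothetical; in particular, `predict_location` below is a crude heuristic stand-in for the paper's learned Location Candidate Prediction module, and no context passing is modeled.

```python
import numpy as np

def predict_location(template_feats):
    """Toy location predictor (NOT the paper's learned module):
    pick the frame boundary where adjacent features differ most,
    as a stand-in for a reliable insertion point."""
    diffs = np.linalg.norm(np.diff(template_feats, axis=0), axis=1)
    return int(np.argmax(diffs)) + 1  # boundary index in [1, T-1]

def insert_entity(template_feats, entity_feats, loc):
    """Feature-level insertion: splice the Entity sequence into the
    Template sequence at boundary index `loc`."""
    return np.concatenate(
        [template_feats[:loc], entity_feats, template_feats[loc:]], axis=0
    )

rng = np.random.default_rng(0)
template = rng.normal(size=(10, 4))  # 10 Template frames, 4-dim features
entity = rng.normal(size=(3, 4))     # 3-frame isolated sign (Entity)

loc = predict_location(template)
augmented = insert_entity(template, entity, loc)
# augmented has 13 frames; frames [loc, loc+3) are the Entity features
```

In the actual framework this splice happens on intermediate features during training, and the Context Passing module would additionally smooth the boundary between T and E features rather than concatenating them raw.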
