Abstract
This study introduces a comprehensive approach to improving fingerspelling recognition in dynamic environments. The methodology begins with spatial feature extraction using MobileNetV3-Small, whose output is transformed by a projection layer into a latent space. A Variable-Filter-Length Temporal-Learning Convolutional Neural Network (VTCNN) then extracts both short-range and long-range temporal features, yielding a robust representation of dynamic gestures. The recognition system uses a shared encoder for both a Connectionist Temporal Classification (CTC) decoder and an attention-based decoder, capitalizing on the complementary strengths of the two decoders. To address the weak-supervision challenge, a retraining strategy based on supervised contrastive learning (SupCon) is proposed: decoding results from the CTC decoder are used to construct an image set with frame-level labels, which sharpens the separation between fingerspelling gestures and improves overall accuracy. Finally, a joint CTC/attention decoding strategy based on beam search combines the outputs of the two decoders, further improving recognition performance. Together, the proposed methods (VTCNN for temporal feature extraction, multi-task learning to exploit both decoders, SupCon for refining feature clusters, and joint decoding) culminate in a state-of-the-art fingerspelling recognition system, validated through benchmarking on the ChicagoFSWild and ChicagoFSWild+ datasets.
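For concreteness, the sketch below outlines the described pipeline in PyTorch: per-frame spatial features from MobileNetV3-Small, a projection into a latent space, parallel temporal convolutions with different filter lengths (one plausible reading of the VTCNN idea), a shared encoder, and a CTC head. The class names, kernel sizes (3, 7, 15), hidden width (256), and the choice of a Transformer as the shared encoder are illustrative assumptions, not details from the paper; the attention decoder, SupCon retraining, and beam search are omitted for brevity.

```python
import torch
import torch.nn as nn
import torchvision.models as models


class VariableFilterTemporalCNN(nn.Module):
    """Parallel 1D convolutions with different kernel sizes, intended to
    capture short- and long-range temporal context. The kernel sizes are
    placeholders, not values from the paper."""

    def __init__(self, dim, kernel_sizes=(3, 7, 15)):
        super().__init__()
        self.branches = nn.ModuleList(
            [nn.Conv1d(dim, dim, k, padding=k // 2) for k in kernel_sizes]
        )
        self.fuse = nn.Linear(dim * len(kernel_sizes), dim)

    def forward(self, x):                 # x: (batch, time, dim)
        x = x.transpose(1, 2)             # Conv1d expects (batch, dim, time)
        feats = [branch(x) for branch in self.branches]
        x = torch.cat(feats, dim=1).transpose(1, 2)
        return self.fuse(x)               # back to (batch, time, dim)


class FingerspellingRecognizer(nn.Module):
    """Hypothetical end-to-end sketch: spatial CNN -> projection ->
    temporal CNN -> shared encoder -> CTC head."""

    def __init__(self, vocab_size, dim=256):
        super().__init__()
        backbone = models.mobilenet_v3_small(weights=None)
        self.cnn = backbone.features     # per-frame spatial feature maps
        self.project = nn.Linear(576, dim)  # 576 = MobileNetV3-Small channels
        self.temporal = VariableFilterTemporalCNN(dim)
        self.encoder = nn.TransformerEncoder(  # shared encoder (architecture assumed)
            nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True),
            num_layers=2,
        )
        self.ctc_head = nn.Linear(dim, vocab_size)

    def forward(self, frames):            # frames: (batch, time, 3, H, W)
        b, t = frames.shape[:2]
        f = self.cnn(frames.flatten(0, 1))          # (b*t, 576, h', w')
        f = f.mean(dim=(2, 3))                      # global-average-pool spatial maps
        h = self.project(f).view(b, t, -1)          # latent sequence
        h = self.encoder(self.temporal(h))
        # Per-frame log-probabilities; nn.CTCLoss expects time-first input
        # during training, and the attention decoder would attend over h.
        return self.ctc_head(h).log_softmax(-1)
```

At inference, the standard joint CTC/attention combination scores each beam-search hypothesis as lambda * log p_CTC + (1 - lambda) * log p_attention, with lambda a tunable interpolation weight; this is one common realization of the joint decoding strategy the abstract describes.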