This paper proposes a new technique for adding sign language as a special, real-time feature. In addition to gestures, sign language also relies on grammatical rules and linguistic structure. We therefore require a method that can extract meaningful sentences from non-manual features. This is accomplished using techniques from computer graphics, neurologic music therapy, and neural-network-based image/video formation. Our goal is to process dynamic images for output generation and real-time classification. Current CNN-based techniques operate by taking the entire video as input, dividing it into layers for the classifier to work on, and then combining the results and presenting the output to the user. Here, a Convolutional Deep VGG-16 (CDVGG-16) classifier is adopted for sign feature learning and is iteratively trained and tested. Its architecture consists of blocks, each composed of 2D convolution and max-pooling layers. We prefer VGG-16 over VGG-19 to improve feature extraction and reduce overfitting. Processing the captured sign language video through pre-processing and classification steps extracts a sign language feature space that assists deaf people. The deep-learning-based Sign Language Recognition (SLR) modeling framework is introduced because it offers several notable advantages. Provisioning the SLR framework involves three essential phases: (1) data acquisition, (2) image pre-processing, and (3) the CDVGG-16 classifier. The simulation is carried out in the MATLAB 2020a environment and evaluated with performance metrics such as accuracy, precision, recall, F-measure, and running time. The promising results show that real-world SLR applications can be built using the suggested SLR paradigm. The proposed CDVGG-16 achieves 95.6% prediction accuracy, 85.6% precision, 99.9% recall, a 96.8% F-measure, and a 0.125 s running time. For improved performance, our proposed method employs normalization and examines sequential image frames. The results indicate that it is a suitable method for video-image classification, as expected, and that deeper layers and data augmentation can address overfitting and underfitting issues. Rather than producing a literal frame-by-frame translation, the method adjusts the filter size so that the output is translated appropriately for its context.
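To make the classifier phase concrete, the following is a minimal sketch of a VGG-16-style block stack in MATLAB (the paper's stated simulation environment), assuming the Deep Learning Toolbox. The input size, number of blocks shown, class count, datastore folder name (`signFrames`), and training options are illustrative assumptions, not the authors' exact configuration.

```matlab
% Minimal sketch of a VGG-16-style block stack for classifying
% pre-processed sign-language frames (requires Deep Learning Toolbox).
% NOTE: sizes, depths, and options below are illustrative assumptions.

inputSize  = [224 224 3];   % assumed frame size after pre-processing
numClasses = 26;            % assumed: one class per sign

layers = [
    % Input normalization, per the abstract's normalization step
    imageInputLayer(inputSize, 'Normalization', 'zerocenter')

    % Block 1: two 3x3 convolutions followed by 2x2 max pooling
    convolution2dLayer(3, 64, 'Padding', 'same')
    reluLayer
    convolution2dLayer(3, 64, 'Padding', 'same')
    reluLayer
    maxPooling2dLayer(2, 'Stride', 2)

    % Block 2: doubled filter count, same conv + pool pattern
    convolution2dLayer(3, 128, 'Padding', 'same')
    reluLayer
    convolution2dLayer(3, 128, 'Padding', 'same')
    reluLayer
    maxPooling2dLayer(2, 'Stride', 2)

    % ... further blocks (256, 512, 512 filters) repeat the pattern

    fullyConnectedLayer(numClasses)
    softmaxLayer
    classificationLayer
];

% Assumed folder-per-class layout of pre-processed sign frames.
imds = imageDatastore('signFrames', ...
    'IncludeSubfolders', true, 'LabelSource', 'foldernames');

opts = trainingOptions('sgdm', ...
    'InitialLearnRate', 1e-3, ...
    'MaxEpochs', 10, ...
    'Shuffle', 'every-epoch');

net = trainNetwork(imds, layers, opts);   % iterative training
```

Each block doubles the filter count while max pooling halves the spatial resolution, which is the block structure of 2D convolution and max-pooling layers that the abstract attributes to CDVGG-16.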