With technological development, human–computer interaction (HCI) has improved, and spoken communication between humans and machines is one way to make that interaction faster and more natural. In recent decades, researchers have explored several systems to improve speech and speaker recognition performance. A key challenge in HCI is developing models that can listen and respond as effectively as humans. This challenge led to the development of automated speech emotion recognition (SER) methods, which recognize emotional classes by selecting and extracting effective features from speech signals. The fundamental problem of automated speech detection is the considerable variation in speech signals caused by different speakers, languages, speaking styles, content, and acoustic conditions, as well as by age- and gender-dependent differences in voice modulation. With advances in deep learning (DL) and the affordability of computational resources, specifically graphics processing units (GPUs), research in this area has undergone a paradigm shift. Therefore, this study develops a multi-class automated speech language recognition using natural language processing with optimal deep learning (MASLR-NLPODL) technique. The MASLR-NLPODL technique aims to identify different spoken languages efficiently. In the MASLR-NLPODL technique, the initial preprocessing stage comprises pre-emphasis, frame blocking, and windowing. Next, an adaptive time-frequency feature extraction approach based on the discrete fractional Fourier transform (DFrFT) is applied; the DFrFT is obtained through an eigenvector-based extension of the discrete Fourier transform (DFT). An improved Harris hawks optimization (IHHO) technique is then employed to select the most effective features. The spoken languages are subsequently classified by a gated recurrent unit (GRU) model.
Finally, a salp swarm algorithm (SSA)-based hyperparameter selection process is used to enhance the performance of the GRU model. The novelty of the work lies in the design of the IHHO-based feature selection and the SSA-based hyperparameter tuning. The MASLR-NLPODL technique is evaluated on the VoxForge dataset. In the experimental validation, it achieved an accuracy of 96.40%, outperforming existing techniques.
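
The preprocessing stage named in the abstract (pre-emphasis, frame blocking, and windowing) can be illustrated with a minimal Python sketch. The abstract does not give parameter values, so everything specific here is an assumption made for illustration: the pre-emphasis coefficient of 0.97, the Hamming window, the frame and hop lengths of 400 and 160 samples (25 ms and 10 ms at an assumed 16 kHz sampling rate), the conventional ordering of the three steps, and all function names. It should not be read as the authors' implementation.

```python
import math


def pre_emphasis(signal, alpha=0.97):
    """Boost high frequencies: y[n] = x[n] - alpha * x[n-1].

    The first sample is passed through unchanged. alpha=0.97 is an assumed,
    commonly used default, not a value taken from the paper.
    """
    if not signal:
        return []
    return [signal[0]] + [
        signal[n] - alpha * signal[n - 1] for n in range(1, len(signal))
    ]


def frame_blocking(signal, frame_len, hop_len):
    """Split the signal into overlapping frames.

    Trailing samples that do not fill a whole frame are dropped.
    """
    if len(signal) < frame_len:
        return []
    n_frames = 1 + (len(signal) - frame_len) // hop_len
    return [signal[i * hop_len: i * hop_len + frame_len] for i in range(n_frames)]


def hamming_window(frame_len):
    """Symmetric Hamming window (an assumed choice of window function)."""
    if frame_len == 1:
        return [1.0]
    return [
        0.54 - 0.46 * math.cos(2.0 * math.pi * n / (frame_len - 1))
        for n in range(frame_len)
    ]


def preprocess(signal, frame_len=400, hop_len=160, alpha=0.97):
    """Pre-emphasis, then frame blocking, then windowing of each frame."""
    emphasized = pre_emphasis(signal, alpha)
    frames = frame_blocking(emphasized, frame_len, hop_len)
    window = hamming_window(frame_len)
    return [[s * w for s, w in zip(frame, window)] for frame in frames]


if __name__ == "__main__":
    # Synthetic 440 Hz tone, 0.1 s at an assumed 16 kHz sampling rate.
    sample_rate = 16000
    tone = [math.sin(2.0 * math.pi * 440.0 * t / sample_rate) for t in range(1600)]
    frames = preprocess(tone)
    print(len(frames), "frames of", len(frames[0]), "samples each")
```

Each windowed frame would then be passed to the feature extractor; in the technique described here that is the DFrFT-based stage, which this sketch does not cover.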