Hybrid Feature Selection Method Based on Harmony Search and Naked Mole-Rat Algorithms for Spoken Language Identification From Audio Signals

Samarpan Guha, Ram Sarkar, Pawan Kumar Singh, Aankit Das, Ali Ahmadian, Norazak Senu

doi:10.1109/access.2020.3028121

Abstract

Artificial intelligence and its applications dominate the current era. One such application is Spoken Language Identification (S-LID), which has long been a challenging problem and an important research area in speech signal processing. This paper addresses S-LID for Human-Computer Interaction (HCI) based applications by classifying languages from three multilingual databases, namely CSS10 (A Collection of Single Speaker Speech Datasets for 10 Languages), VoxForge, and the Indian Institute of Technology, Madras (IIT-Madras) speech corpus. Mel-spectrogram features and Relative Spectral Transform - Perceptual Linear Prediction (RASTA-PLP) features are extracted from the audio. A new hybrid Feature Selection (FS) algorithm has been developed by combining the versatile Harmony Search (HS) algorithm with a new nature-inspired algorithm called the Naked Mole-Rat (NMR) algorithm; it selects the best subset of features and reduces model complexity so that the model trains faster. The selected feature set is fed to five classifiers: Support Vector Machine (SVM), k-Nearest Neighbor (k-NN), Multi-layer Perceptron (MLP), Naïve Bayes (NB) and Random Forest (RF). The evaluation measures used in this paper are precision, recall, F1-score, classification accuracy and the number of selected features. Using RF, accuracies of 99.89% on CSS10, 98.22% on VoxForge and 99.75% on the IIT-Madras speech corpus are achieved. Furthermore, the proposed algorithm is found to outperform 15 standard meta-heuristic FS algorithms. The source code of this work is available at: https://github.com/CodeChef97dotcom/HS-NMR.git.
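To make the feature-selection idea in the abstract more concrete, the following is a minimal sketch of a plain binary harmony search over feature masks. It is not the authors' hybrid HS-NMR method (see their repository for that); it only illustrates the general HS loop. The names `harmony_search_fs`, `toy_fitness` and `INFORMATIVE`, the parameter defaults, and the toy fitness itself are all assumptions made for illustration. In a wrapper setting like the one described above, the fitness would more plausibly come from a classifier's validation accuracy on the selected features.

```python
import random

# Hypothetical indices of "useful" features, used only by the toy fitness below.
INFORMATIVE = {0, 3, 7}


def toy_fitness(mask):
    """Toy stand-in for classifier accuracy: reward useful features, penalize size."""
    hits = sum(mask[i] for i in INFORMATIVE)
    return hits - 0.1 * sum(mask)


def harmony_search_fs(fitness, n_features, hms=10, hmcr=0.9, par=0.3,
                      iters=200, seed=0):
    """Plain binary harmony search (illustrative; not the paper's HS-NMR hybrid).

    hms  : harmony memory size (number of stored candidate masks)
    hmcr : harmony memory considering rate
    par  : pitch adjusting rate (here: probability of flipping a copied bit)
    """
    rng = random.Random(seed)
    # Harmony memory: hms random binary masks (1 = feature kept).
    memory = [[rng.randint(0, 1) for _ in range(n_features)] for _ in range(hms)]
    scores = [fitness(m) for m in memory]

    for _ in range(iters):
        new = []
        for j in range(n_features):
            if rng.random() < hmcr:
                # Memory consideration: copy bit j from a randomly chosen harmony.
                bit = memory[rng.randrange(hms)][j]
                if rng.random() < par:
                    bit = 1 - bit  # pitch adjustment: flip the copied bit
            else:
                bit = rng.randint(0, 1)  # random consideration
            new.append(bit)

        # Replace the worst stored harmony if the new one is better.
        new_score = fitness(new)
        worst = min(range(hms), key=scores.__getitem__)
        if new_score > scores[worst]:
            memory[worst], scores[worst] = new, new_score

    best = max(range(hms), key=scores.__getitem__)
    return memory[best], scores[best]


best_mask, best_score = harmony_search_fs(toy_fitness, n_features=12)
selected = [i for i, bit in enumerate(best_mask) if bit]
print("selected features:", selected, "fitness:", best_score)
```

The size penalty in `toy_fitness` mirrors the abstract's stated goal of reducing the number of selected features; the weight 0.1 is arbitrary.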

Highlights

Spoken Language Identification (S-LID) is the process of identifying and classifying a digitized natural spoken language by applying computational linguistic methods to the given content or data
CSS10 is a collection of single-speaker speech datasets for 10 languages, consisting of short audio clips from LibriVox audiobooks
Out of the 17 languages available, 6 have been used, namely “English”, “French”, “German”, “Italian”, “Russian” and “Spanish”. This is because the quality of these audio files is relatively better than that of the others, and their length and format are appropriate for this experiment

Summary

Introduction

Spoken Language Identification (S-LID) is the process of identifying and classifying a digitized natural spoken language by applying computational linguistic methods to the given content or data [1]. The classification is made over a set of possible target languages [2], either a closed set in which all possibilities are known or an open set in which unknown languages are included in the test corpora. S-LID has always been a challenging problem owing to variations in the type of speech input and the difficulty of understanding how human beings comprehend and interpret speech in different conditions [7]. This makes it an important research topic in the field of speech signal processing.

The associate editor coordinating the review of this manuscript and approving it for publication was Md.
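The closed-set versus open-set distinction mentioned in the Introduction can be sketched as a small decision rule. This is only an illustration under stated assumptions: the function name `identify_language`, the score dictionary, and the use of a simple rejection threshold for the open-set case are hypothetical and are not taken from the paper.

```python
def identify_language(scores, reject_threshold=None):
    """Pick a language label from per-language classifier scores.

    scores           : dict mapping language name -> score (higher = more likely)
    reject_threshold : None for closed-set identification; a number for a
                       simple open-set rule that returns "unknown" when even
                       the best score falls below the threshold (an assumption).
    """
    best = max(scores, key=scores.get)
    if reject_threshold is not None and scores[best] < reject_threshold:
        return "unknown"
    return best


# Hypothetical scores for one utterance.
example_scores = {"English": 0.5, "French": 0.3, "German": 0.2}
print(identify_language(example_scores))                        # closed set
print(identify_language(example_scores, reject_threshold=0.6))  # open set
```

In the closed-set case the best-scoring target language is always returned; in the open-set case a low-confidence utterance can be rejected instead of being forced into one of the known languages.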

Methods

Results

Conclusion