Speaker Identification Using a Convolutional Neural Network

Suci Dwijayanti, Bhakti Yudho Suprapto, Alvio Yunita Putri

doi:10.29207/resti.v6i1.3795

Open Access

https://doi.org/10.29207/resti.v6i1.3795

Abstract

Speech, a mode of communication between humans and machines, has various applications, including biometric systems that identify people who have access to secure systems. Feature extraction is an important factor in achieving high accuracy in speech recognition. Therefore, we used spectrograms, which are pictorial representations of speech, as raw features to identify speakers. These features were input into a convolutional neural network (CNN), and a CNN-visual geometry group (CNN-VGG) architecture was used to recognize the speakers. We used 780 primary data samples from 78 speakers, and each speaker uttered a number in Bahasa Indonesia. The proposed architecture, CNN-VGG-f, was trained with a learning rate of 0.001, a batch size of 256, and 100 epochs. The results indicate that this architecture can generate a suitable model for speaker identification. A spectrogram was used to determine the best features for identifying the speakers. The proposed method exhibited an accuracy of 98.78%, which is significantly higher than the accuracies of the methods based on Mel-frequency cepstral coefficients (MFCCs; 34.62%) and on the combination of MFCCs and deltas (26.92%). Overall, CNN-VGG-f with the spectrogram can identify 77 speakers from the samples, validating the usefulness of the combination of spectrograms and CNNs in speech recognition applications.
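
To make the notion of a spectrogram as a raw input feature concrete, the following is a minimal sketch of a log-magnitude spectrogram computed with a short-time Fourier transform. It is not the authors' implementation: the function name `log_spectrogram`, the 25 ms frame, 10 ms hop, 512-point FFT, and the 16 kHz sampling rate in the usage note are all assumptions chosen for illustration, since the abstract does not state them.

```python
import numpy as np

def log_spectrogram(signal, frame_len=400, hop=160, n_fft=512):
    """Return a log-magnitude spectrogram of shape (n_frames, n_fft // 2 + 1).

    frame_len, hop, and n_fft are illustrative defaults (25 ms frames and
    10 ms hop at an assumed 16 kHz sampling rate), not values from the paper.
    """
    signal = np.asarray(signal, dtype=np.float64)
    # Pad very short signals so that at least one full frame exists.
    if signal.size < frame_len:
        signal = np.pad(signal, (0, frame_len - signal.size))
    n_frames = 1 + (signal.size - frame_len) // hop
    # Build an index matrix that selects overlapping frames.
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    # Apply a Hann window to each frame to reduce spectral leakage.
    frames = signal[idx] * np.hanning(frame_len)
    # One-sided FFT per frame; keep only the magnitude.
    magnitude = np.abs(np.fft.rfft(frames, n=n_fft, axis=1))
    # Small constant avoids log(0) on silent frames.
    return np.log(magnitude + 1e-10)
```

As a usage example, calling `log_spectrogram` on one second of audio sampled at 16 kHz yields a two-dimensional array (time frames by frequency bins) that can be treated as a single-channel image and passed to a CNN.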

Highlights

Speech is being increasingly used in human–machine interactions for various applications, such as biometric security systems.
To assess how well spectrograms perform as input features for a convolutional neural network (CNN), we compared them against Mel-frequency cepstral coefficients (MFCCs) and against MFCCs combined with their delta and delta-delta features, which have previously exhibited satisfactory results in speech recognition [14].
We determined that the VGG-f architecture, which is deeper than a simple CNN architecture, is suitable for speaker identification when trained with a learning rate of 0.001 and a batch size of 256; it can effectively extract features from spectrograms.
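
The delta and delta-delta features mentioned in the highlights above are commonly obtained by a regression over neighboring frames. The sketch below uses that widely used formula; the paper does not state how its deltas were computed, so the function names `delta` and `stack_with_deltas`, the window half-width of 2, and the edge padding are assumptions made for illustration.

```python
import numpy as np

def delta(features, width=2):
    """Regression-based delta of a (n_frames, n_coeffs) feature matrix.

    d[t] = sum_{n=1..width} n * (c[t+n] - c[t-n]) / (2 * sum_{n=1..width} n^2)
    Edge frames are handled by repeating the first and last frame (an assumption).
    """
    features = np.asarray(features, dtype=np.float64)
    n_frames = features.shape[0]
    padded = np.pad(features, ((width, width), (0, 0)), mode="edge")
    denom = 2 * sum(n * n for n in range(1, width + 1))
    out = np.zeros_like(features)
    for n in range(1, width + 1):
        ahead = padded[width + n : width + n + n_frames]    # c[t + n]
        behind = padded[width - n : width - n + n_frames]   # c[t - n]
        out += n * (ahead - behind)
    return out / denom

def stack_with_deltas(mfcc):
    """Concatenate static, delta, and delta-delta features along the coefficient axis."""
    d1 = delta(mfcc)
    d2 = delta(d1)
    return np.concatenate([mfcc, d1, d2], axis=1)
```

For example, a matrix of 13 MFCCs per frame becomes a matrix of 39 features per frame after `stack_with_deltas`, with the number of frames unchanged.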

Summary

Introduction

Speech is being increasingly used in human–machine interactions for various applications, such as biometric security systems. Since unsecured systems are prone to risks, such as robbery, demolition, and misuse, security plays a major role in the lives of people. Conventional security systems rely on pattern, PIN, or password locks, which are flawed because they can be hacked. Security systems based on the identification of human physiological characteristics, namely biometrics, are therefore preferred over these methods. Biometric systems use pattern recognition to determine and verify the identity of a person.
