Abstract

Various biometric security systems have been developed based on face recognition, fingerprints, voice, hand geometry, and the iris. Beyond serving as a communication medium, the human voice is also a biometric that can be used for identification: it has unique characteristics that distinguish one person from another. A speaker recognition system must therefore capture the features that characterize a person's voice. This study develops a human speaker recognition system using the Convolutional Neural Network (CNN) method and proposes improvements to the fine-tuning layers of the CNN architecture to improve accuracy. The recognition system combines the CNN with Mel-Frequency Cepstral Coefficients (MFCC) for feature extraction from raw audio and K-Nearest Neighbor (KNN) for classifying the embedding output. In outline, the system first extracts voice features using MFCC, then applies a CNN trained with triplet loss to obtain a 128-dimensional embedding, and finally classifies the CNN embedding output with KNN. The experiments were conducted on 50 speakers from the TIMIT dataset, with eight utterances per speaker, and on 60 speakers recorded live with a smartphone. The proposed system achieves high recognition accuracy on both datasets. Further research could combine different biometric objects, commonly known as multimodal biometrics, to improve recognition accuracy further.
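The final classification stage described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the MFCC and CNN stages are stood in for by synthetic 128-dimensional embeddings (the abstract does not specify the network architecture), and the KNN step is shown as a Euclidean-distance majority vote, a common default choice.

```python
import numpy as np
from collections import Counter

def knn_classify(embedding, gallery, labels, k=5):
    """Classify a 128-d speaker embedding by majority vote among
    its k nearest gallery embeddings (Euclidean distance)."""
    dists = np.linalg.norm(gallery - embedding, axis=1)
    nearest = np.argsort(dists)[:k]
    votes = Counter(labels[i] for i in nearest)
    return votes.most_common(1)[0][0]

# Toy enrollment gallery: two speakers with eight utterances each,
# using clustered random vectors as stand-ins for CNN embeddings.
rng = np.random.default_rng(0)
gallery = np.vstack([rng.normal(0.0, 0.1, (8, 128)),
                     rng.normal(1.0, 0.1, (8, 128))])
labels = ["spk_a"] * 8 + ["spk_b"] * 8

query = rng.normal(1.0, 0.1, 128)  # a test utterance near speaker B's cluster
print(knn_classify(query, gallery, labels))  # → spk_b
```

In a triplet-loss setup, embeddings of the same speaker are trained to lie close together while different speakers are pushed apart, which is what makes this simple nearest-neighbor vote effective at test time.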
