Abstract

Communication between people and machines has expanded considerably over the last two decades, and corresponding techniques have been developed to meet the need for voice understanding, including large-scale speech and speaker recognition. In this paper, the authors propose a simplified deep-learning approach that accomplishes the large-scale speaker identification task using as little training data as possible. The Fisher speech corpus was explored to select recordings of unique speakers with sufficient data. The authors use mel-frequency cepstral coefficients (MFCCs) to represent the feature vectors of a large set of more than 4,000 speakers, comprising about 343 hours of speech signals. The solution omits pre-processing and considers longer segments of the voice signals. Various proportions of training data were tested, including dedicating larger percentages of the data to testing. Bidirectional LSTM neural networks achieved an accuracy of up to 76.9% on individual voice segments, and 99.5% when the segments of each speaker were considered as a bundle. Doubling the amount of training data yielded a perfect accuracy of 100%.
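The gap between the segment-level accuracy (76.9%) and the bundle-level accuracy (99.5%) suggests that per-segment classifier decisions are aggregated into a single speaker-level decision. The abstract does not state the aggregation rule, so the majority-vote scheme below is only an illustrative assumption of how bundling segment predictions can outvote individual segment errors:

```python
from collections import Counter

def bundle_decision(segment_predictions):
    """Aggregate per-segment speaker predictions into one
    speaker-level decision by majority vote.

    NOTE: the majority-vote rule is a hypothetical illustration;
    the paper reports segment-level and bundle-level accuracies
    but this exact aggregation is an assumption.
    """
    votes = Counter(segment_predictions)
    return votes.most_common(1)[0][0]

# Hypothetical per-segment classifier output for one test speaker:
# four segments are classified correctly, one is misclassified.
segments = ["spk_017", "spk_017", "spk_102", "spk_017", "spk_017"]
print(bundle_decision(segments))  # prints "spk_017"
```

Under this kind of scheme, a speaker is identified correctly as long as a plurality of their segments is classified correctly, which is one plausible explanation for the much higher bundle-level accuracy.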


