Abstract

Identifying the social background of an unknown speaker from speech accent has multiple applications, including forensic profiling and the adaptation of speech recognition systems. The most effective accent classification models, which are based on phoneme pronunciation, require certain phonemes to be present in the test speech and are therefore applicable only to longer test samples. Text-independent classifiers, on the other hand, disregard phoneme and linguistic information completely. This paper proposes an ensemble of convolutional neural networks that combines phoneme-based short-term classification with text-independent long-term classification of speech for speaker background profiling. The model is evaluated by classifying the native language of Indian speakers from their English speech. The two classifiers in the ensemble complement each other, giving higher classification accuracy than either classifier achieves individually. Speech augmentation based on low-pass filtering is shown to further improve classification performance, with accuracies of up to 79% and 73.7% achieved for speaker-level and sentence-level tests, respectively.
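As an illustration of the low-pass filtering augmentation mentioned above, the following is a minimal sketch only. The abstract does not state the filter design, the cutoff frequencies, or how the augmented copies are used in training, so everything here is an assumption: a first-order (one-pole) IIR filter stands in for whatever filter the authors used, and the function names `one_pole_low_pass` and `augment_with_low_pass` and the default cutoffs are hypothetical.

```python
import math


def one_pole_low_pass(samples, cutoff_hz, sample_rate_hz):
    """Apply a first-order IIR low-pass filter (exponential smoothing).

    Assumption: the paper's actual filter type and order are not given;
    a one-pole filter is used here only because it is simple to follow.
    """
    if not 0 < cutoff_hz < sample_rate_hz / 2:
        raise ValueError("cutoff must lie between 0 and the Nyquist frequency")
    # Smoothing coefficient from the RC-filter analogy: alpha = dt / (RC + dt)
    dt = 1.0 / sample_rate_hz
    rc = 1.0 / (2.0 * math.pi * cutoff_hz)
    alpha = dt / (rc + dt)

    filtered = []
    previous = 0.0
    for x in samples:
        # Move the running output a fraction alpha toward the new sample.
        previous = previous + alpha * (x - previous)
        filtered.append(previous)
    return filtered


def augment_with_low_pass(samples, sample_rate_hz, cutoffs_hz=(3000.0, 4000.0)):
    """Return the original signal plus one filtered copy per cutoff.

    Assumption: the cutoff values are illustrative, not taken from the paper.
    """
    variants = [list(samples)]
    for cutoff in cutoffs_hz:
        variants.append(one_pole_low_pass(samples, cutoff, sample_rate_hz))
    return variants
```

Under these assumptions, each training utterance would yield the unfiltered waveform plus one band-limited copy per cutoff, all sharing the same native-language label. A steeper filter (for example a higher-order Butterworth design) would attenuate high frequencies more sharply than this one-pole sketch.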
