Abstract

In this study, we propose Mel-weighted single frequency filtering (SFF) spectrograms for dialect identification. The spectrum derived using SFF has high spectral resolution for harmonics and resonances while simultaneously maintaining good time resolution of some speech excitation features such as impulse-like events. The SFF spectrum can represent speech characteristics such as burst time and glottal closure instances better than the short-time Fourier transform (STFT) spectrum. Our hypothesis is that these intricate representations in the SFF spectrum should help in distinguishing dialects. Therefore, we built a dialect identification system which uses an unsupervised bottleneck feature representation of the Mel-weighted SFF spectrogram (Mel-SFF spectrogram) obtained with sequence-to-sequence deep autoencoders. The language invariance of the proposed system was evaluated using two datasets: the UT-Podcast database (English) and the STYRIALECT database (German). On the UT-Podcast database, the proposed representations gave relative improvements of 9.47% and 4.69% in unweighted average recall (UAR) over the best baseline method on the development and test datasets, respectively. On the STYRIALECT database, the proposed representations gave performance comparable to that of the best baseline method. In addition, the fusion of the autoencoder bottleneck features computed from the Mel-SFF and Mel-STFT spectrograms improved the overall performance, indicating that these features carry complementary information. By further analyzing the performance of the proposed representation with different utterance lengths using the UT-Podcast database, we observed that the proposed representation performed better on short utterances. The improved performance given by the Mel-weighted SFF spectrogram for recognizing dialects in both databases supports our hypothesis.
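The two evaluation quantities named in the abstract, UAR and relative improvement, can be illustrated with a minimal sketch. UAR is taken here in its usual sense, the mean of the per-class recalls, so every dialect counts equally regardless of how many utterances it has. The function names, the dialect labels, and the toy predictions below are hypothetical illustrations and are not taken from the paper.

```python
from collections import defaultdict


def unweighted_average_recall(y_true, y_pred):
    """Mean of per-class recalls (each class weighted equally)."""
    hits = defaultdict(int)    # correctly classified utterances per class
    totals = defaultdict(int)  # reference utterances per class
    for t, p in zip(y_true, y_pred):
        totals[t] += 1
        if t == p:
            hits[t] += 1
    recalls = [hits[c] / totals[c] for c in totals]
    return sum(recalls) / len(recalls)


def relative_improvement(baseline, proposed):
    """Relative improvement of `proposed` over `baseline`, in percent."""
    return 100.0 * (proposed - baseline) / baseline


# Hypothetical toy labels (not data from the paper): an imbalanced 3-class set.
y_true = ["A", "A", "A", "A", "B", "B", "C", "C"]
y_pred = ["A", "A", "A", "A", "B", "A", "A", "A"]

uar = unweighted_average_recall(y_true, y_pred)   # (4/4 + 1/2 + 0/2) / 3
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
```

In this toy case the plain accuracy is 0.625 while the UAR is 0.5, which shows why UAR is the more conservative measure when one class dominates the data.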

Highlights

  • In listening to speech, humans analyse the speech signal’s linguistic content, but they also draw conclusions about the speaker’s regional origin, social background, and emotional state

  • For the STYRIALECT database, different variants of the single frequency filtering (SFF) spectrogram computation methods described in Section II-B are first investigated to find the best approach for the proposed BNFMel-SFF system in dialect identification

  • We validate the performance of the proposed system and the fusion system with different classifiers (SVM, multi-class logistic regression (MCLR) and Gaussian linear classifier (GLC)) in comparison to the best baseline system obtained from the former analysis
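Section II-B is not reproduced on this page, so the following is only a rough sketch of the basic SFF idea as it is commonly formulated: the signal is frequency-shifted so that the frequency of interest lands at half the sampling rate, passed through a single-pole filter with its pole close to the unit circle, and the magnitude of the output is taken as the envelope at that frequency. The pole radius `r = 0.995`, the function names, and the toy signal are assumptions made for illustration; the variants compared in the paper and the Mel weighting step are not shown.

```python
import cmath
import math


def sff_envelope(x, fs, f_k, r=0.995):
    """Amplitude envelope of x at the single frequency f_k (Hz).

    Sketch only: r is an assumed pole radius, not a value from the paper.
    """
    # Shift the component at f_k to omega = pi (half the sampling rate).
    w_shift = math.pi - 2.0 * math.pi * f_k / fs
    y_prev = 0j
    envelope = []
    for n, sample in enumerate(x):
        x_k = sample * cmath.exp(1j * w_shift * n)
        # Single-pole filter H(z) = 1 / (1 + r z^-1), pole at z = -r.
        y = x_k - r * y_prev
        envelope.append(abs(y))
        y_prev = y
    return envelope


def sff_spectrogram(x, fs, freqs, r=0.995):
    """One envelope per frequency; each row has one value per sample."""
    return [sff_envelope(x, fs, f, r) for f in freqs]


# Toy check with a synthetic 1 kHz tone (hypothetical signal, not speech).
fs = 8000
tone = [math.cos(2.0 * math.pi * 1000.0 * n / fs) for n in range(800)]
freqs = [500.0, 1000.0, 2000.0]
spec = sff_spectrogram(tone, fs, freqs)

# Average envelope over the last 200 samples, after the filter has settled.
steady = [sum(row[-200:]) / 200 for row in spec]
```

Because an envelope value is produced at every sample for every chosen frequency, the time resolution is not tied to an analysis window as it is in the STFT. In the toy check, the 1000 Hz row of `steady` is far larger than the 500 Hz and 2000 Hz rows.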

Summary

Introduction

When listening to speech, humans analyse the speech signal’s linguistic content, but they also draw conclusions about the speaker’s regional origin, social background, and emotional state. Dialect identification is a research area whose goal is to determine the regional origin of a speaker from the temporal and spectral characteristics of his or her speech signal. Each dialect group has its own pronunciation patterns and vocabulary. These dialect-related variations in speech have been shown to decrease the performance of automatic speech recognition (ASR) systems. An efficient dialect identification system followed by a dialect-specific pronunciation dictionary and a dialect-specific language model can improve the performance of ASR [1]–[3]. Dialect information can also be used for speaker profiling in biometric applications, it can help solve dialect-related issues in speaker and language identification, and it can be used in the development of dialect-personalized voice assistants.
