Abstract

This paper investigates the effect of speaking rate variation on frame classification. Performance on this task is indicative of performance on phoneme and word recognition, and it is a first step towards designing voice-controlled interfaces. Different speaking rates produce different speech dynamics: for example, speaking rate variations change both formant frequencies and their transition tracks. A word spoken at a normal rate is recognized more often than the same word spoken by the same speaker at a much faster or slower pace. It is thus imperative to design interfaces that take speaking variability into account. To better incorporate this variability into digital devices, we study the effect of (a) feature selection and (b) the choice of network architecture on frame classification at variable speaking rates. Four different features are evaluated on multiple configurations of Deep Neural Network (DNN) architectures. The findings show that log Filter-Bank Energies (FBE) outperformed the other acoustic features not only at the normal speaking rate but also at slow and fast speaking rates.
