Abstract

Although speech recognition has achieved significant success using integrated and efficient models, a series of challenges remains because linguistic-acoustic patterns are perturbed by speakers' individual articulation gestures and environmental noise. Due to dynamic changes in the vocal tract cavity, word utterances yield temporal and perturbed linguistic-acoustic features, whereas vowel utterances yield less-perturbed, quasi-stationary features. Among the methods used to recognize such patterns in vowels and words, the basic feedforward neural network (NN) responds to these vocal-tract-induced variabilities and has shown promising results owing to its simple yet effective modelling of nonlinear data. We therefore present a comprehensive study of how these variabilities in acoustic features affect speech token classification performance using NNs. We chose vocal tract resonance (formant frequency) as the linguistic-acoustic feature. Our statistical evaluation of vocal-tract-induced variabilities in seven Bengali vowels and words revealed that words have more variation than vowels. We used four-fold cross-validation in an NN with the Adam optimizer to compute classification performance using five different metrics. Our experiments found that formant transitions and dispersions do not contribute to classification and that a five-hidden-layer NN is optimal. In all test cases, we justified our hypothesis that word classification falls behind vowel classification due to the variability induced by vocal tract dynamics. The optimal NN, with 28,263 trainable parameters, achieved the highest accuracy and AUC scores: 0.89 and 0.99 for vowels, and 0.64 and 0.91 for words.
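For concreteness, below is a minimal sketch of the classification pipeline the abstract describes: a feedforward NN with five hidden layers, trained with the Adam optimizer and evaluated with four-fold cross-validation. The framework (Keras), layer widths, feature dimensionality, and training settings are illustrative assumptions, not details taken from the paper, and will not reproduce the reported 28,263-parameter architecture or its scores.

```python
# Hypothetical sketch of the five-hidden-layer feedforward classifier
# described in the abstract. Layer widths, feature dimensionality, and
# epoch count are assumptions for illustration only.
import numpy as np
from sklearn.model_selection import StratifiedKFold
from tensorflow import keras

NUM_CLASSES = 7    # seven Bengali vowels (or seven words)
NUM_FEATURES = 12  # assumed length of the formant-derived feature vector


def build_model():
    """Feedforward NN: five hidden layers, softmax output, Adam optimizer."""
    model = keras.Sequential(
        [keras.layers.Input(shape=(NUM_FEATURES,))]
        + [keras.layers.Dense(64, activation="relu") for _ in range(5)]
        + [keras.layers.Dense(NUM_CLASSES, activation="softmax")]
    )
    model.compile(
        optimizer="adam",  # Adam optimizer, as stated in the abstract
        loss="sparse_categorical_crossentropy",
        metrics=["accuracy"],
    )
    return model


# Placeholder data standing in for per-token formant features and labels.
X = np.random.rand(700, NUM_FEATURES)
y = np.random.randint(0, NUM_CLASSES, 700)

# Four-fold cross-validation, as described in the abstract.
for train_idx, test_idx in StratifiedKFold(
    n_splits=4, shuffle=True, random_state=0
).split(X, y):
    model = build_model()
    model.fit(X[train_idx], y[train_idx], epochs=50, verbose=0)
    _, acc = model.evaluate(X[test_idx], y[test_idx], verbose=0)
    print(f"fold accuracy: {acc:.2f}")
```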
