Abstract

Traditionally, automatic speech recognition (ASR) systems are trained on acoustic representations of neutral speech. As a result, their performance degrades when they are tested on whispered speech. In this work, we explore the robustness of articulatory features in ASR of neutral and whispered speech. We use acoustic, articulatory, and integrated acoustic-articulatory feature vectors in matched and mismatched train-test cases. The results suggest that articulatory data is useful in ASR of both neutral and whispered speech, especially in the mismatched train-test cases. When we concatenate acoustic and articulatory feature vectors and use them in the mismatched train-test case where the model is trained on neutral speech and tested on whispered speech, we observe a relative improvement in phone error rate of 27.2% over using acoustic features alone. This suggests that articulatory data contains information complementary to the acoustic representations. A phone-specific error analysis is also presented, illustrating the phones for which adding articulatory information yields the maximum benefit.
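To make the feature integration concrete, below is a minimal sketch of frame-level concatenation of acoustic and articulatory feature vectors, together with the arithmetic behind a relative phone error rate (PER) improvement. The feature dimensions and PER values are illustrative assumptions, not taken from the paper.

```python
import numpy as np

# Hypothetical per-frame features: 13 acoustic coefficients (e.g., MFCCs)
# and 12 articulatory trajectories (e.g., EMA channels). Both dimensions
# are assumptions for illustration only.
n_frames = 200
acoustic = np.random.randn(n_frames, 13)      # acoustic feature vectors
articulatory = np.random.randn(n_frames, 12)  # articulatory feature vectors

# Frame-level integration: each frame becomes [acoustic | articulatory].
integrated = np.concatenate([acoustic, articulatory], axis=1)
print(integrated.shape)  # (200, 25)

# Relative PER improvement, computed as (baseline - new) / baseline.
# The values below are illustrative, chosen to reproduce a 27.2% figure.
per_acoustic, per_integrated = 0.500, 0.364
rel_improvement = (per_acoustic - per_integrated) / per_acoustic
print(f"{rel_improvement:.1%}")  # 27.2%
```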
