Abstract

Articulatory information can improve the performance of automatic speech recognition systems. Unfortunately, since such information is not directly observable, it must be estimated from the acoustic signal using speech-inversion techniques. Here, we first compare five different machine learning techniques for inverting the speech acoustics generated using the Haskins Laboratories speech production model in combination with HLsyn. In particular, we compare the accuracies of estimating two forms of articulatory information: (a) vocal tract constriction trajectories (tract variables) and (b) articulatory flesh-point pellet trajectories. We show that tract variable estimation can be performed more accurately than pellet estimation. Second, we show that estimated tract variables can improve the performance of an autoregressive neural network model for recognizing speech gestures. We compare gesture recognition accuracy for three different input conditions: (1) the generated acoustic signal and estimated tract variables, (2) the acoustic signal and the original (ground-truth) tract variables, and (3) the acoustic signal only. Results show that gesture recognition accuracy was, not surprisingly, best for condition (2) and worst for condition (3). Importantly, however, condition (1) yielded better performance than condition (3), demonstrating that estimated tract-variable articulatory information is indeed helpful for automatic speech recognition.
