Abstract

The task of automatically transcribing general audio data is very different from the tasks usually confronted by current automatic speech recognition systems. The general goal of our work is to determine the optimal training strategy for recognizing such data. Specifically, we have studied the effects of different speaking environments on a phonetic recognition task using data collected from a radio news program. We found that if a single recognizer is to be used, it is more effective to train it on a smaller amount of homogeneous, clean data. This approach yielded a decrease in phonetic recognition error rate of over 26% compared with a system trained on an equivalent amount of data that contained a variety of speaking environments. We also found that additional gains can be made with a multiple-recognizer system trained on environment-specific data. Overall, the multiple-recognizer approach yielded a decrease in error rate of nearly 2%, with the error rate for some individual speaking environments decreasing by over 7%.

