Abstract

In low-resource settings with only a few hours of training data, state-of-the-art speech recognition systems, which require large amounts of task-specific training data, perform very poorly. We address this issue by building data-driven speech recognition front-ends on significant amounts of task-independent data drawn from different languages and genres, collected in acoustic conditions similar to those of the low-resource scenario. We show that features derived from these trained front-ends perform significantly better and can alleviate the effect of reduced task-specific training data in low-resource settings. The proposed features provide an absolute improvement of about 12% (18% relative) in a low-resource LVCSR setting with only one hour of training data. We also demonstrate the usefulness of these features for zero-resource speech applications such as spoken term discovery, which operate without any transcribed speech to train systems. The proposed features provide significant gains over conventional acoustic features on various information retrieval metrics for this task.
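To make the idea concrete, the sketch below shows one common realization of a data-driven front-end: a bottleneck network trained to predict phoneme targets on task-independent data, whose bottleneck activations are then reused as features for the low-resource or zero-resource task. The PyTorch framing, layer sizes, feature dimensionality, and target inventory are illustrative assumptions, not the paper's exact configuration.

```python
# Minimal sketch of a bottleneck front-end, assuming task-independent
# frames with phoneme labels are available; sizes are illustrative.
import torch
import torch.nn as nn

class BottleneckFrontEnd(nn.Module):
    def __init__(self, n_input=39, n_hidden=1024, n_bottleneck=40, n_targets=120):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(n_input, n_hidden), nn.Sigmoid(),
            nn.Linear(n_hidden, n_bottleneck),   # bottleneck: the learned features
        )
        self.classifier = nn.Sequential(
            nn.Sigmoid(),
            nn.Linear(n_bottleneck, n_targets),  # phoneme targets (training only)
        )

    def forward(self, x):
        return self.classifier(self.encoder(x))

    def features(self, x):
        # After training, discard the classifier and emit bottleneck features.
        with torch.no_grad():
            return self.encoder(x)

# Train on task-independent data (stand-in tensors shown here).
model = BottleneckFrontEnd()
opt = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.CrossEntropyLoss()
frames = torch.randn(256, 39)            # stand-in for acoustic feature frames
labels = torch.randint(0, 120, (256,))   # stand-in for phoneme labels
loss = loss_fn(model(frames), labels)
opt.zero_grad(); loss.backward(); opt.step()

# Downstream, the trained front-end replaces conventional acoustic features.
feats = model.features(frames)           # (256, 40) bottleneck features
```

The key design point is that the classifier head is only a training scaffold; the bottleneck layer is what transfers, which is why the front-end can be trained once on task-independent data and then applied even where no transcribed speech exists.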
