Abstract

This paper addresses the problem of language modeling for the transcription of broadcast news data. Different approaches to language model training were explored and tested in the context of a complete transcription system. Language model performance was investigated with respect to the following aspects: mixing of different training material (sources and epochs); the approach used for mixing (interpolation vs. count merging); and the use of class-based language models. The experimental results indicate that judicious selection of the training source and epoch is important, and that given sufficient broadcast news transcriptions, newspaper and newswire texts are not necessary. Results are given in terms of perplexity and word error rates. The combined improvements in text selection, interpolation, 4-gram and class-based LMs led to a 20% reduction in the perplexity of the LM of the final pass (a class 3-gram interpolated with a word 4-gram) compared with the 3-gram LM used in the LIMSI Nov'97 BN system.
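As a rough illustration of the two mixing approaches the abstract compares, the Python sketch below contrasts linear interpolation of per-source models with merging the weighted counts before estimation, and scores both by perplexity. This is a minimal toy example, not the paper's method: it uses unsmoothed unigram models and hypothetical corpora and mixture weights, whereas a real system would use smoothed n-gram models with weights tuned on held-out data.

```python
# Toy contrast of the two mixing strategies: interpolation vs. count merging.
# All corpus data and weights below are hypothetical illustrations.

from collections import Counter
import math

def unigram_probs(counts):
    """Maximum-likelihood unigram estimates from a count table."""
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

def interpolate(models, weights, vocab):
    """Linear interpolation: P(w) = sum_i lambda_i * P_i(w)."""
    return {w: sum(lam * m.get(w, 0.0) for lam, m in zip(weights, models))
            for w in vocab}

def count_merge(count_tables, weights):
    """Count merging: pool the (weighted) counts, then estimate once."""
    merged = Counter()
    for lam, counts in zip(weights, count_tables):
        for w, c in counts.items():
            merged[w] += lam * c
    return unigram_probs(merged)

def perplexity(model, test_tokens):
    """Perplexity = 2 ** (-(1/N) * sum log2 P(w)); lower is better."""
    log_prob = sum(math.log2(model[w]) for w in test_tokens)
    return 2 ** (-log_prob / len(test_tokens))

# Toy "corpora" standing in for, e.g., BN transcripts and newspaper text.
bn_counts = Counter("the news tonight the news report".split())
paper_counts = Counter("the market report the report".split())
vocab = set(bn_counts) | set(paper_counts)

models = [unigram_probs(bn_counts), unigram_probs(paper_counts)]
weights = [0.7, 0.3]  # hypothetical mixture weights, tuned on held-out data

test = "the news report".split()
print("interpolated PPL:",
      perplexity(interpolate(models, weights, vocab), test))
print("count-merged  PPL:",
      perplexity(count_merge([bn_counts, paper_counts], weights), test))
```

The design difference the sketch exposes: interpolation combines already-normalized distributions, so each source keeps its own estimate before weighting, while count merging pools the raw evidence and estimates once, letting a large corpus dominate unless the weights compensate.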
