Abstract

This paper describes our study on developing the text and speech databases for automatic speech recognition of Vietnamese using an available source of linguistic data: the Internet. First, a two-stage procedure is applied to extract a general text corpus which can be used for researches on Vietnamese language such as speech recognition, audio-visual speech recognition, and natural language processing... We also collect another specific text corpus in the field of news and literature using the resource from some main Web sites of Vietnamese. The total text corpus containing 8,681,869 sentences with more than 124 million syllables is then used to build and test the language model for the speech recognizer. Besides, the collecting of speech corpora for experiments on continuous speech recognition and audio-visual speech recognition of Vietnamese are also described.

Full Text
Paper version not known

Talk to us

Join us for a 30 min session where you can share your feedback and ask us any queries you have

Schedule a call

Disclaimer: All third-party content on this website/platform is and will remain the property of their respective owners and is provided on "as is" basis without any warranties, express or implied. Use of third-party content does not indicate any affiliation, sponsorship with or endorsement by them. Any references to third-party content is to identify the corresponding services and shall be considered fair use under The CopyrightLaw.