Abstract

This paper investigates methods for modeling inter-word and inter-phrase context in continuous Japanese speech recognition. It is well known that in continuous speech, coarticulation between words or phrases induces allophonic variation in the initial and final phones of words or phrases. We found that by compiling a network of context-dependent phonetic models that captures this inter-word and inter-phrase context, recognition errors can be reduced by 32% relative to models that ignore inter-word context, under task-dependent training, i.e., with models trained on the same vocabulary as the test set. A more dramatic error reduction of up to 43% was possible with task-independent training. However, modeling inter-word context significantly increases the number of phonetic models required to cover the vocabulary; for digit models, the increase is four- to fivefold. To overcome this increase, we clustered the inter-word/inter-phrase contexts into a few phonetic classes. Using one class for consonant inter-word contexts and two classes for vowel contexts, recognition accuracy on digit strings was virtually equal to that of the unclustered models, while the number of phonetic models required was reduced by more than 50%.
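To make the clustering idea concrete, the following is a minimal sketch, not taken from the paper, of how inter-word context phones might be collapsed into a few phonetic classes. The phone inventory, the consonant/vowel split, and the function names here are all illustrative assumptions; the paper does not publish its class definitions or model inventory.

```python
# Minimal sketch (assumed, not from the paper) of collapsing inter-word
# contexts into a few phonetic classes, as the abstract describes:
# one class for consonant contexts, two for vowel contexts.

# Hypothetical phones that can appear across a word boundary in Japanese.
CONSONANT_CONTEXTS = {"k", "s", "t", "n", "h", "m", "r", "g", "z", "d", "b"}
VOWEL_CONTEXTS = {"a", "i", "u", "e", "o"}

def context_class(phone: str) -> str:
    """Map an inter-word context phone to a cluster label.

    The single consonant class and the two-way vowel split below are
    assumptions for illustration only.
    """
    if phone in CONSONANT_CONTEXTS:
        return "C"            # single class for all consonant contexts
    if phone in {"i", "e"}:
        return "V_front"      # assumed vowel class 1
    if phone in {"a", "o", "u"}:
        return "V_back"       # assumed vowel class 2
    return "SIL"              # silence / no cross-word context

# Without clustering, each word-boundary phone needs a distinct model per
# neighbouring phone; with clustering, one model per context class.
all_contexts = CONSONANT_CONTEXTS | VOWEL_CONTEXTS | {"#"}   # "#" = silence
unclustered = len(all_contexts)
clustered = len({context_class(p) for p in all_contexts})
print(f"context variants per boundary phone: {unclustered} -> {clustered}")
```

Under these assumptions the per-boundary context variants drop from 17 to 4, which illustrates how clustering can cut the model inventory by more than half, in line with the reduction reported in the abstract.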

