Abstract

Speech recognizer confusion is predicted for utterances that can be represented by any combination of text form and audio file. The utterances are represented with an intermediate representation that directly reflects their acoustic characteristics. Text representations of the utterances can be used directly to predict confusability, without access to audio file examples of the utterances. In a first embodiment, two text utterances are each represented as a string of phonemes, and the least cost of transforming one string of phonemes into the other serves as the confusability measure. In a second embodiment, two utterances are represented with an intermediate representation of sequences of acoustic events based on phonetic capabilities of speakers obtained from acoustic signals of the utterances, and the acoustic events are compared. Confusability of the utterances is predicted according to the formula 2K/T, where K is the number of matched acoustic events and T is the total number of acoustic events.
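The two measures can be illustrated with a minimal Python sketch. It is not the implementation described in the patent, and several details are assumptions because the abstract does not specify them: the transformation cost uses uniform insertion, deletion, and substitution costs (a standard edit distance); K is taken as the length of the longest common subsequence of the two event sequences; and T is taken as the combined length of both sequences, so that identical sequences score 1.0. The function names, the ARPAbet-style phoneme symbols, and the event labels are hypothetical.

```python
from typing import Sequence


def phoneme_transform_cost(
    a: Sequence[str],
    b: Sequence[str],
    sub_cost: float = 1.0,    # assumed uniform substitution cost
    indel_cost: float = 1.0,  # assumed uniform insertion/deletion cost
) -> float:
    """Least cost of transforming phoneme string a into phoneme string b."""
    # prev[j] holds the cost of turning the processed prefix of a into b[:j].
    prev = [j * indel_cost for j in range(len(b) + 1)]
    for i, pa in enumerate(a, 1):
        curr = [i * indel_cost]
        for j, pb in enumerate(b, 1):
            substitute = prev[j - 1] + (0.0 if pa == pb else sub_cost)
            delete = prev[j] + indel_cost
            insert = curr[j - 1] + indel_cost
            curr.append(min(substitute, delete, insert))
        prev = curr
    return prev[-1]


def matched_events(a: Sequence[str], b: Sequence[str]) -> int:
    """K: matched acoustic events, assumed here to be the LCS length."""
    prev = [0] * (len(b) + 1)
    for ea in a:
        curr = [0]
        for j, eb in enumerate(b, 1):
            if ea == eb:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def event_confusability(a: Sequence[str], b: Sequence[str]) -> float:
    """Confusability 2K/T, with T assumed to count events in both sequences."""
    total = len(a) + len(b)
    if total == 0:
        return 0.0  # convention chosen for this sketch; not from the abstract
    return 2 * matched_events(a, b) / total


if __name__ == "__main__":
    # "cat" vs. "bat": a lower cost suggests higher confusability.
    print(phoneme_transform_cost(["K", "AE", "T"], ["B", "AE", "T"]))
    # Hypothetical acoustic event labels for two utterances.
    print(event_confusability(
        ["burst", "voicing", "frication"],
        ["burst", "frication"],
    ))
```

Under these assumptions the two measures run in opposite directions: a lower transformation cost indicates that two utterances are more likely to be confused, while 2K/T grows toward 1.0 as more acoustic events match.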
