Abstract

{utsuro,tetsu}@cl.ics.tut.ac.jp, {nisizaki,nakagawa}@slp.ics.tut.ac.jp

ABSTRACT

For many practical applications of speech recognition systems, it is quite desirable to have an estimate of confidence for each hypothesized word. Unlike previous work on confidence measures, this paper studies features for confidence measures that are extracted from the outputs of more than one LVCSR model. More specifically, this paper experimentally evaluates the agreement among the outputs of multiple Japanese LVCSR models, with respect to whether it is effective as an estimate of confidence for each hypothesized word. The results of the experimental evaluation show that the agreement between the outputs of two LVCSR models with different decoders and acoustic models can achieve quite reliable confidence. Furthermore, among various features of acoustic models based on Gaussian mixture HMMs, it is concluded that features such as whether or not short pause models are used, as well as the unit of the HMMs (e.g., triphone model or syllable model), are the most effective in achieving highly reliable confidence.

1. INTRODUCTION

Since current speech recognizers' outputs are far from perfect and always include a certain amount of recognition errors, it is quite desirable to have an estimate of confidence for each hypothesized word. This is especially true for many practical applications of speech recognition systems such as word selection for unsupervised adaptation schemes, automatic weighting of additional, non-speech knowledge sources, keyword based speech understanding, and recognition error rejection – confirmation in spoken dialogue systems.

Most previous work on confidence measures (e.g., [1, 2]) is based on features available in a single LVCSR model. However, it is well known that a voting scheme such as ROVER (Recognizer Output Voting Error Reduction) for combining multiple speech recognizers' outputs can achieve word error reduction [3, 4].
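To make the voting idea concrete, the following is a minimal sketch, not ROVER itself and not the method evaluated in this paper. It assumes the recognizers' outputs have already been aligned into word slots of equal length (ROVER builds that alignment itself, which is omitted here). The function name `vote_aligned`, the use of `None` for an empty slot, and the example words are all hypothetical. For each slot it returns the majority word together with the share of recognizers that agree on it, which can be read as a crude agreement-based confidence.

```python
from collections import Counter


def vote_aligned(hypotheses):
    """Majority vote over word slots that are assumed to be pre-aligned.

    hypotheses: list of equal-length word lists, one per recognizer;
    None marks a slot where a recognizer produced no word.
    Returns a list of (word, agreement) pairs, where agreement is the
    fraction of recognizers that output the winning word in that slot.
    Ties are resolved in favour of the earliest recognizer in the list.
    """
    if not hypotheses:
        return []
    n_slots = len(hypotheses[0])
    if any(len(h) != n_slots for h in hypotheses):
        raise ValueError("hypotheses must be pre-aligned to equal length")

    result = []
    for slot in zip(*hypotheses):
        # Count how many recognizers proposed each word in this slot.
        word, count = Counter(slot).most_common(1)[0]
        result.append((word, count / len(hypotheses)))
    return result


# Hypothetical, pre-aligned outputs of three recognizers.
outputs = [
    ["kyou", "wa", "hare"],
    ["kyou", "ga", "hare"],
    ["kyou", "wa", None],
]
for word, agreement in vote_aligned(outputs):
    print(word, round(agreement, 2))
```

Words on which all recognizers agree receive an agreement of 1.0, while words supported by only some of them receive a lower value; a threshold on this value would be one simple way to accept or reject a hypothesized word.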
Considering the success of a simple voting scheme such as ROVER, it also seems quite possible to improve the reliability of previously studied features for confidence measures by simply exploiting the outputs of more than one speech recognizer. From this observation, unlike those previous works on confidence measures, this paper studies features for confidence measures that are extracted from the outputs of more than one LVCSR model.

For the purpose of estimating confidence for each hypothesized word, it is more important to examine which combinations of existing LVCSR models can achieve high confidence and which cannot, although even simple voting schemes can achieve word error reduction. Therefore, in this paper, we experimentally evaluate the agreement among the outputs of multiple Japanese LVCSR models, with respect to whether it is effective as an estimate of confidence for each hypothesized word. In this evaluation of existing Japanese LVCSR models, we concentrate on evaluating the confidence of the agreement among outputs obtained with different decoders and/or different acoustic models. The results of the experimental evaluation show that the agreement between the outputs of two LVCSR models with different decoders and acoustic models can achieve quite reliable confidence. Furthermore, among various features of acoustic models based on Gaussian mixture HMMs, it is concluded that features such as whether or not short pause models are used, as well as the unit of the HMMs (e.g., triphone model or syllable model), are the most effective in achieving highly reliable confidence. It is also shown that it is better to combine various features, including the most effective ones, than to use one of the most effective features alone.

2. SPECIFICATION OF JAPANESE LVCSR SYSTEMS

2.1. Decoders

As the decoders of the Japanese LVCSR systems, we use Julius, which is provided by the IPA Japanese dictation free software project [5], as well as SPOJUS [6], which has been developed in our laboratory. Both decoders are composed of two decoding passes, where the first pass uses a word bigram and the second pass uses a word trigram. Julius uses word-trellis search and hence has a much broader search space than SPOJUS, which uses N-best search.

2.2. Acoustic Models

The acoustic models of the Japanese LVCSR systems are based on Gaussian mixture HMMs. We evaluate phoneme-based HMMs as well as syllable-based HMMs.

2.2.1. Acoustic Models with the Decoder J
