Abstract

Speech-recognition studies rarely use more than a single metric to evaluate recognition performance, usually percent correct (or percent wrong). Such a uni-dimensional evaluation may conceal more than it reveals. An alternative, based on information theory, offers greater insight into the brain (and computational) processes associated with human and machine speech recognition. In this presentation, we examine errors associated with phonetic-segment recognition in human listeners and compare them with those committed by automatic speech-recognition (ASR) systems. Consonant errors are decomposed into the phonetic features of voicing (VOICING), place of articulation (PLACE), and manner of articulation (MANNER). For both humans and machines, PLACE information is far more vulnerable to distortion and interference than MANNER and VOICING, yet is more important for consonant and lexical recognition than the other features. Moreover, PLACE is decoded only after VOICING and MANNER and is more challenging for machines to recognize accurately. The origins of these differences can be traced, in part, to the redundancy with which this information is distributed in the acoustic signal, as well as to how the phonetic information is combined across the frequency spectrum. For these reasons, ASR performance could benefit from including phonetic-feature-based information in lexical representations. [Work supported by AFOSR and the Technical University of Denmark.]
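
The abstract does not include a worked example, but the information-theoretic analysis it refers to is commonly carried out in the style of Miller and Nicely: a consonant confusion matrix is collapsed onto each phonetic feature, and the relative information transmitted for that feature is computed. The sketch below illustrates this under stated assumptions; the four-consonant set, the confusion counts, and the feature assignments are purely illustrative and are not data from the study.

```python
import numpy as np

# Hypothetical consonant confusion matrix (rows = stimuli, columns = responses)
# for four consonants /p, b, t, d/. Counts are illustrative only.
consonants = ["p", "b", "t", "d"]
confusions = np.array([
    [50,  5, 10,  2],
    [ 4, 48,  3, 12],
    [ 9,  2, 52,  4],
    [ 3, 11,  5, 49],
], dtype=float)

# Illustrative feature assignments for each consonant.
features = {
    "VOICING": ["voiceless", "voiced", "voiceless", "voiced"],
    "PLACE":   ["labial",    "labial", "alveolar",  "alveolar"],
}

def relative_information_transmitted(counts, labels):
    """Collapse the confusion matrix onto one feature and return the
    relative information transmitted, T(x;y) / H(x)."""
    classes = sorted(set(labels))
    idx = {c: i for i, c in enumerate(classes)}
    # Sum together rows/columns whose consonants share the same feature value.
    collapsed = np.zeros((len(classes), len(classes)))
    for i, li in enumerate(labels):
        for j, lj in enumerate(labels):
            collapsed[idx[li], idx[lj]] += counts[i, j]
    p = collapsed / collapsed.sum()
    px = p.sum(axis=1)            # stimulus (input) distribution
    py = p.sum(axis=0)            # response (output) distribution
    mask = p > 0
    # Mutual information T(x;y) in bits.
    t = np.sum(p[mask] * np.log2(p[mask] / (px[:, None] * py[None, :])[mask]))
    hx = -np.sum(px[px > 0] * np.log2(px[px > 0]))   # stimulus entropy
    return t / hx

for name, labels in features.items():
    print(f"{name}: {relative_information_transmitted(confusions, labels):.2f}")
```

With a matrix like the one above, a feature whose categories are rarely confused (e.g., VOICING) yields a relative transmission near 1.0, while a feature whose categories are often confused (e.g., PLACE) yields a lower value, which is the kind of contrast the abstract describes.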
