Abstract

We have recently developed a new model of human speech recognition, based on automatic speech recognition techniques [1]. The present paper has two goals. First, we show that the new model performs well in the recognition of lexically ambiguous input. These demonstrations suggest that the model is able to operate in the same optimal way as human listeners. Second, we discuss how to relate the behaviour of a recogniser designed to discover the optimum path through a word lattice to data from human listening experiments. We argue that this requires a metric that combines both path-based and word-based measures of recognition performance. The combined metric varies continuously as the input speech signal unfolds over time.

The SPEech-based Model of human speech recognition (SpeM [1]) is based on procedures and techniques used in automatic speech recognition (ASR), but attempts to account for the performance of human listeners. SpeM therefore implements the same core theoretical assumptions about human speech recognition (HSR) as the HSR model Shortlist [2,3]. SpeM is an advance on Shortlist in at least two ways (see [1] for further details). First, SpeM can take real speech as input, while the input to Shortlist consists of an error-free string of discrete phonemes. Second, SpeM can deal with the pronunciation variants in real speech caused by processes such as insertion and deletion. The lexical search in Shortlist cannot handle a mismatch between the number of phones in the input and the number of phones in the canonical pronunciations stored in the Shortlist lexicon.

In the present paper, we show that SpeM is able to account for key aspects of human listening ability. We compare its performance to that of the Shortlist model and show that SpeM, like Shortlist before it, can recognise the words in stretches of speech that are lexically ambiguous.
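The idea of discovering the optimum path through a word lattice can be illustrated with a minimal dynamic-programming sketch. The lattice, the node numbering, and all scores below are hypothetical toy data, not SpeM's actual search or lexicon; the sketch only shows the general technique of picking the highest-scoring word sequence from a set of competing lattice edges.

```python
# Hypothetical word lattice: edges (from_node, to_node, word, log_score).
# Node 0 is the lattice start, node 3 the end; scores are illustrative.
edges = [
    (0, 1, "ship", -1.2), (0, 1, "sheep", -1.5),
    (1, 2, "in", -0.4),   (1, 3, "inquiry", -2.0),
    (2, 3, "acquire", -1.1),
]

def best_path(edges, start, end):
    """Dynamic-programming search for the highest-scoring word sequence.

    Assumes node numbers follow topological order, so processing edges
    sorted by start node visits each node after all its predecessors."""
    best = {start: (0.0, [])}  # node -> (best score so far, word sequence)
    for u, v, word, score in sorted(edges):
        if u in best:
            cand = (best[u][0] + score, best[u][1] + [word])
            if v not in best or cand[0] > best[v][0]:
                best[v] = cand
    return best[end]

score, words = best_path(edges, 0, 3)  # -> (-2.7, ['ship', 'in', 'acquire'])
```

The path "ship in acquire" wins here even though "inquiry" spans the same stretch of input, which is exactly the kind of lexically ambiguous competition discussed above.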
Most data on human spoken word recognition involves measures of how quickly or accurately words can be identified. A central requirement of any model of human speech recognition is therefore that it should be able to provide a continuous measure (usually referred to as ‘activation’ in the psychological literature) of how easy each word will be for listeners to identify. We address the problem of relating the performance of a path-based model of continuous speech recognition to word-based data from psycholinguistic experiments.
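One way such a combined metric might look can be sketched as follows. The snapshot of competing paths, the bottom-up word scores, and the weighting parameter `alpha` are all illustrative assumptions, not the metric actually used in SpeM: the sketch only shows how a path-based measure (the share of path probability mass carried by paths containing a word) and a word-based measure (the word's own normalised bottom-up score) could be mixed into a single activation value.

```python
import math

# Hypothetical snapshot of a recogniser's state at one point in time.
# paths: competing word sequences with path log-scores (path-based measure).
# bottom_up: each word's own acoustic log-score (word-based measure).
paths = [
    (-2.7, ["ship", "in", "acquire"]),
    (-3.2, ["ship", "inquiry"]),
    (-3.5, ["sheep", "inquiry"]),
]
bottom_up = {"ship": -1.2, "sheep": -1.5, "in": -0.4,
             "inquiry": -2.0, "acquire": -1.1}

def activation(word, paths, bottom_up, alpha=0.5):
    """Illustrative combined activation in [0, 1]: a weighted mix of
    (a) the probability mass of paths containing the word and
    (b) the word's normalised bottom-up score."""
    total = sum(math.exp(s) for s, _ in paths)
    path_term = sum(math.exp(s) for s, ws in paths if word in ws) / total
    word_term = (math.exp(bottom_up[word])
                 / sum(math.exp(v) for v in bottom_up.values()))
    return alpha * path_term + (1 - alpha) * word_term
```

Recomputed after every input frame as paths are extended, rescored, and pruned, a quantity of this kind rises and falls continuously as the speech signal unfolds, which is the property a word-activation measure needs in order to be compared with reaction-time data.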
