Abstract

There is widespread interest in the relationship between the neurobiological systems supporting human cognition and emerging computational systems capable of emulating these capacities. Human speech comprehension, poorly understood as a neurobiological process, is an important case in point. Automatic Speech Recognition (ASR) systems with near-human levels of performance are now available, which provide a computationally explicit solution for the recognition of words in continuous speech. This research aims to bridge the gap between speech recognition processes in humans and machines, using novel multivariate techniques to compare incremental ‘machine states’, generated as the ASR analysis progresses over time, to the incremental ‘brain states’, measured using combined electro- and magneto-encephalography (EMEG), generated as the same inputs are heard by human listeners. This direct comparison of dynamic human and machine internal states, as they respond to the same incrementally delivered sensory input, revealed a significant correspondence between neural response patterns in human superior temporal cortex and the structural properties of ASR-derived phonetic models. Spatially coherent patches in human temporal cortex responded selectively to individual phonetic features defined on the basis of machine-extracted regularities in the speech to lexicon mapping process. These results demonstrate the feasibility of relating human and ASR solutions to the problem of speech recognition, and suggest the potential for further studies relating complex neural computations in human speech comprehension to the rapidly evolving ASR systems that address the same problem domain.
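The multivariate comparison described above is in the spirit of representational similarity analysis: at each moment in time, the pattern of dissimilarities among speech items in the neural data is correlated with the corresponding dissimilarity structure in the ASR's internal states. The study's actual pipeline is more involved; the sketch below is a minimal illustration of the idea, with all array shapes, names, and the toy random data assumed for exposition only.

```python
import numpy as np

def rdm(patterns):
    """Representational dissimilarity matrix: 1 - Pearson correlation
    between every pair of condition patterns (rows)."""
    return 1.0 - np.corrcoef(patterns)

def upper(m):
    """Flatten the upper triangle of a square matrix, excluding the diagonal."""
    i, j = np.triu_indices(m.shape[0], k=1)
    return m[i, j]

def spearman(a, b):
    """Simple rank correlation (no tie correction) between two vectors."""
    ra = np.argsort(np.argsort(a))
    rb = np.argsort(np.argsort(b))
    return np.corrcoef(ra, rb)[0, 1]

def rsa_timecourse(brain, machine):
    """Correlate the brain RDM at each time point with a fixed machine RDM.
    brain: (time, conditions, sensors); machine: (conditions, features)."""
    model = upper(rdm(machine))
    return np.array([spearman(upper(rdm(brain[t])), model)
                     for t in range(brain.shape[0])])

# Toy demonstration with random data: 8 speech items, 20 sensors, 5 time points.
rng = np.random.default_rng(0)
brain = rng.normal(size=(5, 8, 20))
machine = rng.normal(size=(8, 12))
rho = rsa_timecourse(brain, machine)
print(rho.shape)  # one model-fit value per time point: (5,)
```

Repeating this comparison across time windows and cortical locations yields the kind of spatiotemporal map of brain-machine correspondence that the abstract reports.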

Highlights

  • A fundamental concern in the human sciences is to relate the study of the neurobiological systems supporting complex human cognitive functions to the development of computational systems capable of emulating or even surpassing these capacities

  • Both human and machine systems may share a representation of these regularities in terms of articulatory phonetic features, consistent with an analysis process which recovers the articulatory gestures that generated the speech. These results suggest a possible partnership between human- and machine-based research, which may both deliver a better understanding of how the human brain provides such a robust solution to speech understanding and generate insights that enhance the performance of future Automatic Speech Recognition (ASR) systems

  • The finding that the commonalities between human and machine systems can be characterised in terms of a biologically plausible low-dimensional analysis, based on articulatory phonetic features, argues for an increased focus on the neurobiological and computational substrate for such an analysis strategy
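The low-dimensional articulatory analysis mentioned in the highlights can be made concrete with a toy encoding: each phone is represented as a short binary vector over articulatory features, so that phonetic similarity reduces to counting shared features. The feature inventory and phone set below are illustrative only, not the feature system derived in the study.

```python
# Illustrative articulatory feature inventory (NOT the study's actual feature set).
FEATURES = ["voiced", "bilabial", "alveolar", "nasal", "plosive", "fricative"]

# A few English phones encoded as binary vectors over the features above.
PHONES = {
    "p": [0, 1, 0, 0, 1, 0],
    "b": [1, 1, 0, 0, 1, 0],
    "m": [1, 1, 0, 1, 0, 0],
    "t": [0, 0, 1, 0, 1, 0],
    "d": [1, 0, 1, 0, 1, 0],
    "s": [0, 0, 1, 0, 0, 1],
    "z": [1, 0, 1, 0, 0, 1],
}

def shared_features(a, b):
    """Count the articulatory features on which two phones agree."""
    return sum(x == y for x, y in zip(PHONES[a], PHONES[b]))

print(shared_features("p", "b"))  # differ only in voicing -> 5
print(shared_features("p", "z"))  # agree only in nasality -> 1
```

Under such an encoding, the high-dimensional speech-to-lexicon mapping collapses onto a handful of interpretable dimensions, which is what makes the analysis biologically plausible to test against spatially coherent cortical responses.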


Introduction

A fundamental concern in the human sciences is to relate the study of the neurobiological systems supporting complex human cognitive functions to the development of computational systems capable of emulating or even surpassing these capacities. Spoken language comprehension is a salient domain that depends on the capacity to recognise fluent speech, decoding word identities and their meanings from a stream of rapidly varying auditory input. In humans, these capacities depend on a highly dynamic set of electrophysiological processes in speech- and language-related brain areas. These processes extract salient phonetic cues which are mapped onto abstract word identities as a basis for linguistic interpretation. The rapid, parallel development of Automatic Speech Recognition (ASR) systems, with near-human levels of performance, means that computationally specific solutions to the speech recognition problem are emerging, built primarily for the goal of optimising accuracy, with little reference to potential neurobiological constraints.
