Building an ASR system for noisy environments: SRI’s 2001 SPINE evaluation system

Venkata Ramana Rao Gadde, Anand Venkataraman, Dimitra Vergyri, Kemal Sönmez, Jing Zheng, Andreas Stolcke

doi:10.21437/icslp.2002-46

Abstract

We describe SRI’s recognition system as used in the 2001 DARPA Speech in Noisy Environments (SPINE) evaluation. The SPINE task involves recognition of speech in simulated military environments. The task had some unique challenges, including segmentation of foreground speech from noisy background, the need for robust acoustic models to handle noisy speech, and development of language models from limited training data. In developing the SRI evaluation system for this task, we addressed each of these challenges using a combination of state-of-the-art techniques, including several types of feature normalization, model adaptation, class-based language modeling, multi-pass segmentation and recognition, and word posterior-based decoding and system combination.

We describe SRI’s evaluation system for the October 2001 Speech in Noisy Environments (SPINE) task. The main aim of the paper is to present the key algorithms and components of a state-of-the-art speech recognition system and how they were combined into a system for optimal performance. We have organized the paper as follows. First, we provide a brief introduction to the SPINE task and its challenges. We then present the key components and algorithms used in our system and how the task features guided the design of the system. Next, we present the results on two test sets, the dry run and the evaluation sets of the 2001 SPINE evaluation. We conclude with a discussion of the results.

SPINE is a relatively new task developed by the Naval Research Laboratory (NRL) to test the state of the art of speech recognition in military noise environments. The primary challenge of the task is recognition of speech with significant amounts of background noise as found in various military environments, such as fighter jet cockpits and aircraft carrier flight decks. The data consists of dialogs between two participants playing a battleship-like game, with recorded military noises played back into the recording environment. The players use realistic microphones and headgear (e.g., fighter pilot helmets) as appropriate for the different scenarios. The language used comprises a mix of commands, status reports, and confirmations specific to this limited domain, involving an active vocabulary of about 2000 words. More details are available at [1].

Due to its focus on noisy environments, SPINE posed some unique challenges. One of the difficulties was to segment foreground speech from the noisy background, which, in some environments, included background speech. Another challenge was to develop robust acoustic features, models, and techniques capable of recognizing the noise-degraded speech. Yet another challenge was the limited amount of training data, particularly for training the language model. Thus, the task posed challenges not only for research, but also for system development. To solve these issues of segmentation and robust acoustic and language modeling, we drew on a number of state-of-the-art algorithms as the building blocks of our system. In the following, we describe these components and how they were integrated into a system that achieved the lowest word error rate (WER) in the 2001 SPINE evaluation.

Full Text