Abstract

As speech recognition systems become more accurate, they are used for more diverse applications. These applications often involve populations who have never used a recogniser before and for whom the standard data for adult male, adult female, or mixed adult speech is not very representative. This paper deals with issues concerning the collection and processing of data from those new speaker populations and from speakers of different languages. It draws on data collected for various projects, such as the KIDS database [1] and the Diplomat project [2]. It specifically discusses issues related to obtaining quantitatively and qualitatively sufficient amounts of speech from diverse speaker populations. Since the speech of these individuals is very different from the speech collected in the past, we assume that some hand labelling may be necessary and therefore also address the issue of improving the labelling process.

1. ADAPTATION TO NEW APPLICATIONS

As speech recognition systems become more accurate, they are ported to more diverse applications. Changing domains involves changes at many levels of processing. Data obtained in the past has ranged from large populations of speakers carefully reading relatively small amounts of text (TIMIT), through smaller populations reading larger amounts of text in a defined application domain (DARPA RM) and heavily constrained, but not read, speech from a relatively small population (ATIS), to more spontaneous speech in a less constrained domain from a fairly small number of speakers (Broadcast News). When a new application is defined, large amounts of speech data typical of that type of variability are collected for training. The speakers have generally been adult native speakers.

As the data for automatic speech recognizers (ASRs) has changed, each newly defined hurdle has revealed new data-gathering issues. Some of the issues in Broadcast News concerned obtaining the broadcast signal and choosing a subset of all that is broadcast.
Once the signal was recorded, other issues surfaced, such as segmenting the signal into usable chunks. With new populations of users, such as children, still other issues have come up. We hope that the information drawn from our new populations will help the reader prepare to deal with yet other populations in the future and anticipate issues that have not yet been encountered.

The increase in the amounts of data needed for training requires better processing methodologies. To address part of this issue, we will also discuss a new approach to data labelling.

1.1. Description of the projects and their data

The few applications of ASRs that presently have children as users have little or no children’s speech data at their disposal. Instead, like Project LISTEN at Carnegie Mellon University [3], they have had to use adult female speech models. In order to furnish more appropriate data, the KIDS database recorded 76 children. Since Project LISTEN aims at helping children learn to read, the data consists of text read aloud. There were two populations of speakers. First, a population of good readers (SUM95) was recorded in order to obtain as much speech data as possible. Then, children from a school where reading scores are especially low were recorded (FP) in order to get data representative of local dialect and reading hesitations.

The DIPLOMAT project [4] is designed to test the feasibility of rapid-deployment, wearable speech translation systems. This means developing a machine translation system that performs initial translations at a useful level of quality between a new language and English within a matter of days or weeks, with continual, graceful improvement to a good level of quality over a period of months. A potential use for DIPLOMAT is to allow English-speaking soldiers on peacekeeping missions to interview local residents. So far, DIPLOMAT has worked with Serbo-Croatian, Creole, and Korean.
Since rapid deployment is central to the project, read speech is used. It is faster and less labor-intensive to develop than spontaneous speech. At present, there are 13 speakers of Haitian Creole (hereafter, Creole; 10 male, 3 female) with 99 to 231 sentences each. For Korean there are 8 speakers (5 male, 3 female) with 118 to 180 sentences each. Recordings are still underway in both languages.

2. NEW SPEAKER POPULATIONS

We group our observations of new populations according to assumptions researchers made in the past. We examine why these assumptions no longer hold and note how we dealt with them.
