Abstract

In many under-resourced languages it is possible to find text, and it is possible to find speech, but transcribed speech suitable for training automatic speech recognition (ASR) is unavailable. In the absence of native transcripts, this paper proposes the use of a probabilistic transcript: a probability mass function over possible phonetic transcripts of the waveform. Three sources of probabilistic transcripts are demonstrated. First, self-training is a well-established semi-supervised learning technique, in which a cross-lingual ASR first labels unlabeled speech and is then adapted using the same labels. Second, mismatched crowdsourcing is a recent technique in which nonspeakers of the language are asked to write what they hear, and their nonsense transcripts are decoded using noisy-channel models of second-language speech perception. Third, EEG distribution coding is a new technique in which nonspeakers of the language listen to it, and their electrocortical response signals are interpreted to indicate probabilities. ASR was trained in four languages without native transcripts. Adaptation using mismatched crowdsourcing significantly outperformed self-training, and both significantly outperformed a cross-lingual baseline. Both EEG distribution coding and text-derived phone language models were shown to improve the quality of probabilistic transcripts derived from mismatched crowdsourcing.
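The two central objects in the abstract can be made concrete with a small sketch: a probabilistic transcript is a probability mass function over candidate phone strings for one utterance, and rescoring it with a text-derived phone language model multiplies each candidate's mass by its LM probability and renormalizes. All names and numbers below are toy illustrations, not the paper's trained models.

```python
def rescore(transcript_pmf, phone_lm):
    """Combine a probabilistic transcript with a phone language model.

    transcript_pmf: dict mapping phone-string tuples to probabilities.
    phone_lm: function scoring a phone string under the language model.
    Returns a renormalized pmf over the same candidates.
    """
    scored = {phones: p * phone_lm(phones)
              for phones, p in transcript_pmf.items()}
    total = sum(scored.values())
    return {phones: s / total for phones, s in scored.items()}

# Toy probabilistic transcript: two candidate phone strings for one
# utterance, with mass reflecting acoustic/perceptual evidence.
pmf = {("d", "a", "t"): 0.6, ("d", "a", "d"): 0.4}

# Hypothetical unigram phone LM (illustrative values only).
def phone_lm(phones):
    unigram = {"d": 0.2, "a": 0.4, "t": 0.4}
    p = 1.0
    for ph in phones:
        p *= unigram.get(ph, 1e-6)
    return p

new_pmf = rescore(pmf, phone_lm)
```

Here the phone LM shifts mass toward the candidate ending in /t/, illustrating how text in the target language can sharpen a probabilistic transcript even when no native transcribers are available.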

Highlights

  • Automatic speech recognition (ASR) has the potential to provide database access, simultaneous translation, and text/voice messaging services to anybody, in any language, dramatically reducing linguistic barriers to economic success

  • Instead of recruiting native transcribers in search of a perfect reference transcript, this paper proposes the use of probabilistic transcripts

  • In order to evaluate the effectiveness of the EEG-induced misperception transducer, we examined the LPER of mismatched crowdsourcing for Dutch when performed using 1) a multilingual misperception model ρ(λ|φ)
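The misperception model ρ(λ|φ) named above is the likelihood term of a noisy-channel decoder: given the nonsense letters λ that a nonspeaker wrote down, the posterior over phone strings φ is proportional to ρ(λ|φ)·P(φ). The sketch below uses toy confusion probabilities and a one-to-one letter/phone alignment; the paper's actual models are trained transducers, so every value here is a hypothetical placeholder.

```python
# Toy per-symbol misperception table: P(written letter | spoken phone).
# Illustrative values only; a real model is estimated from data.
CONFUSION = {
    ("d", "d"): 0.8, ("d", "t"): 0.3,
    ("a", "a"): 0.9,
}

def rho(lam, phi):
    """Misperception likelihood rho(lam | phi), assuming symbols align 1:1."""
    p = 1.0
    for letter, phone in zip(lam, phi):
        p *= CONFUSION.get((letter, phone), 1e-6)
    return p

def decode(lam, candidates, prior):
    """Posterior over candidate phone strings given nonsense text lam."""
    scores = {phi: rho(lam, phi) * prior[phi] for phi in candidates}
    z = sum(scores.values())
    return {phi: s / z for phi, s in scores.items()}

# Two candidate phone strings for the nonsense transcript "da".
candidates = [("d", "a"), ("t", "a")]
prior = {("d", "a"): 0.5, ("t", "a"): 0.5}
posterior = decode("da", candidates, prior)
```

Because the letter "d" is more likely to be written for the phone /d/ than for /t/ in this toy table, the posterior concentrates on the /d a/ candidate, which is exactly the kind of probabilistic transcript the paper feeds to ASR training.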


Introduction

Automatic speech recognition (ASR) has the potential to provide database access, simultaneous translation, and text/voice messaging services to anybody, in any language, dramatically reducing linguistic barriers to economic success. ASR has failed to achieve its potential because successful ASR requires very large labeled corpora; the human transcribers must be computer-literate, and they must be native speakers of the language being transcribed. In order to create the databases reported in this paper, for example, we sought paid native transcribers, at a competitive wage, for the 68 languages in which we have untranscribed audio data. In order to develop speech technology, it is necessary first to have (1) some amount of recorded speech audio, and (2) some amount of text written in the target language.

