Abstract

The transcription bottleneck is often cited as a major obstacle for efforts to document the world’s endangered languages and supply them with language technologies. One solution is to extend methods from automatic speech recognition and machine translation, and recruit linguists to provide narrow phonetic transcriptions and sentence-aligned translations. However, I believe that these approaches are not a good fit with the available data and skills, or with long-established practices that are essentially word-based. In seeking a more effective approach, I consider a century of transcription practice and a wide range of computational approaches, before proposing a computational model based on spoken term detection that I call “sparse transcription.” This represents a shift away from current assumptions that we transcribe phones, transcribe fully, and transcribe first. Instead, sparse transcription combines the older practice of word-level transcription with interpretive, iterative, and interactive processes that are amenable to wider participation and that open the way to new methods for processing oral languages.

Highlights

  • Most of the world’s languages only exist in spoken form

  • Behind the formats is the process for creating them: No matter how careful I think I am being with my transcriptions, from the very first text to the very last, for every language that I have ever studied in the field, I have had to re-transcribe my earliest texts in the light of new analyses that have come to light by the time I got to my later texts

  • We review existing computational approaches to transcription that go beyond the methods inspired by automatic speech recognition, and consider to what extent they already address the requirements coming from the practices of linguists


Summary

Introduction

Most of the world’s languages only exist in spoken form. These oral vernaculars include endangered languages and regional varieties of major languages. Even assuming that linguists comply with these exhortations, they must still correct the output of the recognizer while re-listening to the source audio, and they must still identify words and produce a word-level transcription. There are locally available skills, such as the ability of speakers to recognize words in context, repeat them in isolation, and say something about what they mean. This leads us to consider a new model for large-scale transcription that consists of identifying and cataloging words in an open-ended speech collection. I elaborate this “Sparse Transcription Model” and argue that it is a good fit for the task of transcribing oral languages, in terms of the available inputs, the desired outputs, and the available human capacity. I conclude with a summary of the contributions, highlighting benefits for flexibility, for scalability, and for working effectively alongside speakers of oral languages (Section 5).

Background
Why Linguists Transcribe
How Linguists Transcribe
Technological Support for Working with Oral Languages
Requirements for Learning to Transcribe
Computation
Segmenting and Aligning Phone Sequences
Leveraging Translations for Segmentation
Bypassing Transcription
Spoken Term Detection
Test Sets and Evaluation Measures
Summary
Addressing the Requirements
The Sparse Transcription Model
Overview
Transcription Tasks
Transcription Workflows
Evaluation
Conclusion
