Abstract
The transcription bottleneck is often cited as a major obstacle for efforts to document the world’s endangered languages and supply them with language technologies. One solution is to extend methods from automatic speech recognition and machine translation, and recruit linguists to provide narrow phonetic transcriptions and sentence-aligned translations. However, I believe that these approaches are not a good fit with the available data and skills, or with long-established practices that are essentially word-based. In seeking a more effective approach, I consider a century of transcription practice and a wide range of computational approaches, before proposing a computational model based on spoken term detection that I call “sparse transcription.” This represents a shift away from current assumptions that we transcribe phones, transcribe fully, and transcribe first. Instead, sparse transcription combines the older practice of word-level transcription with interpretive, iterative, and interactive processes that are amenable to wider participation and that open the way to new methods for processing oral languages.
Highlights
Most of the world’s languages exist only in spoken form.
Behind the formats is the process for creating them: no matter how carefully I transcribe, from the very first text to the very last, for every language I have ever studied in the field, I have had to re-transcribe my earliest texts in the light of analyses that only emerged by the time I reached my later texts.
We review existing computational approaches to transcription that go beyond methods inspired by automatic speech recognition, and consider to what extent they already address the requirements arising from the practices of linguists.
Summary
Most of the world’s languages exist only in spoken form. These oral vernaculars include endangered languages and regional varieties of major languages. Even if linguists heed the call to adopt automatic speech recognition, they must still correct the recognizer’s output while re-listening to the source audio, and they must still identify words and produce a word-level transcription. Meanwhile, there are locally available skills, such as the ability of speakers to recognize words in context, repeat them in isolation, and say something about what they mean. This leads us to consider a new model for large-scale transcription that consists of identifying and cataloging words in an open-ended speech collection. I elaborate this “Sparse Transcription Model” and argue that it is a good fit for the task of transcribing oral languages, in terms of the available inputs, the desired outputs, and the available human capacity. I conclude with a summary of the contributions, highlighting benefits for flexibility, for scalability, and for working effectively alongside speakers of oral languages (Section 5).
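The idea of identifying and cataloging words in a speech collection, rather than transcribing every phone, can be illustrated with a toy spoken term detection sketch. This is not the paper’s implementation: the lexicon, the one-dimensional “feature” sequences standing in for acoustic frames, and the threshold value are all invented for illustration. A dynamic time warping matcher slides each known word template over an utterance and records where it matches, yielding a sparse transcription of (word, start, end) hits while the unrecognized stretches stay blank.

```python
# Toy sketch of spoken term detection for sparse transcription.
# Assumption: utterances are sequences of 1-D feature values standing in
# for real acoustic frames; real systems would use MFCCs or similar.

def dtw_distance(a, b):
    """Dynamic time warping distance between two feature sequences."""
    INF = float("inf")
    n, m = len(a), len(b)
    d = [[INF] * (m + 1) for _ in range(n + 1)]
    d[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            d[i][j] = cost + min(d[i - 1][j], d[i][j - 1], d[i - 1][j - 1])
    return d[n][m]

def detect_terms(utterance, lexicon, threshold=1.0):
    """Scan an utterance for known word templates; return a sparse
    transcription as (word, start_frame, end_frame) hits, in time order.
    Frames matching no template are simply left untranscribed."""
    hits = []
    for word, template in lexicon.items():
        w = len(template)
        for start in range(len(utterance) - w + 1):
            window = utterance[start:start + w]
            if dtw_distance(window, template) <= threshold:
                hits.append((word, start, start + w))
    return sorted(hits, key=lambda h: h[1])

# Hypothetical lexicon of word templates, e.g. drawn from repeated tokens
# that a speaker has identified and glossed.
lexicon = {"kaya": [1.0, 2.0, 3.0], "ngapa": [5.0, 5.0, 1.0]}
utterance = [9.0, 1.0, 2.0, 3.0, 9.0, 5.0, 5.0, 1.0, 9.0]
print(detect_terms(utterance, lexicon, threshold=0.5))
# prints [('kaya', 1, 4), ('ngapa', 5, 8)]
```

The catalog grows iteratively: each newly confirmed word becomes another template, so later passes over the same audio recover more of the text, matching the interpretive, iterative workflow the model describes.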