Challenges of releasing audio material for spoken data: The case of the London-Lund Corpus 2

Nele Põldvere, Victoria Johansson, Johan Frid, Carita Paradis

doi:10.32714/ricl.09.01.04

Abstract

This article aims to describe key challenges of preparing and releasing audio material for spoken data and to propose solutions to these challenges. We draw on our experience of compiling the new London-Lund Corpus 2 (LLC-2), where transcripts are released together with the audio files. However, making the audio material publicly available required careful consideration of how to, most effectively, 1) align the transcripts with the audio and 2) anonymise personal information in the recordings. First, audio-to-text alignment was solved through the insertion of timestamps in front of speaker turns in the transcription stage, which, as we show in the article, may later be used as a valuable complement to more robust automatic segmentation. Second, anonymisation was done by means of a Praat script, which replaced all personal information with a sound that made the lexical information incomprehensible but retained the prosodic characteristics. The public release of the LLC-2 audio material is a valuable feature of the corpus that allows users to extend the corpus data relative to their own research interests and, thus, broaden the scope of corpus linguistics. To illustrate this, we present three studies that have successfully used the LLC-2 audio material.
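The timestamp-based alignment described above can be illustrated with a minimal sketch. It assumes a hypothetical turn format, `<mm:ss> SPEAKER: text`, which is not taken from the LLC-2 transcription conventions, and it treats each turn as ending where the next one begins. The names `TURN`, `Segment` and `turns_to_segments` are illustrative only.

```python
import re
from typing import List, NamedTuple

# Hypothetical turn format: "<mm:ss> A: text". The real LLC-2 markup may differ.
TURN = re.compile(r"^<(\d+):(\d{2})>\s*(\w+):\s*(.*)$")


class Segment(NamedTuple):
    start: float   # seconds from the beginning of the recording
    end: float     # start of the next turn, or the end of the audio
    speaker: str
    text: str


def turns_to_segments(lines: List[str], audio_duration: float) -> List[Segment]:
    """Turn timestamped speaker turns into coarse audio segments."""
    turns = []
    for line in lines:
        m = TURN.match(line.strip())
        if m is None:
            continue  # skip lines that do not start with a timestamp
        minutes, seconds, speaker, text = m.groups()
        turns.append((int(minutes) * 60 + int(seconds), speaker, text))

    segments = []
    for i, (start, speaker, text) in enumerate(turns):
        # Assumption: a turn lasts until the next timestamped turn begins.
        end = turns[i + 1][0] if i + 1 < len(turns) else audio_duration
        segments.append(Segment(float(start), float(end), speaker, text))
    return segments
```

Segments of this kind are coarse, since overlapping speech and pauses fall inside them, which is consistent with the abstract's point that turn-level timestamps complement rather than replace automatic segmentation.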

Highlights

This article aims to describe key challenges of preparing and releasing audio material for spoken data and to propose solutions to these challenges.
This is different from corpus linguistics where, normally, corpora are intended to be useful for a wide variety of linguistic interests, and where many researchers consider the primary data to be the transcriptions with annotations of lexical, morpho-syntactic and discourse features (Oostdijk and Boves 2008: 196).
We have focused on two challenges that we had to tackle during the compilation of LLC-2: 1) the alignment of the orthographic transcriptions with the audio files and 2) the anonymisation of personal information in the recordings.
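For the second challenge, the general idea of masking lexical content while keeping low-frequency information can be sketched as follows. This is not the authors' Praat script: it is a stand-in that applies a simple one-pole low-pass filter to marked intervals, and the function name `mask_intervals`, the `cutoff_hz` default and the interval format are all assumptions made for illustration. A filter this gentle would not by itself guarantee that speech becomes incomprehensible; a real tool would need a much steeper filter or a sound resynthesised from the pitch contour.

```python
import math
from typing import List, Sequence, Tuple


def mask_intervals(samples: Sequence[float], rate: int,
                   intervals: List[Tuple[float, float]],
                   cutoff_hz: float = 300.0) -> List[float]:
    """Low-pass filter the samples inside each (start_s, end_s) interval.

    Samples outside the intervals are returned unchanged.
    """
    out = list(samples)
    # Smoothing coefficient of a one-pole low-pass filter for this cutoff.
    alpha = 1.0 - math.exp(-2.0 * math.pi * cutoff_hz / rate)
    for start_s, end_s in intervals:
        lo = max(0, int(start_s * rate))
        hi = min(len(out), int(end_s * rate))
        if lo >= hi:
            continue  # empty or out-of-range interval
        state = out[lo]
        for i in range(lo, hi):
            # Move the filter state a fraction of the way towards the input.
            state += alpha * (out[i] - state)
            out[i] = state
    return out
```

In use, the intervals would come from manual annotation of names and other personal information, for example `mask_intervals(samples, 16000, [(12.4, 13.1)])` for a hypothetical name spoken between 12.4 and 13.1 seconds.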

Summary

INTRODUCTION

With the advent of several new spoken corpora, challenges related to the various aspects of spoken corpus compilation are currently receiving more and more attention in the research community. The aim of this article is to describe and propose solutions to key challenges of preparing and releasing audio material for spoken data. It is based on our experience of compiling the new London-Lund Corpus 2 (LLC-2; Põldvere et al. in press b; see the user guide in Põldvere et al. in press a). As is the case in many other spoken corpora, the transcriptions in LLC-2 are orthographic and contain information about basic features of spoken interaction such as pauses, overlapping speech and nonverbal vocalisations, but not prosodic and temporal information about pitch movement and the length of transitions between speaker turns. These features are important for spoken language research because they carry useful information about speaker intent.

AUDIO MATERIAL IN SPOKEN CORPORA

A review of corpora of spoken British English

CHALLENGES OF PREPARING LLC-2 AUDIO FILES FOR RELEASE

Audio-to-text alignment

Anonymisation

Applications of LLC-2 audio material

CONCLUSION AND FUTURE WORK

Journal: Research in Corpus Linguistics	Publication Date: Jan 1, 2021
Citations: 5	License type: cc-by
