The Corpus of British Isles Spoken English (CoBISE)

Steven Coats

doi:10.5617/dhnbpub.11286

Abstract

Corpora of transcribed regional speech are important for the study of dialects of English, but relatively few large corpora of transcribed naturalistic speech from the United Kingdom and Ireland exist. This paper presents the Corpus of British Isles Spoken English (CoBISE), a 112-million-word corpus of Automatic Speech Recognition (ASR) transcripts of YouTube videos from channels of councils and other government entities in the UK and Ireland. Transcripts are linked to publicly available videos, so the corpus can also serve as a starting point for the study of multimodal phenomena. The paper describes the methods used to identify relevant channels and the scripting pipeline used for data collection and processing. Because ASR transcripts contain errors, analyses that use the corpus should employ methods suited to “noisy data”. Two possible approaches are described: for frequent phenomena, appropriate feature selection combined with robust classification models; for rare phenomena, manual inspection of the audio/video data.

Journal: Digital Humanities in the Nordic and Baltic Countries Publications
Publication Date: Oct 6, 2022
License type: CC BY 4.0
