Abstract
Corpora of transcribed regional speech are important for the study of dialects of English, but relatively few large corpora of transcribed naturalistic speech from the United Kingdom and Ireland exist. This paper presents the Corpus of British Isles Spoken English (CoBISE), a 112-million-word corpus of Automatic Speech Recognition (ASR) transcripts of YouTube videos from channels of councils and other government entities in the UK and Ireland. Transcripts are linked to publicly available videos, so the corpus can also serve as a starting point for the study of multimodal phenomena. The paper describes the methods used for identifying relevant channels and the scripting pipeline for data collection and processing. Because ASR transcripts contain errors, analyses undertaken using the corpus should employ methods suitable for dealing with “noisy data”. Two possible approaches are described: for frequent phenomena, appropriate feature selection and the use of robust classification models; for rare phenomena, manual inspection of the audio/video data.
More From: Digital Humanities in the Nordic and Baltic Countries Publications