ABSTRACT
Advances in automatic speech recognition technology, increases in bandwidth availability, and the widespread use of video streaming and sharing platforms have opened new horizons for corpus phonetics. CoANZSE Audio, a searchable online version of the Corpus of Australian and New Zealand Spoken English, provides access to over 195 million words of transcribed speech from videos uploaded to YouTube by councils and other local government entities in Australia and New Zealand. Audio and forced alignment files are also available, making the resource suitable for investigating a range of research questions in morphosyntax, phonetics, and discourse. The resource, which is freely available via login through CLARIN, Europe’s main language resources infrastructure network, was created with open-source tools: yt-dlp, a Python library for collecting data from video and streaming websites; the Montreal Forced Aligner, a recent neural network alignment suite; and Parselmouth-Praat, Python bindings for the Praat acoustic analysis software. The website is powered by BlackLab, which combines a powerful search engine based on Apache Lucene with an intuitive web frontend. CoANZSE Audio may be useful for investigating regional variation in language features and, with additional annotation, differences in feature use across social or demographic groups. Recent applications have included studies of double modals, a rare syntactic feature, and of apology sequences. Because the audio and alignment data are available together, the resource may be especially suitable for the study of regional phonetic variation. Furthermore, the methods used to create the resource may be of interest to researchers seeking to adopt a pipeline approach for creating specialized corpora from publicly available online content.
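To give a sense of what such a pipeline approach might look like in practice, the following is a minimal sketch in Python and not the configuration actually used to build CoANZSE Audio. The function names, directory names, and the choice of WAV output and English automatic captions are assumptions made for illustration; the dictionary and acoustic model are left as parameters because the abstract does not state which were used.

```python
import subprocess
from pathlib import Path


def ytdlp_options(out_dir: str) -> dict:
    """Options for yt_dlp.YoutubeDL: audio as WAV plus automatic English captions.

    The specific settings are illustrative assumptions, not the authors' own.
    """
    return {
        "format": "bestaudio/best",
        "outtmpl": str(Path(out_dir) / "%(id)s.%(ext)s"),
        "writeautomaticsub": True,   # fetch ASR captions where available
        "subtitleslangs": ["en"],
        "postprocessors": [
            {"key": "FFmpegExtractAudio", "preferredcodec": "wav"},
        ],
    }


def download(urls: list, out_dir: str) -> None:
    """Download audio and captions with yt-dlp (requires yt-dlp and ffmpeg)."""
    import yt_dlp  # imported lazily so the helpers above work without it

    with yt_dlp.YoutubeDL(ytdlp_options(out_dir)) as ydl:
        ydl.download(urls)


def mfa_align_command(corpus_dir: str, dictionary: str,
                      acoustic_model: str, out_dir: str) -> list:
    """Build the Montreal Forced Aligner call: mfa align CORPUS DICT MODEL OUT.

    The corpus directory must pair each audio file with a transcript file,
    so downloaded captions need converting to plain text first (not shown).
    """
    return ["mfa", "align", corpus_dir, dictionary, acoustic_model, out_dir]


def align(corpus_dir: str, dictionary: str,
          acoustic_model: str, out_dir: str) -> None:
    """Run the aligner as a subprocess (requires a working MFA installation)."""
    subprocess.run(
        mfa_align_command(corpus_dir, dictionary, acoustic_model, out_dir),
        check=True,
    )


def first_formant_at(wav_path: str, time_s: float) -> float:
    """Return F1 in Hz at a given time using Parselmouth (Burg method)."""
    import parselmouth  # imported lazily, as above

    sound = parselmouth.Sound(wav_path)
    formant = sound.to_formant_burg()
    return formant.get_value_at_time(1, time_s)
```

In a workflow of this kind, the time stamps in the aligner's TextGrid output would supply the values passed as `time_s`, for example the midpoint of an aligned vowel segment.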