Abstract

The development of speech technology requires large amounts of data to estimate the underlying models. Even when relying on large multilingual pre-trained models, some amount of task-specific data in the target language is needed to fine-tune those models and obtain competitive performance. In this paper, we present a bilingual Basque–Spanish dataset extracted from parliamentary sessions. The dataset is designed to develop and evaluate automatic speech recognition (ASR) systems but can be easily repurposed for other speech-processing tasks (such as speaker or language recognition). The paper first compares the two target languages, emphasizing their similarities at the acoustic-phonetic level, which provides the basis for sharing data and compensating for the relatively small amount of spoken resources available for Basque. Then, Basque Parliament plenary sessions are characterized in terms of organization, topics, speaker turns and the use of the two languages. The paper continues with a description of the data collection procedure (involving both speech and text), the audio formats and conversions, and the creation and post-processing of text transcriptions and session minutes. It then describes the semi-supervised iterative procedure used to cut, rank and select the training segments, and the manual supervision process employed to produce the test set. Finally, ASR experiments using state-of-the-art technology are presented to validate the dataset and to set a reference for future work. The datasets, along with models and recipes to reproduce the experiments reported in the paper, are released through Hugging Face.
