Abstract
This paper describes the automatic speech recognition (ASR) systems built by the MLLP-VRAIN research group of the Universitat Politècnica de València for the Albayzín-RTVE 2020 Speech-to-Text Challenge, and extends that work by building and evaluating equivalent systems under the closed data conditions of the 2018 challenge. The primary system (p-streaming_1500ms_nlt) was a hybrid ASR system using streaming one-pass decoding with a context window of 1.5 seconds; it achieved 16.0% WER on the test-2020 set. We also submitted three contrastive systems. Among these, we highlight c2-streaming_600ms_t, which followed the same configuration as the primary system but with a smaller context window of 0.6 s, and achieved 16.9% WER on the same test set with a measured empirical latency of 0.81 ± 0.09 s (mean ± standard deviation). In other words, we obtained state-of-the-art latencies for high-quality automatic live captioning at the cost of a small relative WER degradation of 6%. As an extension, the equivalent closed-condition systems obtained 23.3% and 23.5% WER, respectively. When evaluated with an unconstrained language model, they obtained 19.9% and 20.4% WER; that is, not far behind the top-performing systems, despite using only 5% of the full acoustic data and while additionally being streaming-capable. All of these streaming systems could be deployed in production environments for automatic captioning of live media streams.
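To make the latency figures concrete, below is a minimal, illustrative Python sketch of one-pass streaming decoding with a bounded future-context window. The function names, the 10 ms frame rate, and the chunking scheme are assumptions for illustration only, not the actual MLLP-VRAIN implementation; the point is that the size of the future-context window (1.5 s vs. 0.6 s) bounds how far the decoder can look ahead before emitting labels, and hence the system's latency.

```python
import numpy as np

def stream_decode(features, decode_chunk, step_frames=10, context_frames=150):
    """Illustrative streaming decoding loop with a bounded future-context
    window (hypothetical sketch; not the authors' code).

    features:       (T, D) acoustic feature matrix, assumed 10 ms per frame
    decode_chunk:   stand-in for one step of a real BLSTM-HMM decoder
    step_frames:    frames whose labels are emitted per decoding step
    context_frames: future context in frames; 150 frames = 1.5 s as in the
                    primary system, 60 frames = 0.6 s as in c2-streaming_600ms_t
    """
    T = features.shape[0]
    hypotheses = []
    for start in range(0, T, step_frames):
        # The decoder may only look `context_frames` frames ahead, so the
        # window size bounds the theoretical latency of each emitted label.
        end = min(start + step_frames + context_frames, T)
        hypotheses.append(decode_chunk(features[start:end]))
    return hypotheses

# Toy usage: 3 s of 80-dim features with a dummy decoder returning chunk sizes.
feats = np.zeros((300, 80))
print(stream_decode(feats, decode_chunk=lambda chunk: chunk.shape[0])[:3])
```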
Highlights
This paper describes the participation of the Machine Learning and Language Processing (MLLP) research group from the Valencian Research Institute for Artificial Intelligence (VRAIN), hosted at the Universitat Politècnica de València (UPV), in the Albayzín-Radio y Televisión Española (RTVE) 2020 Speech-to-Text Challenge.
This work describes our latest developments in this area, showing how advanced automatic speech recognition (ASR) technology can be successfully applied under streaming conditions, providing high-quality transcriptions with state-of-the-art system latencies on real-life tasks such as the RTVE (Radiotelevisión Española) database.
While tuning our ASR system for this competition, we explored streaming-related decoding parameters to optimize the Word Error Rate (WER) on dev1-dev, using BLSTM-hidden Markov model (HMM) acoustic models (AMs) and a linear combination of all three language models (LMs).
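As a minimal sketch of how a linear combination of LMs can work, each word probability is a weighted sum of the component LMs' probabilities, with weights typically tuned to minimize perplexity on a development set such as dev1-dev. The weights and probabilities below are invented for illustration; the paper's actual values are not given here.

```python
def interpolate_lms(probs, weights):
    """Linear interpolation of language models: P(w|h) = sum_i w_i * P_i(w|h).

    probs:   per-LM probabilities P_i(w|h) for the same word and history
    weights: non-negative interpolation weights summing to 1, typically
             tuned to minimize perplexity on a held-out development set
    """
    assert abs(sum(weights) - 1.0) < 1e-6 and all(w >= 0 for w in weights)
    return sum(w * p for w, p in zip(weights, probs))

# Hypothetical example with three LMs (all numbers invented):
p = interpolate_lms([0.012, 0.030, 0.005], [0.5, 0.3, 0.2])
print(f"P(w|h) = {p:.4f}")  # 0.5*0.012 + 0.3*0.030 + 0.2*0.005 = 0.0160
```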
Summary
This article describes the MLLP-VRAIN systems built for the Albayzín-RTVE 2020 Speech-to-Text (S2T) Challenge, with an extension focused on building equivalent systems under the 2018 closed data conditions. The article is an extended version of the original submission to the Challenge, presented at IberSPEECH 2020 [1]. Live audio and video streams such as TV broadcasts, conferences, and lectures, as well as general-public video streaming services (e.g., YouTube) over the Internet, have increased dramatically in recent years thanks to advances in networking, with high-speed connections and adequate bandwidth. Due to the COVID-19 pandemic, video meeting/conferencing platforms have experienced exponential growth in usage. More and more countries are requiring by law that TV broadcasters provide accessibility options to people with hearing disabilities, with the minimum amount of content to be captioned increasing year by year [2,3].