A survey of Polish ASR speech datasets

Michał Junczyk

doi:10.1515/psicl-2023-0019

Abstract

Access to speech datasets is essential for the effective use of modern ASR systems in low-resource languages like Polish. However, the lack of centralized information and metadata describing available datasets poses a significant challenge to researchers and practitioners. In this paper, we address this issue by presenting the most comprehensive survey of Polish ASR speech datasets to date. We manually curated information on 53 publicly available datasets and annotated them with 61 attributes, providing a detailed catalog of these resources. The catalog facilitates the discovery and evaluation of available datasets, enabling researchers to identify datasets that suit their specific needs. It also enables the identification of gaps in the existing datasets, which may inform future research directions. The catalog is open and community-driven: new datasets can be added and issues can be reported, ensuring its continued relevance and usefulness to the ASR community. Our work contributes to improving the accessibility and usability of ASR systems in low-resource languages such as Polish.

Full Text