Sequence-to-sequence models have been applied to many challenging problems in text and speech technologies. One such problem is normalization, the transformation of non-standard language forms into their standard counterparts. Non-standard forms arise from a variety of written and spoken sources. This paper deals with one such source: speech in Slovenian, a less-resourced, highly inflected language. The paper explores speech corpora recently collected in public and private environments. We analyze the performance of three sequence-to-sequence models for automatic normalization from literal transcriptions to standard forms. Experiments were performed with words, subwords, and characters as the basic units for normalization. We demonstrate that the effectiveness of each approach is linked to the choice of basic modeling unit: statistical models perform best with words, while neural network-based models perform best with characters. The experimental results show that the best overall results are obtained with character-based neural architectures, with long short-term memory and transformer architectures giving comparable results. We also present a novel analysis tool, which we use for an in-depth error analysis of the results obtained by the character-based models. This analysis shows that systems with similar overall results can differ in their performance on different types of errors. Errors made by the transformer architecture are easier to correct in the post-editing process, an important insight given that creating speech corpora is time-consuming and costly. The analysis tool also incorporates two statistical significance tests: approximate randomization and bootstrap resampling. Both tests confirm that the neural network-based models improve on the statistical ones.
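To make the notion of basic modeling units concrete, here is a minimal sketch of how a literal/standard pair can be segmented into words or characters before being fed to a sequence-to-sequence model. The Slovenian example pair is illustrative, not taken from the paper's corpora; subword segmentation additionally requires a learned vocabulary (e.g., BPE) and is omitted here.

```python
# Hypothetical Slovenian example: a literal (spoken) transcription and
# its standard written form; the pair is illustrative only.
literal = "jest sm reku"
standard = "jaz sem rekel"

def to_units(text, unit):
    """Split a normalization example into basic modeling units."""
    if unit == "word":
        return text.split()
    if unit == "char":
        # Spaces are kept as explicit symbols so word boundaries survive.
        return [c if c != " " else "<sp>" for c in text]
    raise ValueError(f"unknown unit type: {unit}")

print(to_units(literal, "word"))  # ['jest', 'sm', 'reku']
print(to_units(literal, "char"))  # ['j', 'e', 's', 't', '<sp>', 's', 'm', ...]
```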
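The two significance tests named in the abstract are standard in NLP evaluation. Below is a minimal sketch of both, applied to hypothetical per-sentence error counts of a statistical and a neural system; the paper's own analysis tool is not reproduced here, and all names and data are illustrative.

```python
import random

def approximate_randomization(errors_a, errors_b, trials=10000, seed=0):
    """Two-sided approximate randomization test on paired per-sentence
    error counts of two systems. Returns an estimated p-value."""
    rng = random.Random(seed)
    observed = abs(sum(errors_a) - sum(errors_b))
    hits = 0
    for _ in range(trials):
        sum_a = sum_b = 0
        for ea, eb in zip(errors_a, errors_b):
            # Randomly swap the paired outputs between the two systems.
            if rng.random() < 0.5:
                ea, eb = eb, ea
            sum_a += ea
            sum_b += eb
        if abs(sum_a - sum_b) >= observed:
            hits += 1
    return (hits + 1) / (trials + 1)

def bootstrap_resampling(errors_base, errors_new, trials=10000, seed=0):
    """Paired bootstrap: resample test sentences with replacement and
    estimate how often the new system fails to beat the baseline."""
    rng = random.Random(seed)
    n = len(errors_base)
    failures = 0
    for _ in range(trials):
        idx = [rng.randrange(n) for _ in range(n)]
        if sum(errors_new[i] for i in idx) >= sum(errors_base[i] for i in idx):
            failures += 1
    return failures / trials

# Hypothetical per-sentence error counts on the same test set.
statistical = [3, 1, 0, 2, 4, 1, 2, 0, 3, 1]
neural = [1, 0, 0, 1, 2, 1, 1, 0, 1, 0]
print(approximate_randomization(statistical, neural))
print(bootstrap_resampling(statistical, neural))
```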