Automatic Speech Recognition and Pronunciation Error Detection of Dutch Non-native Speech: cumulating speech resources in a pluricentric language

X Wei, C Cucchiarini, R Van Hout, H Strik

doi:10.1016/j.specom.2022.08.004

Abstract

• Improving the performance of Automatic Speech Recognition (ASR) on learners' speech in a non-dominant variety of a pluricentric language by cumulating speech resources from different varieties of the same language.
• Through transfer learning, knowledge from a dominant variety can be transferred to a non-dominant variety, which benefits non-native speech recognition.
• Introducing plausible pronunciation errors in a native corpus of the non-dominant variety, based on knowledge from the dominant variety, to evaluate the performance of Pronunciation Error Detection (PED) algorithms.

The shortage of large-scale learner speech corpora and the lack of precise manual annotations are two major challenges for automatic L2 speech recognition and error detection in L2 speech, especially for non-dominant varieties of pluricentric languages. In these cases, collecting and annotating large non-native (L2 learner) corpora for all language varieties is often unattainable. In this study, we investigated ways of addressing these problems through conventional and transfer learning Deep Neural Network (DNN) based ASR and ASR-based PED by cumulating Netherlandic Dutch and Flemish Dutch speech resources. First, we show that for ASR the baseline system can be improved by combining the Netherlandic Dutch and Flemish Dutch datasets. Next, by transferring the knowledge learned by models trained on the Netherlandic Dutch data, the Flemish Dutch learners' ASR model can be further improved. To evaluate the performance of the PED algorithms in the absence of learner speech data with pronunciation error annotations, we introduced plausible pronunciation errors in the native corpora, based on knowledge from Flemish learner speech, to simulate non-native speech errors. For PED we found that the results are much better for a Goodness of Pronunciation (GOP) classifier trained on Flemish Dutch data than for one trained on Netherlandic Dutch data. PED produced worse results when the Netherlandic Dutch data were merged with the Flemish Dutch data, while for ASR, lower word error rates (WERs) were attained. Whether adding Netherlandic Dutch data to Flemish Dutch data is beneficial thus seems to depend on the specific task the data are used for. We discuss these results, compare them to those of related research, and suggest avenues for future research.
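
The ASR results above are reported as word error rates. As a reminder of what that metric measures, here is a minimal sketch of the standard WER definition: word-level edit distance (substitutions, deletions, insertions) divided by the number of reference words. The function name `word_error_rate` and the whitespace tokenization are assumptions made for illustration; this is not the scoring tool used in the study, which the abstract does not specify.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Standard WER: word-level Levenshtein distance / number of reference words.

    Assumption: words are separated by whitespace and compared verbatim
    (no text normalization), which real scoring pipelines usually add.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # aligning i reference words to an empty hypothesis: i deletions
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,                      # deletion of a reference word
                curr[j - 1] + 1,                  # insertion of a hypothesis word
                prev[j - 1] + substitution_cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # Hypothetical example sentences, not taken from the corpora in the study.
    print(word_error_rate("de kat zit op de mat", "de kat zat op mat"))
```

Because the denominator is the reference length, WER can exceed 1.0 when the hypothesis contains many insertions.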

Full Text