Abstract
A growing number of people are learning Mandarin, which makes computer-assisted pronunciation training systems for Mandarin learners increasingly important. A pivotal component of these systems is Mispronunciation Detection and Diagnosis (MDD), the technique for identifying and diagnosing mispronunciations. Recently, some end-to-end approaches have fused prompt text features with acoustic features during training and have shown good results. However, these approaches fuse the two kinds of features with only a simple attention mechanism. In this paper, we posit that the impact of text features varies significantly depending on the acoustic characteristics they are mapped to. Furthermore, we propose that the prompt text can guide the model towards an integrated text-audio representation, thereby improving inference quality. We therefore present a model for detecting and diagnosing mispronunciations that uses a bidirectional attention mechanism to integrate acoustic and prompt text features. In experiments on a self-built dataset of short Mandarin read-aloud texts, the model achieved good results.
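To make the idea of bidirectional attention between acoustic and prompt text features concrete, the following is a minimal sketch and not the model described in the paper. It assumes that both feature sequences have already been projected to the same dimension, uses scaled dot-product similarity, and fuses by simple concatenation. The function name `bidirectional_attention` and the helpers are hypothetical and chosen for illustration only.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def transpose(m):
    return [list(col) for col in zip(*m)]


def matmul(a, b):
    """Multiply an n x k matrix by a k x m matrix (lists of lists)."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def bidirectional_attention(audio, text):
    """Illustrative sketch (assumption, not the paper's architecture).

    audio: T x d acoustic frame features
    text:  N x d prompt text (e.g. phoneme) features, same d assumed
    """
    d = len(audio[0])
    scale = math.sqrt(d)
    # Similarity between every audio frame and every prompt token (T x N).
    sim = [[s / scale for s in row] for row in matmul(audio, transpose(text))]
    # Audio-to-text direction: each frame attends over the prompt tokens.
    a2t = [softmax(row) for row in sim]              # T x N
    audio_ctx = matmul(a2t, text)                    # T x d
    # Text-to-audio direction: each token attends over the audio frames.
    t2a = [softmax(row) for row in transpose(sim)]   # N x T
    text_ctx = matmul(t2a, audio)                    # N x d
    # Fuse each frame with its text context by concatenation (T x 2d).
    fused = [frame + ctx for frame, ctx in zip(audio, audio_ctx)]
    return fused, text_ctx, a2t, t2a


# Toy inputs: 3 audio frames and 2 prompt tokens, feature dimension 2.
audio = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
text = [[1.0, 0.0], [0.0, 1.0]]
fused, text_ctx, a2t, t2a = bidirectional_attention(audio, text)
```

Because the two directions normalize the same similarity matrix along different axes, the weight a given text token receives can differ from frame to frame, which is the intuition behind letting the influence of text features vary across acoustic characteristics.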