Abstract

Automatic pronunciation assessment has several difficulties. Adequacy in controlling the vocal organs is often estimated from the spectral envelopes of input utterances, but the envelope patterns are also affected by other factors such as speaker identity. Recently, a new method of speech representation was proposed where these non-linguistic variations are effectively removed through modeling only the contrastive aspects of speech features. This speech representation is called speech structure. However, the often excessively high dimensionality of the speech structure can degrade the performance of structure-based pronunciation assessment. To deal with this problem, we integrate multilayer regression analysis with the structure-based assessment. The results show higher correlation between human and machine scores and also show much higher robustness to speaker differences compared to widely used GOP-based analysis.

Index Terms: CALL, speech structure, regression, GOP

1. Introduction

Automatic pronunciation assessment is a task used to evaluate only the linguistic aspect of utterances. However, speech features inevitably include acoustic variations caused by non-linguistic factors such as the speaker, communication channel and noise. The same pronunciation can lead to different acoustic observations due to different speakers and different environments. To deal with these variations, modern pronunciation assessment approaches mainly make use of statistical methods to model the distributions of the acoustic features [1]. These methods can achieve relatively high performance when there is a good match between training and testing conditions. But their performance always degrades significantly when these conditions are mismatched. In Automatic Speech Recognition (ASR), speaker adaptation techniques have proved effective at reducing mismatches. However, if the acoustic models used in pronunciation assessment are adapted to learners, incorrect pronunciations might be recognized as correct due to over-adaptation [2].

To solve the mismatch problem, the third author of this paper proposed a new speech representation, called speech structure, which aims at removing the non-linguistic factors in speech features [3]. In contrast to classical speech models, speech structures make use of
