Faculty of Engineering, Tohoku University, Aramaki aza Aoba 6–6–5, Sendai, 980–8579 Japan
(Received 28 August 2006, Accepted for publication 27 October 2006)
Keywords: CALL system, Pronunciation error detection, Decision tree, Clustering
PACS number: 43.72.Ne [doi:10.1250/ast.28.131]

1. Introduction
Recently, several computer-assisted language learning (CALL) systems that exploit spoken language technology have been developed. Automatic assessment of a learner's pronunciation is one of the main issues of such systems [1]. We are developing a mispronunciation detection system based on multilingual phone models [2]. This method first prepares 'mispronunciation rules,' which describe the mispronunciations a learner tends to make. When a learner utters a sentence presented by the system, the system prepares an acoustic model of the correct pronunciation of the sentence as well as that of the mispronounced sentence. Then, the likelihood values of both models are calculated using the learner's utterance, and the two values are compared. If the likelihood of the correct model is higher, the learner's utterance is likely to be correct; otherwise, the utterance is judged to be mispronounced.

One problem with this detection method is that it tends to evaluate utterances more strictly than human evaluators do, because the strictness of a human evaluator depends on the linguistic context of the phoneme being evaluated. To solve this problem, the strictness of mispronunciation detection has to be adjusted so that the system's judgment becomes similar to that of a human.

In this paper, we propose a method of solving this problem by adjusting the thresholds of error detection. This method forms several clusters of mispronunciation rules using a decision tree [3], and the optimum threshold is determined for each cluster.

2. Pronunciation error detection based on the mispronunciation rules
First, we explain the framework of pronunciation error detection based on multilingual phone models. In this work, the target language is English and the native language of the learners is Japanese. The system first gives the learner a word or a sentence to pronounce. When the learner utters the word or sentence, the system records the speech and performs an acoustic analysis. Here, the input speech is denoted by O. Next, the system prepares a model of the correct pronunciation of the presented word or sentence. The 'model' is composed of connected hidden Markov models (HMMs), each of which models a phoneme. This model of the correct pronunciation is denoted by λ_c. Then, the mispronunciation rules are applied to the correct model to generate the models
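The rule-application step described above can be sketched in code. This is a minimal illustration, not the authors' implementation: it assumes substitution-type rules in a hypothetical `(target_phone, substitute_phone)` format and applies each rule at every matching position of the correct phone sequence to produce candidate error sequences (each of which would then be turned into a concatenated-HMM model).

```python
def apply_rules(phones, rules):
    """Return one variant phone sequence per (rule, position) match.

    phones: list of phone labels for the correct pronunciation.
    rules:  list of (target_phone, substitute_phone) pairs
            (a simplified, hypothetical rule format).
    """
    variants = []
    for target, substitute in rules:
        for i, phone in enumerate(phones):
            if phone == target:
                # Substitute the phone at position i, keeping the rest intact.
                variants.append(phones[:i] + [substitute] + phones[i + 1:])
    return variants

# Example: the /r/-/l/ confusion typical of Japanese learners, applied to "right".
variants = apply_rules(["r", "ay", "t"], [("r", "l")])
```

In a full system the rules would also cover insertions and deletions (e.g., vowel insertion after word-final consonants), but the substitution case shows the idea.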
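The detection decision itself, including the per-cluster threshold adjustment proposed in the introduction, can be summarized in a short sketch. This is an illustration under assumed names and values, not the paper's implementation: `is_mispronounced` compares the log-likelihoods of the correct model and an error-rule model, and the cluster names and threshold values in `cluster_threshold` are hypothetical.

```python
def is_mispronounced(loglik_correct, loglik_error, threshold=0.0):
    """Flag the utterance as mispronounced when the error-rule model
    outscores the correct model by more than `threshold`.

    A positive threshold makes the detector more lenient, which is how
    the system's strictness can be brought closer to a human rater's.
    """
    return (loglik_error - loglik_correct) > threshold

# Hypothetical per-cluster thresholds: mispronunciation rules grouped by a
# decision tree share one threshold, tuned so that the system's judgments
# agree with human evaluations for that cluster.
cluster_threshold = {"r-l_confusion": 2.5, "vowel_insertion": 0.5}

flag = is_mispronounced(-120.0, -118.0, cluster_threshold["r-l_confusion"])
```

With the default threshold of 0.0 the rule fires whenever the error model scores even slightly higher, which corresponds to the overly strict baseline behavior described above.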