Abstract

With increasing global demand for learning English as a second language, there has been considerable interest in methods for the automatic assessment of spoken language proficiency, both for use in interactive electronic learning tools and for grading candidates for formal qualifications. This paper presents an automatic system for assessing spontaneous spoken language. Prompts or questions that require spontaneous responses elicit more natural speech, which better reflects a learner’s proficiency level than read speech. In addition to the challenges of highly variable non-native learner speech and noisy real-world recording conditions, this requires any automatic system to handle disfluent, non-grammatical, spontaneous speech for which the underlying text is unknown. To address these challenges, a strong deep learning-based speech recognition system is applied in combination with a Gaussian Process (GP) grader. A range of features derived from the audio using the recognition hypothesis are investigated for their efficacy in the automatic grader. The proposed system is shown to predict grades at a level similar to the original examiner graders on real candidate entries, and interpolation with the examiner grades further boosts performance. The ability to reject poorly estimated grades is also important, and measures are proposed to evaluate the performance of rejection schemes. The GP predictive variance is used to decide which automatic grades should be rejected; backing off to an expert grader for the least confident grades gives further gains.
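The grading, rejection and interpolation steps described above can be illustrated with a short sketch. It assumes scikit-learn, a hypothetical feature matrix X (one row of audio, fluency, confidence, linguistic and pronunciation features per response) and examiner grades y; the kernel, rejection threshold and interpolation weight are illustrative choices, not the paper's configuration.

```python
# Minimal sketch of a GP grader with variance-based rejection.
# X_train, y_train, X_test are hypothetical placeholders; the feature set,
# kernel and thresholds are assumptions for illustration only.
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

def train_grader(X_train, y_train):
    # RBF kernel plus a noise term; the kernel choice is an assumption.
    kernel = RBF(length_scale=1.0) + WhiteKernel(noise_level=0.1)
    gp = GaussianProcessRegressor(kernel=kernel, normalize_y=True)
    gp.fit(X_train, y_train)
    return gp

def grade_with_rejection(gp, X_test, std_threshold=0.5):
    # Predictive mean is the automatic grade; predictive standard deviation
    # is the uncertainty used to decide which grades to reject.
    mean, std = gp.predict(X_test, return_std=True)
    reject = std > std_threshold   # least confident grades back off to an examiner
    return mean, std, reject

def interpolate(auto_grade, examiner_grade, weight=0.5):
    # Simple linear interpolation of automatic and examiner grades; the weight
    # would be tuned on held-out data and may differ from the paper's scheme.
    return weight * auto_grade + (1.0 - weight) * examiner_grade
```

On this reading, rejected responses are sent to a human examiner, while the remaining automatic grades can optionally be combined with examiner grades by a simple weighted average as sketched above.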

Highlights

  • There is high demand around the world for learning English as a second language

  • In [14], a deep neural network (DNN)-based automatic speech recognition (ASR) system gave a 31% relative word error rate (WER) reduction on data from the Arizona English Language Learner Assessment (AZELLA) test, which comprises a variety of spoken tasks developed by professional educators (relative WER reduction is illustrated in the first sketch after this list)

  • Fluency features are derived from the speech recognition system hypothesis, time-aligned to the audio (see the second sketch after this list)
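As a quick illustration of the "relative reduction" arithmetic in the second highlight, the sketch below uses made-up WER values; these numbers are not results from [14].

```python
# Relative WER reduction: the drop in word error rate expressed as a
# fraction of the baseline WER. The example values are hypothetical.
def relative_wer_reduction(wer_baseline, wer_new):
    return (wer_baseline - wer_new) / wer_baseline

print(relative_wer_reduction(0.45, 0.31))  # ~0.31, i.e. roughly a 31% relative reduction
```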

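As an illustration of the idea in the last highlight, the following sketch computes a few simple fluency statistics from a hypothetical time-aligned hypothesis given as (word, start, end) tuples in seconds; the grader's actual fluency feature set is richer and is not reproduced here.

```python
# Minimal sketch of fluency features from a time-aligned ASR hypothesis.
# The input format and the specific features are assumptions for illustration.
def fluency_features(aligned_words, long_pause=0.5):
    if not aligned_words:
        return {"speaking_rate": 0.0, "mean_silence": 0.0, "long_pauses": 0}
    total_time = aligned_words[-1][2] - aligned_words[0][1]
    # Silences between consecutive words (negative gaps from overlaps are dropped).
    gaps = [nxt[1] - cur[2] for cur, nxt in zip(aligned_words, aligned_words[1:])]
    gaps = [g for g in gaps if g > 0]
    return {
        "speaking_rate": len(aligned_words) / max(total_time, 1e-6),   # words per second
        "mean_silence": sum(gaps) / len(gaps) if gaps else 0.0,        # average inter-word gap
        "long_pauses": sum(g >= long_pause for g in gaps),             # count of long pauses
    }

# Example: hypothesis "the cat sat" with word timings in seconds.
print(fluency_features([("the", 0.0, 0.2), ("cat", 0.9, 1.2), ("sat", 1.3, 1.6)]))
```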

Summary

Received December 2017; revised 5 July 2018; accepted 2 September 2018.

ALTA Institute / Department of Engineering, University of Cambridge, Cambridge, U.K.

Introduction
BULATS data
Transcription generation
Speech Recognition System
Grader Features
Audio and fluency features
Confidence features
Linguistic features
Parse tree features
PoS tag features
Pronunciation Features
Grader
Experiments
Grader performance
Interpolation with examiner grader
Rejection of scores
Conclusions