Abstract

With increasing global demand for learning English as a second language, there has been considerable interest in methods for the automatic assessment of spoken language proficiency, both for use in interactive electronic learning tools and for grading candidates for formal qualifications. This paper presents an automatic system for assessing spontaneous spoken language. Prompts or questions that require spontaneous responses elicit more natural speech, which better reflects a learner’s proficiency level than read speech. In addition to the challenges of highly variable non-native learner speech and noisy real-world recording conditions, this requires any automatic system to handle disfluent, non-grammatical, spontaneous speech for which the underlying text is unknown. To address these challenges, a strong deep learning-based speech recognition system is applied in combination with a Gaussian Process (GP) grader. A range of features derived from the audio using the recognition hypothesis are investigated for their efficacy in the automatic grader. The proposed system is shown to predict grades at a level similar to the original examiner graders on real candidate entries, and interpolation with the examiner grades further boosts performance. The ability to reject poorly estimated grades is also important, and measures are proposed to evaluate the performance of rejection schemes. The GP predictive variance is used to decide which automatic grades should be rejected; backing off to an expert grader for the least confident grades gives further gains.
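The grading, rejection and interpolation steps described above can be illustrated with a short sketch. It assumes scikit-learn, a hypothetical feature matrix X (one row of audio, fluency, confidence, linguistic and pronunciation features per response) and examiner grades y; the kernel, rejection threshold and interpolation weight are illustrative choices, not the paper's configuration.

```python
# Minimal sketch of a GP grader with variance-based rejection.
# X_train, y_train, X_test are hypothetical placeholders; the feature set,
# kernel and thresholds are assumptions for illustration only.
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

def train_grader(X_train, y_train):
    # RBF kernel plus a noise term; the kernel choice is an assumption.
    kernel = RBF(length_scale=1.0) + WhiteKernel(noise_level=0.1)
    gp = GaussianProcessRegressor(kernel=kernel, normalize_y=True)
    gp.fit(X_train, y_train)
    return gp

def grade_with_rejection(gp, X_test, std_threshold=0.5):
    # Predictive mean is the automatic grade; predictive standard deviation
    # is the uncertainty used to decide which grades to reject.
    mean, std = gp.predict(X_test, return_std=True)
    reject = std > std_threshold   # least confident grades back off to an examiner
    return mean, std, reject

def interpolate(auto_grade, examiner_grade, weight=0.5):
    # Simple linear interpolation of automatic and examiner grades; the weight
    # would be tuned on held-out data and may differ from the paper's scheme.
    return weight * auto_grade + (1.0 - weight) * examiner_grade
```

On this reading, rejected responses are sent to a human examiner, while the remaining automatic grades can optionally be combined with examiner grades by a simple weighted average as sketched above.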

Highlights

  • There is high demand around the world for learning English as a second language

  • In [14], a deep neural network (DNN)-based automatic speech recognition (ASR) system gave a 31% relative word error rate (WER) reduction on data from the Arizona English Language Learner Assessment (AZELLA) test, which comprises a variety of spoken tasks developed by professional educators (relative WER reduction is illustrated in the first sketch after this list)

  • Fluency features are derived from the speech recognition system hypothesis, time-aligned to the audio (see the second sketch after this list)
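As a quick illustration of the "relative reduction" arithmetic in the second highlight, the sketch below uses made-up WER values; these numbers are not results from [14].

```python
# Relative WER reduction: the drop in word error rate expressed as a
# fraction of the baseline WER. The example values are hypothetical.
def relative_wer_reduction(wer_baseline, wer_new):
    return (wer_baseline - wer_new) / wer_baseline

print(relative_wer_reduction(0.45, 0.31))  # ~0.31, i.e. roughly a 31% relative reduction
```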

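As an illustration of the idea in the last highlight, the following sketch computes a few simple fluency statistics from a hypothetical time-aligned hypothesis given as (word, start, end) tuples in seconds; the grader's actual fluency feature set is richer and is not reproduced here.

```python
# Minimal sketch of fluency features from a time-aligned ASR hypothesis.
# The input format and the specific features are assumptions for illustration.
def fluency_features(aligned_words, long_pause=0.5):
    if not aligned_words:
        return {"speaking_rate": 0.0, "mean_silence": 0.0, "long_pauses": 0}
    total_time = aligned_words[-1][2] - aligned_words[0][1]
    # Silences between consecutive words (negative gaps from overlaps are dropped).
    gaps = [nxt[1] - cur[2] for cur, nxt in zip(aligned_words, aligned_words[1:])]
    gaps = [g for g in gaps if g > 0]
    return {
        "speaking_rate": len(aligned_words) / max(total_time, 1e-6),   # words per second
        "mean_silence": sum(gaps) / len(gaps) if gaps else 0.0,        # average inter-word gap
        "long_pauses": sum(g >= long_pause for g in gaps),             # count of long pauses
    }

# Example: hypothesis "the cat sat" with word timings in seconds.
print(fluency_features([("the", 0.0, 0.2), ("cat", 0.9, 1.2), ("sat", 1.3, 1.6)]))
```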

Summary

Received December 2017; revised 5 July 2018; accepted 2 September 2018.

ALTA Institute / Department of Engineering, University of Cambridge, Cambridge, U.K.

Introduction
BULATS data
Transcription generation
Speech Recognition System
Grader Features
Audio and fluency features
Confidence features
Linguistic features
Parse tree features
PoS tag features
Pronunciation Features
Grader
Experiments
Grader performance
Interpolation with examiner grader
Rejection of scores
Conclusions