Explainable Automated Essay Scoring: Deep Learning Really Has Pedagogical Value

Vivekanandan Kumar, David Boulanger

doi:10.3389/feduc.2020.572367

Open Access

Journal: Frontiers in Education
Publication Date: Oct 6, 2020
Citations: 35
License type: CC BY 4.0

Affiliation: Athabasca University

Abstract

Automated essay scoring (AES) is a compelling topic in Learning Analytics, primarily because recent advances in AI treat it as a good testbed for exploring the artificial supplementation of human creativity. However, a vast swath of research tackles AES only holistically; few studies have developed AES models at the rubric level, the very first layer of explanation underlying the prediction of holistic scores. Consequently, the AES black box has remained impenetrable. Although several algorithms from Explainable Artificial Intelligence have recently been published, no research has yet investigated the role that these explanation models can play in a) discovering the decision-making process that drives AES, b) fine-tuning predictive models to improve generalizability and interpretability, and c) providing personalized, formative, and fine-grained feedback to students during the writing process. Building on previous studies in which models were trained on the Automated Student Assessment Prize's essay datasets to predict both the holistic and rubric scores of essays, this study focuses on predicting the quality of the writing style of Grade-7 essays and exposes the decision processes that lead to these predictions. In doing so, it evaluates the impact of deep learning (multi-layer perceptron neural networks) on the performance of AES. The effect of deep learning was found to be most visible when assessing the trustworthiness of explanation models: as more hidden layers were added to the neural network, the descriptive accuracy increased by about 10%. This study also shows that faster (up to three orders of magnitude) SHAP implementations are as accurate as the slower model-agnostic one. It leverages the state of the art in natural language processing, applying feature selection to a pool of 1592 linguistic indices that measure aspects of text cohesion, lexical diversity, lexical sophistication, and syntactic sophistication and complexity.
In addition to the list of the most globally important features, this study reports a) a list of features that are important for a specific essay (locally), b) a range of values for each feature that contribute to higher or lower rubric scores, and c) a model that makes it possible to quantify the impact of implementing formative feedback.
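
To make the notion of a local (per-essay) explanation concrete, the following is a minimal sketch of exact Shapley values computed by brute force for a single instance. It is not the authors' pipeline and does not use the SHAP library: the scorer `toy_score`, its three feature names, the feature values, and the baseline are all hypothetical and chosen only for illustration. Features outside a coalition are set to their baseline value, which is one common simplification.

```python
from itertools import combinations
from math import factorial


def shapley_values(predict, x, baseline):
    """Exact Shapley values for one instance x relative to a baseline input.

    Features that are absent from a coalition take their baseline value.
    The cost grows exponentially with the number of features, so this is
    only practical for a handful of features.
    """
    n = len(x)

    def value(coalition):
        # Build an input where only the coalition's features come from x.
        z = [x[i] if i in coalition else baseline[i] for i in range(n)]
        return predict(z)

    phi = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for size in range(n):
            # Shapley weight for coalitions of this size that exclude i.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = set(subset)
                phi[i] += weight * (value(s | {i}) - value(s))
    return phi


def toy_score(z):
    """Hypothetical rubric scorer with one interaction term (illustration only)."""
    lexical_diversity, cohesion, syntactic_complexity = z
    return (1.0 + 2.0 * lexical_diversity + 1.5 * cohesion
            + 0.5 * lexical_diversity * syntactic_complexity)


essay = [0.8, 0.6, 0.4]      # hypothetical feature values for one essay
baseline = [0.5, 0.5, 0.5]   # hypothetical reference (e.g. average) essay

phi = shapley_values(toy_score, essay, baseline)
for name, contribution in zip(
        ["lexical_diversity", "cohesion", "syntactic_complexity"], phi):
    print(f"{name}: {contribution:+.4f}")

# The contributions add up to the gap between this essay's predicted score
# and the baseline prediction (the efficiency property of Shapley values).
print(sum(phi), toy_score(essay) - toy_score(baseline))
```

Each value states how much one feature pushes this particular essay's predicted score above or below the baseline prediction, which is the kind of per-essay information that formative feedback could draw on. Approximate or model-specific explainers trade this exactness for speed.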

Highlights

Automated essay scoring (AES) is a compelling topic in Learning Analytics (LA), primarily because recent advances in AI treat it as a good testbed for exploring the artificial supplementation of human creativity.
Since the interpretability of a machine learning model should be prioritized over accuracy (Ribeiro et al., 2016; Murdoch et al., 2019) for reasons of transparency and trust, this paper investigated whether the impact of the depth of a multi-layer perceptron (MLP) neural network might be more visible when assessing its interpretability, that is, the trustworthiness of its corresponding SHapley Additive exPlanations (SHAP) explanation model.
This paper serves as a proof of concept of the applicability of XAI techniques in automated essay scoring, providing learning analytics practitioners and educators with a methodology for “hiring” AI markers and making them accountable to their human counterparts.

Summary

Introduction

Automated essay scoring (AES) is a compelling topic in Learning Analytics (LA), primarily because recent advances in AI treat it as a good testbed for exploring the artificial supplementation of human creativity. AES may suffer from its own set of biases (e.g., imperfections in training data, spurious correlations, overrepresented minority groups), which has incited the research community to look for ways to make AES more transparent, accountable, fair, unbiased, and trustworthy while remaining accurate. This required changing the perception that AES is merely a machine learning and feature engineering task (Madnani et al., 2017; Madnani and Cahill, 2018). Researchers have advocated that AES should be seen as a shared task requiring several methodological design decisions along the way, such as curriculum alignment, construction of training corpora, a reliable scoring process, and rater performance evaluation, where the goal is to build and deploy fair and unbiased scoring models to be used in large-scale assessments and classroom settings (Rupp, 2018; West-Smith et al., 2018; Rupp et al., 2019). Although these measures are intended to produce reliable and valid AES systems, they may still fail to build trust among users, keeping the AES black box impenetrable for teachers and students.
