Abstract

Threats to construct validity should be reduced to a minimum. To that end, sources of bias, namely raters, items, and tests, as well as gender, age, race, language background, culture, and socio-economic status, need to be identified and removed. This study investigates raters' experience, raters' language background, and the choice of essay prompt as potential sources of bias. Eight raters, four native English speakers and four Persian L1 speakers of English as a Foreign Language (EFL), scored 40 essays on one general and one field-specific topic. The raters assessed these essays using the Test of English as a Foreign Language (TOEFL) holistic scale and the International English Language Testing System (IELTS) analytic band scores. Multifaceted Rasch Measurement (MFRM) was run to detect any biases. Although no statistically significant biases were found, several interesting results emerged illustrating the influence of construct-irrelevant factors such as raters' experience, L1, and educational background. Further research is warranted to investigate these factors as potential sources of rater bias.

Highlights

  • Finding sources of bias in language tests has perhaps been one of the most critical issues so far under scrutiny by researchers, testers, and teachers around the globe

  • This research aims to lessen or remove biases stemming from these issues, using Multifaceted Rasch Measurement (MFRM) to detect any interaction between rater main effects and essay prompt

  • Considering the importance of further research into such factors as raters' experience, language background, and the effect of essay prompt in writing assessment, we attempt to answer the following research questions: (1) Can raters' experience be a cause of bias in assessing essays holistically and analytically? (2) Are English-native and L1-Persian raters severe or lenient in writing assessment using analytic versus holistic rating scales? (3) Does the essay prompt introduce bias into raters' assessment of writing?


Introduction

Finding sources of bias in language tests has perhaps been one of the most critical issues so far under scrutiny by researchers, testers, and teachers around the globe. Working on the Test of German as a Foreign Language (TestDaF), Eckes (2005) examined rater severity and bias/interaction among raters, examinees, rating criteria, and gender. The ratings of 18 raters were subjected to MFRM, through which criterion-related bias measures were estimated. He concluded that "criteria perceived as highly important were more closely associated with severe ratings, and criteria perceived as less important were more closely associated with lenient ratings". In the present study, raters were considered experienced if they met the following criteria: holding a master's degree in arts or education, working as graduate or English as a Second Language (ESL) instructors, having taught and rated ESL writing for a minimum of 5 years, having undergone special assessment training, and considering themselves competent or expert raters. Considering the importance of further research into such factors as raters' experience, language background, and the effect of essay prompt in writing assessment, we attempt to answer the following research questions:

Research Question 1: Can raters' experience be a cause of bias in assessing essays holistically and analytically?

Research Question 2: Are English-native and L1-Persian raters severe or lenient in writing assessment using analytic versus holistic rating scales?

Research Question 3: Does the essay prompt introduce bias into raters' assessment of writing?
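To make the MFRM logic concrete, the sketch below implements a rating-scale many-facet Rasch model from scratch: the log-odds of adjacent score categories are modeled as examinee ability minus item difficulty, rater severity, and a category threshold. All parameter values are illustrative assumptions, not estimates from this study or from Eckes (2005); dedicated software such as FACETS would estimate these facets from real rating data.

```python
import math

def category_probabilities(theta, beta_item, alpha_rater, taus):
    """Rating-scale many-facet Rasch model: return the probability of
    each score category given examinee ability (theta), item difficulty
    (beta_item), rater severity (alpha_rater), and the ordered category
    thresholds (taus). A severity of 0 is an average rater; positive
    values mean a harsher rater."""
    # Build cumulative logits: category k's logit is the sum of
    # (theta - beta - alpha - tau_j) over thresholds j = 1..k.
    logits = [0.0]
    running = 0.0
    for tau in taus:
        running += theta - beta_item - alpha_rater - tau
        logits.append(running)
    exps = [math.exp(v) for v in logits]
    denom = sum(exps)
    return [e / denom for e in exps]

# Illustrative comparison: a severe rater (alpha = +0.5) versus a
# lenient one (alpha = -0.5) scoring the same average examinee on a
# 4-category scale. The severe rater shifts probability mass toward
# the lower score categories.
taus = [-1.0, 0.0, 1.0]
severe = category_probabilities(0.0, 0.0, 0.5, taus)
lenient = category_probabilities(0.0, 0.0, -0.5, taus)
print([round(p, 3) for p in severe])
print([round(p, 3) for p in lenient])
```

In a full MFRM bias analysis, an additional interaction term (e.g. rater-by-prompt) would be added to the logit and tested for significance; this sketch only shows the main-effect structure.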

Participants
Procedure
Design
Discussion and Conclusion
