Abstract


 
 
The main objective of this study was to examine whether a Rater Identity Development (RID) program would increase interrater reliability and improve the calibration of scores against benchmarks in the assessment of second/foreign language (L2) English oral proficiency. Eleven primary school teachers-as-raters participated. A pretest–intervention (RID)–posttest design was employed, and the data comprised 220 assessments of student performances. Two types of rater-reliability analyses were conducted: first, estimates of the intraclass correlation coefficient (two-way random effects model), to indicate the extent to which raters were consistent in their rankings, and second, a many-facet Rasch measurement analysis, conducted with FACETS®, to explore systematic differences in rater severity/leniency. Results showed improvement in consistency, presumably as a result of the training; at the same time, differences in severity became greater. The results suggest that future rater training may draw on central components of RID, such as core concepts in language assessment, individual feedback, and social moderation work.
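The first reliability analysis named above, an intraclass correlation coefficient under a two-way random effects model, can be sketched as follows. This is a minimal illustration using the standard Shrout and Fleiss (1979) ANOVA formulas for ICC(2,1), not the authors' actual analysis; the rating matrix and the function name `icc2_1` are invented for demonstration.

```python
import numpy as np

def icc2_1(X):
    """ICC(2,1): single-rater, two-way random effects, absolute agreement
    (Shrout & Fleiss, 1979). X is an (n targets x k raters) score matrix."""
    X = np.asarray(X, dtype=float)
    n, k = X.shape
    grand = X.mean()
    # Sums of squares from the two-way ANOVA decomposition.
    ss_rows = k * np.sum((X.mean(axis=1) - grand) ** 2)   # between targets
    ss_cols = n * np.sum((X.mean(axis=0) - grand) ** 2)   # between raters
    ss_err = np.sum((X - grand) ** 2) - ss_rows - ss_cols # residual
    # Corresponding mean squares.
    ms_rows = ss_rows / (n - 1)
    ms_cols = ss_cols / (k - 1)
    ms_err = ss_err / ((n - 1) * (k - 1))
    return (ms_rows - ms_err) / (
        ms_rows + (k - 1) * ms_err + k * (ms_cols - ms_err) / n
    )

# Hypothetical ratings: 5 performances scored by 3 raters on a 1-6 scale.
scores = [[2, 3, 2],
          [4, 4, 5],
          [3, 3, 3],
          [6, 5, 6],
          [1, 2, 1]]
print(round(icc2_1(scores), 3))
```

Because the two-way random effects model treats the raters as a random sample from a larger population of raters, the resulting coefficient speaks to how consistently an arbitrary teacher-rater would rank the performances, which matches the study's interest in rank consistency before and after training.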
 
 

Highlights

  • In this exploratory study, we target equity in the assessment of second language (L2) English oral proficiency.

  • Research targeting the assessment of young learners is much needed. This study addresses this under-researched area of L2 oral proficiency assessment, focusing on whether interrater reliability among teachers, as examiners of a high-stakes speaking assessment for young L2 English learners, can improve with a research-based training program.

  • Individual feedback was a central part of the rater training program, and our results indicate that it had an effect: raters' assessment practices changed.



Introduction

We target equity in the assessment of second language English oral proficiency. To achieve equity in the assessment of a complex ability such as speaking in a second or foreign language (L2), it is crucial that raters are clear about the test construct and grade criteria, and about the interpretation of scores. In contexts like Sweden, where this research was conducted and where oral assessments play a vital role in end-of-year report cards, low reliability can have direct and detrimental effects: the same level of oral proficiency displayed in a test may be scored differently by different scorers. If the rater is the test-takers' own teacher, as is the case in the test used for the present study, the situation possibly becomes even more challenging (McNamara, 2001; Sundqvist, Wikström, Sandlund, & Nyroos, 2018) and the evidence for effective methods even more scarce.

