Abstract

Second Language (L2) testing has increasingly relied on performance assessment to evaluate test takers' practical command of the language they have acquired. However, such forms of assessment entail more complex task design and subjective human scoring judgments (Bachman, 2004), raising challenges for score dependability and score use due to variability associated with task design (Deville & Chalhoub-Deville, 2006; In’nami & Koizumi, 2016), differences in rater behavior (Bachman, Lynch, & Mason, 1995), and rating rubric functionality, especially when the rubric consists of multiple subscales (Grabowski & Lin, 2019; Sawaki, 2007; Xi, 2007). The current study illustrates the use of Multivariate Generalizability Theory (MG-Theory) analyses for examining score variability and dependability for a written performance assessment on an ESL placement test, rated with an analytic rubric comprising three subscales. In particular, the study identified task-related variability that reduced the dependability of the writing scores yielded by this test. At the same time, this variability can be substantively justified as an artifact of representing the construct of L2 writing ability in a sufficiently broad manner. Simply put, should we expect test takers to show equivalent levels of proficiency when writing a review of an experience as a customer and when writing an argumentative essay as a student?
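
For reference, "dependability" here follows the usual G-theory index. In a univariate design with persons (p) crossed with tasks (t) and raters (r), the dependability coefficient Phi is the ratio of universe-score variance to universe-score plus absolute-error variance; MG-Theory extends this by additionally estimating covariance components across the rubric subscales. A sketch of the standard formula (not a derivation specific to this study), with n'_t tasks and n'_r raters in the decision study, is:

    \Phi = \frac{\sigma^2_p}{\sigma^2_p + \sigma^2_\Delta},
    \qquad
    \sigma^2_\Delta = \frac{\sigma^2_t}{n'_t} + \frac{\sigma^2_r}{n'_r}
                    + \frac{\sigma^2_{pt}}{n'_t} + \frac{\sigma^2_{pr}}{n'_r}
                    + \frac{\sigma^2_{tr}}{n'_t n'_r} + \frac{\sigma^2_{ptr,e}}{n'_t n'_r}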

Highlights

  • Second Language (L2) testing has increasingly relied on performance assessment to evaluate “practical command of language acquired” (McNamara, 1996, as cited in Grabowski & Lin, 2019, p. 54)

  • These challenges are more acute for less-resourced institutions using L2 performance assessments for making score-based interpretations about test takers, such as a language program making decisions based on placement test scores (Bachman et al., 1996; Sawaki & Xi, 2019; Vafaee & Yaghmaeyan, 2020)

  • The Language Control scale showed the least pronounced mean difference across tasks (M = 3.05, SD = 0.98 for Task 1; M = 2.78, SD = 1.07 for Task 2), but an independent-samples t-test still revealed a statistically significant difference, albeit not quite as pronounced as for the other subscales, M = 0.27, 95% CI [0.07, 0.47], t(414) = 2.684, p = .008 (a computational sketch of this comparison follows these highlights). These findings indicate that the Content Control scale was interpreted more leniently on both tasks than the Organization Control and Language Control scales
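
The reported test statistic can be reconstructed from the published summary statistics alone. The sketch below is a minimal reproduction, assuming 208 responses per task (inferred from the reported degrees of freedom, t(414), under equal group sizes); that sample size is an assumption, not a figure stated in the highlight.

    # Minimal sketch reproducing the reported independent-samples t-test for the
    # Language Control scale from summary statistics. The per-task n of 208 is an
    # assumption inferred from the reported df: t(414) implies n1 + n2 - 2 = 414.
    from scipy import stats

    m1, sd1, n1 = 3.05, 0.98, 208   # Language Control, Task 1
    m2, sd2, n2 = 2.78, 1.07, 208   # Language Control, Task 2

    # Student's t-test computed directly from the summary statistics
    t_stat, p_val = stats.ttest_ind_from_stats(m1, sd1, n1, m2, sd2, n2, equal_var=True)

    # 95% confidence interval for the mean difference via the pooled standard error
    df = n1 + n2 - 2
    sp = (((n1 - 1) * sd1**2 + (n2 - 1) * sd2**2) / df) ** 0.5
    se = sp * (1 / n1 + 1 / n2) ** 0.5
    diff = m1 - m2
    margin = stats.t.ppf(0.975, df) * se

    print(f"t({df}) = {t_stat:.3f}, p = {p_val:.3f}")
    print(f"Mean difference = {diff:.2f}, 95% CI [{diff - margin:.2f}, {diff + margin:.2f}]")
    # Expected output, approximately: t(414) = 2.684, p = .008, 95% CI [0.07, 0.47]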


Introduction

Second Language (L2) testing has increasingly relied on performance assessment (e.g., a written essay, a spoken monologue) to evaluate “practical command of language acquired” (McNamara, 1996, as cited in Grabowski & Lin, 2019, p. 54). Agencies that deliver high-stakes L2 proficiency exams (e.g., Educational Testing Service, ETS) have pursued a research agenda for years to examine the role of rater, task, and rubric as sources of variability in their performance assessments (Lee, 2006; Sawaki & Sinharay, 2013; Xi, 2007; Xi & Mollaun, 2006). These challenges are more acute for less-resourced institutions using L2 performance assessments to make score-based interpretations about test takers, such as a language program making decisions based on placement test scores (Bachman et al., 1996; Sawaki & Xi, 2019; Vafaee & Yaghmaeyan, 2020). The organization that administers the placement exam examined in this study, the Community Language Program (CLP), operates within Teachers College, Columbia University, in conjunction with its programs in Applied Linguistics and Teaching English to Speakers of Other Languages (TESOL), and uses scores from this exam to assign test takers to an English as a Second Language (ESL) class at an appropriate proficiency level (i.e., beginner, intermediate, advanced).
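
To make the notion of sources of variability concrete, the sketch below estimates variance components for a fully crossed persons × tasks × raters (p × t × r) design from a balanced score matrix, using the standard ANOVA expected-mean-squares approach, and then computes the dependability coefficient Phi shown earlier. This is a generic univariate illustration on simulated data, not the study's multivariate analysis (which additionally models covariances among the three rubric subscales); the array sizes, variance magnitudes, and decision-study settings are assumptions for the example.

    # Univariate G-study sketch for a balanced persons x tasks x raters design.
    # The data are simulated; sizes and variance magnitudes are assumptions.
    import numpy as np

    rng = np.random.default_rng(0)
    n_p, n_t, n_r = 200, 2, 2          # persons, tasks, raters (assumed sizes)

    # Simulate scores X[p, t, r] = mu + person + task + rater + interactions + error
    person = rng.normal(0, 0.9, (n_p, 1, 1))
    task   = rng.normal(0, 0.3, (1, n_t, 1))
    rater  = rng.normal(0, 0.2, (1, 1, n_r))
    pt     = rng.normal(0, 0.4, (n_p, n_t, 1))
    pr     = rng.normal(0, 0.2, (n_p, 1, n_r))
    tr     = rng.normal(0, 0.1, (1, n_t, n_r))
    err    = rng.normal(0, 0.5, (n_p, n_t, n_r))
    X = 3.0 + person + task + rater + pt + pr + tr + err

    m = X.mean()
    m_p, m_t, m_r = X.mean(axis=(1, 2)), X.mean(axis=(0, 2)), X.mean(axis=(0, 1))
    m_pt, m_pr, m_tr = X.mean(axis=2), X.mean(axis=1), X.mean(axis=0)

    # Mean squares for each effect (balanced-design ANOVA)
    ms_p  = n_t * n_r * np.sum((m_p - m) ** 2) / (n_p - 1)
    ms_t  = n_p * n_r * np.sum((m_t - m) ** 2) / (n_t - 1)
    ms_r  = n_p * n_t * np.sum((m_r - m) ** 2) / (n_r - 1)
    ms_pt = n_r * np.sum((m_pt - m_p[:, None] - m_t[None, :] + m) ** 2) / ((n_p - 1) * (n_t - 1))
    ms_pr = n_t * np.sum((m_pr - m_p[:, None] - m_r[None, :] + m) ** 2) / ((n_p - 1) * (n_r - 1))
    ms_tr = n_p * np.sum((m_tr - m_t[:, None] - m_r[None, :] + m) ** 2) / ((n_t - 1) * (n_r - 1))
    resid = (X - m_pt[:, :, None] - m_pr[:, None, :] - m_tr[None, :, :]
             + m_p[:, None, None] + m_t[None, :, None] + m_r[None, None, :] - m)
    ms_e  = np.sum(resid ** 2) / ((n_p - 1) * (n_t - 1) * (n_r - 1))

    # Variance components via expected mean squares (negative estimates set to 0)
    v_ptr = ms_e
    v_pt  = max((ms_pt - ms_e) / n_r, 0)
    v_pr  = max((ms_pr - ms_e) / n_t, 0)
    v_tr  = max((ms_tr - ms_e) / n_p, 0)
    v_p   = max((ms_p - ms_pt - ms_pr + ms_e) / (n_t * n_r), 0)
    v_t   = max((ms_t - ms_pt - ms_tr + ms_e) / (n_p * n_r), 0)
    v_r   = max((ms_r - ms_pr - ms_tr + ms_e) / (n_p * n_t), 0)

    # D-study: dependability (Phi) for decisions based on 2 tasks and 2 raters
    nt_d, nr_d = 2, 2
    abs_error = (v_t / nt_d + v_r / nr_d + v_pt / nt_d + v_pr / nr_d
                 + v_tr / (nt_d * nr_d) + v_ptr / (nt_d * nr_d))
    phi = v_p / (v_p + abs_error)
    print({"p": v_p, "t": v_t, "r": v_r, "pt": v_pt, "pr": v_pr, "tr": v_tr, "ptr,e": v_ptr})
    print(f"Dependability (Phi) with {nt_d} tasks and {nr_d} raters: {phi:.2f}")

In this framework, a large task (t) or person-by-task (pt) component relative to the person (p) component is what lowers Phi; that is the pattern the abstract describes as task-related variability reducing score dependability.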

