Abstract

Differential Item Functioning (DIF) analysis is an indispensable methodology for detecting item and test bias in language testing. This study investigated grade-related DIF in the listening section of the General English Proficiency Test-Kids (GEPT-Kids). Quantitative data were test scores collected from 791 test takers (Grade 5 = 398; Grade 6 = 393) in eight Chinese-speaking cities, and qualitative data were expert judgments collected from two primary school English teachers in Guangdong province. Two R packages, “difR” and “difNLR”, were used to perform five types of DIF analysis on the test scores: two-parameter item response theory (2PL IRT) based Lord’s chi-square and Raju’s area tests, and the Mantel-Haenszel (MH), logistic regression (LR), and nonlinear regression (NLR) DIF methods. Together these analyses identified 16 DIF items. The ShinyItemAnalysis package was then used in RStudio to draw item characteristic curves (ICCs) for the 16 items, which revealed four distinct patterns of DIF effect. In addition, the two experts identified plausible reasons or sources for the DIF effect of four items. The study may therefore shed light on the sustainable development of test fairness in language testing: methodologically, it adopted a mixed-methods sequential explanatory design that can guide further test fairness research in choosing methods flexibly to fit research purposes; practically, the results indicate that a DIF flag does not necessarily imply bias. Rather, it serves as an alarm that calls test developers’ attention to a further examination of the appropriateness of the flagged items.
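To make the analysis pipeline concrete, a minimal R sketch of the five DIF methods is given below. This is not the authors’ actual script: the response matrix `listening` (persons by items, scored 0/1) and the grouping vector `grade` are hypothetical placeholders for the GEPT-Kids data.

    library(difR)   # Lord, Raju, Mantel-Haenszel, and logistic regression methods
    library(difNLR) # nonlinear regression DIF method

    # `listening`: hypothetical persons-by-items 0/1 response matrix
    # `grade`: hypothetical group vector (0 = Grade 5 reference, 1 = Grade 6 focal)

    # 2PL-IRT-based tests
    lord <- difLord(Data = listening, group = grade, focal.name = 1, model = "2PL")
    raju <- difRaju(Data = listening, group = grade, focal.name = 1, model = "2PL")

    # Observed-score tests
    mh <- difMH(Data = listening, group = grade, focal.name = 1)
    lr <- difLogistic(Data = listening, group = grade, focal.name = 1, type = "both")

    # Nonlinear regression DIF, testing all item parameters
    nlr <- difNLR(Data = listening, group = grade, focal.name = 1,
                  model = "2PL", type = "all")

    # Indices of the items each method flags, e.g.
    lord$DIFitems

Since each method rests on different assumptions, agreement across several of them strengthens a DIF flag; this is in line with the study’s use of multiple detection procedures.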

Highlights

  • It is self-evident that language tests should be fair to all test takers, rather than favoring or disfavoring any group of test takers because of construct-irrelevant factors such as gender, age, or native language

  • Against this background, this study aims to detect grade-related Differential Item Functioning (DIF) in a test for children, the GEPT-Kids, and to identify potential reasons for such DIF

  • While this study shows that test items with certain characteristics tend to favor a particular group of test takers, it does not in itself explain why such a relationship exists or why some items with the same characteristics do not favor any group


Introduction

It is self-evident that language tests should be fair to all test takers, rather than favoring or disfavoring any group of test takers because of construct-irrelevant factors such as gender, age, or native language. Empirical studies have used DIF analysis to detect problematic test items, providing evidence for test quality and fairness. However, grade-related DIF, which reflects differences in years of English education, has been under-researched in tests for young children. Since higher grade students tend to be more cognitively developed and have received extra years of English education, it is speculated that they are likely to be favored by a test even when the overall ability of the higher and lower grade students is controlled for (i.e., they are more likely to answer an item correctly than lower grade students of the same overall ability). Grade therefore cannot be neglected: it may influence test takers’ performance and, in turn, challenge the fairness and validity of the assessment.
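This matched-ability comparison is exactly what group-specific item characteristic curves make visible: for a DIF item, the Grade 5 and Grade 6 curves diverge even at the same overall ability level. As a sketch building on the hypothetical `lord` fit shown after the Abstract, the ShinyItemAnalysis helper plotDIFirt() can draw such per-group ICCs (the item index here is an arbitrary placeholder, not one of the 16 flagged items):

    library(ShinyItemAnalysis)

    # `lord$itemParInit` stores the 2PL item parameter estimates for the
    # reference (Grade 5) and focal (Grade 6) groups; plotDIFirt() draws one
    # ICC per group, so a gap between the curves at a fixed ability level
    # displays the DIF effect for that item.
    plotDIFirt(parameters = lord$itemParInit, test = "Lord", item = 5)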
