The number of tests being translated and adapted from 1 language and culture to others is increasing substantially. One shortcoming in current methodology for identifying flawed items due to the test translation-adaptation process is the failure to carry out empirical analyses. One important reason for not conducting empirical studies is the view that large examinee samples are required that are often not available in translation-adaptation studies. The purpose of this article was to investigate 2 simple procedures for detecting potentially flawed items with small samples: (a) conditional item p value comparisons, and (b) delta plots. Several factors were varied in this computer simulation study: sample sizes and ability distributions of the reference and focal groups, amount of differential item functioning (DIF), and the statistical characteristics of the items where DIF was found. The findings showed that the 2 simple graphical-descriptive procedures can be valuable in identifying flawed test items, especially when the size of the flaws is substantial. An application of both procedures to actual test data also supported their utility. Although this study was stimulated by questions that have arisen in the context of language translations of tests, the procedures for identifying potentially flawed items are equally applicable for identifying other potential sources of bias in the test items such as gender and race.
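As an illustration of the second procedure, the following is a minimal sketch of a delta plot analysis in its commonly described form; it is not taken from the article. It assumes the usual delta transformation of item p values (13 + 4z), a major-axis line fitted through the reference-group and focal-group deltas, and a flagging threshold of 1.5 delta units on the perpendicular distance. The function names, the threshold, and the example p values are hypothetical choices made for this sketch.

```python
from math import sqrt
from statistics import NormalDist, mean


def delta(p):
    """Delta value of an item: 13 + 4z, where z is the normal deviate
    with proportion p above it. Harder items get larger deltas.
    Requires 0 < p < 1."""
    return 13.0 + 4.0 * NormalDist().inv_cdf(1.0 - p)


def delta_plot_distances(p_ref, p_foc):
    """Signed perpendicular distance of each item from the major axis
    of the (reference delta, focal delta) scatter."""
    x = [delta(p) for p in p_ref]
    y = [delta(p) for p in p_foc]
    n = len(x)
    mx, my = mean(x), mean(y)
    sxx = sum((xi - mx) ** 2 for xi in x) / (n - 1)
    syy = sum((yi - my) ** 2 for yi in y) / (n - 1)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / (n - 1)
    # Major-axis (principal-axis) slope and intercept.
    # Undefined when the covariance sxy is zero.
    b = (syy - sxx + sqrt((syy - sxx) ** 2 + 4.0 * sxy ** 2)) / (2.0 * sxy)
    a = my - b * mx
    return [(b * xi + a - yi) / sqrt(b * b + 1.0) for xi, yi in zip(x, y)]


def flag_items(p_ref, p_foc, threshold=1.5):
    """Indices of items whose absolute distance exceeds the threshold.
    The default of 1.5 is an assumed convention, not a value from the article."""
    distances = delta_plot_distances(p_ref, p_foc)
    return [i for i, d in enumerate(distances) if abs(d) > threshold]


# Hypothetical p values: item 3 is much harder for the focal group.
p_ref = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3]
p_foc = [0.9, 0.8, 0.7, 0.2, 0.5, 0.4, 0.3]
flagged = flag_items(p_ref, p_foc)
```

Because a single outlying item also pulls the fitted line toward itself, a stricter or looser threshold may suit very short tests; the value used here is only a starting point.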