Abstract

A simulation study was conducted to examine how large ability differences between groups affect two differential item functioning (DIF) detection procedures, SIBTEST and TESTGRAF. DIF items are hard to identify when group ability differences are large (Gotzmann, Vandenberghe, & Gierl, 2000; Hambleton & Rogers, 1989). Four ability differences (0.0, -1.0, -1.5, -2.0) and eight sample size combinations (500/500, 750/1000, 1000/1000, 750/1500, 1000/1500, 1500/1500, 1000/2000, 2000/2000) were manipulated, and Type I error and power rates were computed for each procedure. The SIBTEST Type I error rates were inflated at the larger ability differences. Conversely, the TESTGRAF Type I error rates remained low for most ability differences and sample sizes. The SIBTEST power rates remained high, even with larger ability differences. The TESTGRAF power rates dropped as ability differences were introduced.

The Effects of Large Ability Differences on Type I Error and Power Rates using the SIBTEST and TESTGRAF DIF Detection Procedures

Educational practitioners and test developers often find large test score differences when comparing examinees with diverse ethnic backgrounds (Berends & Koretz, 1996; Cameron, 1990; Freedle & Kostin, 1990; Scheuneman & Grima, 1997; Schmitt & Dorans, 1990). Reducing these differences is one goal of the educational reform movement (Barron & Koretz, 1996). These large test score differences are particularly noteworthy when Native and non-Native examinees are compared (Alberta Education, 1996; Gotzmann, Vandenberghe, & Gierl, 2000; Hambleton & Rogers, 1989; Vandenberghe & Gierl, 2001). Socioeconomic and cultural differences may contribute to these performance differences (Common & Frost, 1989; Hull, 1990; Trent & Gilman, 1985; Wood & Clay, 1996).
However, few researchers have studied item-level outcomes that may explain why Native examinees score lower than non-Native examinees (Gotzmann et al., 2000; Hambleton & Rogers, 1989). Native examinee scores may be biased due to factors in test development. For example, Janzen (2000) and Krywaniuk and Das (1976) found that Native children are more likely to use simultaneous processing skills and non-Native children are more likely to use successive processing skills. If exams have a small number of items that elicit simultaneous processing skills, then these exams may put Native examinees at a disadvantage. Therefore, assessment of bias at the item level, and its contribution to the total test score differences, should be studied.

Item bias can be estimated with different methods. Traditionally, item-level differences between groups have been assessed by comparing the proportion correct for each group (Lord, 1980). However, this method has one major flaw. The proportion correct method compares all examinees, regardless of ability level. Thus, the proportion correct is dependent upon the sample of examinees (see Camilli & Shepard, 1994). To overcome this problem, statistical methods can be used to determine whether differential item functioning (DIF) is present. DIF occurs when examinees from different groups have a different probability of answering the item correctly, after controlling for overall ability. In these comparisons, the majority group is called the reference group and the minority group is called the focal group. DIF methods are used to estimate bias by matching examinees on an internal measure of ability or overall test score performance and comparing these examinees at the item level. This approach removes total test score differences in the estimation process, which provides a stronger measure of the actual group differences on the item.
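The contrast between the proportion correct method and a matched comparison can be illustrated with a small numerical sketch. It is not taken from the study: the Rasch-type item response function, the unit-variance normal ability distributions, the 0.5 difficulty shift, and all function names (`p_correct`, `proportion_correct`, `matched_gap`) are assumptions chosen for illustration only. The sketch matches on true ability, whereas the procedures discussed here match on observed test performance.

```python
import math

def p_correct(theta, b=0.0, dif_shift=0.0):
    """Probability of a correct answer at ability theta for a Rasch-type item.

    dif_shift is a hypothetical extra difficulty applied to the focal group only.
    """
    return 1.0 / (1.0 + math.exp(-(theta - b - dif_shift)))

def normal_pdf(x, mean, sd=1.0):
    """Density of a normal distribution with the given mean and sd."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))

def proportion_correct(group_mean, dif_shift=0.0, lo=-8.0, hi=8.0, n=3201):
    """Expected proportion correct for a group with N(group_mean, 1) ability,
    approximated by a Riemann sum over an evenly spaced ability grid."""
    step = (hi - lo) / (n - 1)
    total = 0.0
    for i in range(n):
        theta = lo + i * step
        total += p_correct(theta, dif_shift=dif_shift) * normal_pdf(theta, group_mean) * step
    return total

def matched_gap(dif_shift, grid=(-2.0, -1.0, 0.0, 1.0, 2.0)):
    """Largest reference-minus-focal gap in P(correct) at equal ability levels."""
    return max(p_correct(t) - p_correct(t, dif_shift=dif_shift) for t in grid)

# Item with no DIF: the groups differ only in mean ability (0.0 vs. -1.0).
raw_gap_no_dif = proportion_correct(0.0) - proportion_correct(-1.0)
matched_gap_no_dif = matched_gap(dif_shift=0.0)

# Item with DIF: the focal group faces an assumed extra difficulty of 0.5.
matched_gap_dif = matched_gap(dif_shift=0.5)

print(f"raw proportion-correct gap, no-DIF item: {raw_gap_no_dif:.3f}")
print(f"matched gap, no-DIF item:                {matched_gap_no_dif:.3f}")
print(f"matched gap, DIF item:                   {matched_gap_dif:.3f}")
```

Under these assumptions the raw proportion-correct gap is sizeable even though the item functions identically for both groups, while the matched gap is zero for that item and nonzero only for the item with a built-in difficulty shift.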
There are many statistical procedures to estimate DIF, including Item Response Theory (IRT) area measures (Lord, 1980; Thissen, Steinberg, & Wainer, 1988), Mantel-Haenszel (Holland & Thayer, 1988), Logistic Regression (Swaminathan & Rogers, 1990), Simultaneous Item Bias Test (SIBTEST; Shealy & Stout, 1993), and TESTGRAF (Ramsay, 1991, 2000). Most of these procedures have been used to identify DIF between ethnic groups. The SIBTEST and TESTGRAF procedures, in particular, may be suitable when large ability differences are found. Further, both of these DIF detection procedures can be used with small sample sizes and both yield comparable DIF measures (Ramsay, 1991, 2000; Shealy & Stout, 1993). However, these procedures also have a noteworthy difference. SIBTEST uses a regression correction to estimate
