On the Complex Sources of Differential Item Functioning: A Comparison of Three Methods.

Abstract

Differential item functioning (DIF) has been a long-standing problem in educational and psychological measurement. In practice, the source from which DIF originates can be complex, in the sense that an item can show DIF on multiple background variables of different types simultaneously. Although a variety of non-item-response-theory (IRT)-based and IRT-based DIF detection methods have been introduced, they do not sufficiently address DIF evaluation when its source is complex. The recently proposed least absolute shrinkage and selection operator (LASSO) regularization method has shown promising results in detecting DIF on multiple background variables. To provide more insight, this study compared three DIF detection methods, the non-IRT-based logistic regression (LR), the IRT-based likelihood ratio test (LRT), and LASSO regularization, through a comprehensive simulation and an empirical data analysis. We found that when multiple background variables were considered, the Type I error and power rates of the three methods for identifying DIF on one of the variables depended not only on the sample size and the DIF magnitude on that variable but also on the DIF magnitude of the other background variable and the correlation between the two. Additional findings, limitations, and future research directions are also discussed.
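To make the LR approach concrete: it tests for uniform DIF by comparing two nested logistic models for a dichotomous item response, one conditioning only on a matching variable and one adding group membership. The sketch below is an illustration on synthetic data under our own assumptions (latent ability used directly as the matching score, one logit of simulated DIF), not the authors' implementation.

```python
import numpy as np

def fit_logistic(X, y, n_iter=25):
    """Fit a logistic regression by Newton-Raphson; return (coefficients, log-likelihood)."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-X @ beta))
        w = p * (1.0 - p)
        # Newton step: beta += (X' W X)^-1 X' (y - p); tiny ridge for stability
        h = X.T @ (w[:, None] * X) + 1e-8 * np.eye(X.shape[1])
        beta += np.linalg.solve(h, X.T @ (y - p))
    p = 1.0 / (1.0 + np.exp(-X @ beta))
    loglik = np.sum(y * np.log(p + 1e-12) + (1 - y) * np.log(1 - p + 1e-12))
    return beta, loglik

def lr_uniform_dif(item, matching, group):
    """Likelihood-ratio test for uniform DIF: does group membership improve
    prediction of the item response beyond the matching variable?"""
    n = len(item)
    x_null = np.column_stack([np.ones(n), matching])         # matching only
    x_full = np.column_stack([np.ones(n), matching, group])  # + group term
    _, ll_null = fit_logistic(x_null, item)
    _, ll_full = fit_logistic(x_full, item)
    return 2.0 * (ll_full - ll_null)  # ~ chi-square(1) under no uniform DIF

# Synthetic demo: one item is a full logit harder for the focal group.
rng = np.random.default_rng(0)
n = 4000
group = rng.integers(0, 2, n).astype(float)   # 0 = reference, 1 = focal
theta = rng.normal(0.0, 1.0, n)               # latent ability (proxy matching score)
dif_item = (rng.random(n) < 1 / (1 + np.exp(-(theta - 1.0 * group)))).astype(float)
clean_item = (rng.random(n) < 1 / (1 + np.exp(-theta))).astype(float)

g2_dif = lr_uniform_dif(dif_item, theta, group)
g2_clean = lr_uniform_dif(clean_item, theta, group)
```

In practice the matching variable is an observed total or rest score rather than the latent ability, and a second nested comparison with an ability-by-group interaction term screens for nonuniform DIF.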

Similar Papers
  • Research Article
  • Cited by 9
  • 10.1080/15434303.2021.1963253
A Revisit of Zumbo’s Third Generation DIF: How Are We Doing in Language Testing?
  • Oct 2, 2021
  • Language Assessment Quarterly
  • Hongli Li + 2 more

The purpose of this study is to review the status of differential item functioning (DIF) research in language testing, particularly as it relates to the investigation of sources (or causes) of DIF, which is a defining characteristic of the third generation DIF. This review included 110 DIF studies of language tests dated from 1985 to 2019. We found that DIF researchers did not address sources of DIF more frequently in recent years than in earlier years. Nevertheless, DIF research in language testing has expanded with new DIF analysis procedures, more grouping variables, and more diversified methods for investigating sources of DIF. In addition, in the early years of DIF research, methods to identify sources of DIF relied heavily on content analysis. This review showed that while more sophisticated statistical procedures have been adopted in recent years to address sources of DIF, understanding sources of DIF still remains a challenging task. We also discuss the pros and cons of existing methods to detect sources of DIF and implications for future investigations.

  • Research Article
  • 10.38151/akef.2023.72
An Analysis of the DIF Sources of ABIDE Mathematics Self-Efficacy Scale by means of a Latent Class Approach
  • Sep 30, 2023
  • Ahmet Keleşoğlu Eğitim Fakültesi Dergisi
  • Fatih Elkonca + 1 more

This study aims to identify the sources of Differential Item Functioning (DIF) using the Mixture Ordinal Logistic Regression (Mixture OLR) method, a contemporary approach for detecting DIF. To analyze mathematics self-efficacy, data from a scale comprising 9 items were obtained from 5000 8th-grade students as part of the ABIDE-2016 project. The study compared the presence and extent of DIF by gender using two methods and examined the sources of DIF for items displaying DIF with Mixture OLR. The OLR analysis revealed that five items exhibited DIF at level A, but no DIF was observed with Mixture OLR. Furthermore, it was found that the magnitude of DIF (B) for an item showing DIF at level A changed due to Mixture OLR. The results indicate that the homogeneity of the data affects both the number of items displaying DIF and the magnitude of DIF. Three items did not exhibit significant DIF according to both methods. One significant finding in the study highlights the moderating effect of latent class on item 8, where DIF was observed. However, the source of DIF was not related to gender but rather stemmed from different ecological variables. An analysis of latent class characteristics revealed that students with significant DIF effects had lower absenteeism and fewer siblings. Additionally, students in this class had greater access to books at home and participated in more out-of-school mathematics courses. Surprisingly, these students were found to engage less in social activities. Various factors can influence how students respond to test items, potentially leading to DIF. These factors may include cultural background, gender, social environment, school, teacher, family interest/attitude toward the child, and home climate. Therefore, when developing and administering tests, it is crucial to test for data homogeneity and consider the impact of these variables, in addition to gender, to identify any sources of DIF in test items.

  • Research Article
  • Cited by 113
  • 10.1111/j.1745-3984.2001.tb01121.x
Identifying Sources of Differential Item and Bundle Functioning on Translated Achievement Tests: A Confirmatory Analysis
  • Jun 1, 2001
  • Journal of Educational Measurement
  • Mark J Gierl + 1 more

Increasingly, tests are being translated and adapted into different languages. Differential item functioning (DIF) analyses are often used to identify non‐equivalent items across language groups. However, few studies have focused on understanding why some translated items produce DIF. The purpose of the current study is to identify sources of differential item and bundle functioning on translated achievement tests using substantive and statistical analyses. A substantive analysis of existing DIF items was conducted by an 11‐member committee of testing specialists. In their review, four sources of translation DIF were identified. Two certified translators used these four sources to categorize a new set of DIF items from Grade 6 and 9 Mathematics and Social Studies Achievement Tests. Each item was associated with a specific source of translation DIF and each item was anticipated to favor a specific group of examinees. Then, a statistical analysis was conducted on the items in each category using SIBTEST. The translators sorted the mathematics DIF items into three sources, and they correctly predicted the group that would be favored for seven of the eight items or bundles of items across two grade levels. The translators sorted the social studies DIF items into four sources, and they correctly predicted the group that would be favored for eight of the 13 items or bundles of items across two grade levels. The majority of items in mathematics and social studies were associated with differences in the words, expressions, or sentence structure of items that are not inherent to the language and/or culture. By combining substantive and statistical DIF analyses, researchers can study the sources of DIF and create a body of confirmed DIF hypotheses that may be used to develop guidelines and test construction principles for reducing DIF on translated tests.

  • Research Article
  • 10.1177/00131644241279882
Enhancing Precision in Predicting Magnitude of Differential Item Functioning: An M-DIF Pretrained Model Approach.
  • Oct 1, 2024
  • Educational and psychological measurement
  • Shan Huang + 1 more

Despite numerous studies on the magnitude of differential item functioning (DIF), different DIF detection methods often define effect sizes inconsistently and fail to adequately account for testing conditions. To address these limitations, this study introduces the unified M-DIF model, which defines the magnitude of DIF as the difference in item difficulty parameters between reference and focal groups. The M-DIF model can incorporate various DIF detection methods and test conditions to form a quantitative model. The pretrained approach was employed to leverage a sufficiently representative large sample as the training set and ensure the model's generalizability. Once the pretrained model is constructed, it can be directly applied to new data. Specifically, a training dataset comprising 144 combinations of test conditions and 144,000 potential DIF items, each equipped with 29 statistical metrics, was used. We adopt the XGBoost method for modeling. Results show that, based on root mean square error (RMSE) and BIAS metrics, the M-DIF model outperforms the baseline model in both validation sets: under consistent and inconsistent test conditions. Across all 360 combinations of test conditions (144 consistent and 216 inconsistent with the training set), the M-DIF model demonstrates lower RMSE in 357 cases (99.2%), illustrating its robustness. Finally, we provided an empirical example to showcase the practical feasibility of implementing the M-DIF model.

  • Research Article
  • Cited by 41
  • 10.1207/s15324818ame1904_4
Examining Sources of Gender DIF in Mathematics Assessments Using a Confirmatory Multidimensional Model Approach
  • Oct 1, 2006
  • Applied Measurement in Education
  • Sharon Mendes-Barnett + 1 more

This study contributes to understanding sources of gender differential item functioning (DIF) on mathematics tests. This study focused on identifying sources of DIF and differential bundle functioning for boys and girls on the British Columbia Principles of Mathematics Exam (Grade 12) using a confirmatory SIBTEST approach based on a multidimensional model. Problem solving as a content area was confirmed as a source of gender DIF in favor of boys when the item is presented in the form of a story problem or when the problems are noncontext specific. Patterns and relations content areas produced a mixture of confirmed sources of DIF, with some subtopics favoring the girls and some favoring the boys. In contrast to what might be expected given the findings of previous gender DIF research, this study did not find geometry to be a source of gender DIF. All of the higher cognitive level items favored boys. High levels of DIF were detected in favor of girls on the bundle of computation items in which no equations were provided in the question.

  • Research Article
  • Cited by 24
  • 10.1186/s12874-019-0828-3
Explaining differential item functioning focusing on the crucial role of external information – an example from the measurement of adolescent mental health
  • Sep 5, 2019
  • BMC Medical Research Methodology
  • Curt Hagquist

Background: An overarching objective in research comparing different sample groups is to ensure that the reported differences in outcomes are not affected by differences between groups in the functioning of the measurement instruments, i.e., the items have to work in the same way for the different sample groups to be compared. Lack of invariance across sample groups is commonly called differential item functioning (DIF). There is a sense in which the DIF of an item can be taken account of by resolving (splitting) the item into group-specific items, rather than deleting the item. Resolving improves fit, retains the reliability and content provided by the item, and compensates for the DIF in estimation of person parameters on the scale of the instrument. However, it destroys invariance of the item’s parameter value among the groups. Whether or not a DIF item should be resolved depends on whether the source of the DIF is relevant or irrelevant for the content of the variable. The present paper shows how external information can be used to investigate whether the gender DIF found in the item “Stomach ache” in a psychosomatic symptoms scale used among adolescents may reflect abdominal pain because of a biological factor, the girls’ menstrual periods.
Methods: Swedish data from the international Health Behaviour in School-aged Children (HBSC) study collected in 2005/06, 2009/10 and 2013/14 were used, comprising a total of 18,983 students in grades 5, 7 and 9. A composite measure of eight items of psychosomatic problems was analysed for DIF with respect to gender and menstrual periods using the Rasch model.
Results: The results support the hypothesis that the source of the gender DIF for the item “Stomach ache” is a gender-specific biological factor. In that case the DIF should be resolved if the psychosomatic measure is not intended to tap information about abdominal pain caused by a gender-specific biological factor. In contrast, if the measure is intended to tap such information, the DIF should not be resolved.
Conclusions: The conceptualisation of the measure governs whether the item showing DIF should be resolved or not.

  • Dissertation
  • 10.17077/etd.1ith8r87
Differential item functioning procedures for polytomous items when examinee sample sizes are small
  • Oct 6, 2011
  • Scott William Wood

As part of test score validity, differential item functioning (DIF) is a quantitative characteristic used to evaluate potential item bias. In applications where a small number of examinees take a test, the statistical power of DIF detection methods may be affected. Researchers have proposed modifications to DIF detection methods to account for small focal-group sample sizes in the case when items are dichotomously scored. These methods, however, have not been applied to polytomously scored items. Simulated polytomous item response strings were used to study the Type I error rates and statistical power of three popular DIF detection methods (Mantel test/Cox’s β, Liu-Agresti statistic, HW3) and three modifications proposed for contingency tables (empirical Bayesian, randomization, log-linear smoothing). The simulation considered two small-sample conditions: the case with 40 reference group and 40 focal group examinees, and the case with 400 reference group and 40 focal group examinees. In order to compare statistical power rates, it was necessary to calculate the Type I error rates for the DIF detection methods and their modifications. Under most simulation conditions, the unmodified, randomization-based, and log-linear smoothing-based Mantel and Liu-Agresti tests yielded Type I error rates around 5%. The HW3 statistic was found to yield higher Type I error rates than expected for the case with 40 reference group examinees, rendering power calculations for these cases meaningless. Results from the simulation suggested that the unmodified Mantel and Liu-Agresti tests yielded the highest statistical power rates for the pervasive-constant and pervasive-convergent patterns of DIF, as compared to other DIF method alternatives. Power rates improved by several percentage points if log-linear smoothing methods were applied to the contingency tables prior to using the Mantel or Liu-Agresti tests. 
Power rates did not improve if Bayesian methods or randomization tests were applied to the contingency tables prior to using the Mantel or Liu-Agresti tests. ANOVA tests showed that

  • Research Article
  • Cited by 2
  • 10.3389/feduc.2021.748884
Investigating the Distractors to Explain DIF Effects Across Gender in Large-Scale Tests With Non-Linear Logistic Regression Models
  • Jan 18, 2022
  • Frontiers in Education
  • Burhanettin Ozdemir + 1 more

The purpose of this study is to examine the distractors of items that exhibit differential item functioning (DIF) across gender to explain the possible sources of DIF in the context of large-scale tests. To this end, two nonlinear logistic regression (NLR) model-based DIF methods (three-parameter, 3PL-NLR, and four-parameter, 4PL-NLR) were first used to detect DIF items, and the Mantel-Haenszel Delta (MH-Delta) method was used to calculate the DIF effect size for each DIF item. Then, the multinomial log-linear regression (MLR) model and the 2PL nested logit model (2PL-NLM) were applied to items exhibiting moderate and large DIF effect sizes. The ultimate goals are (a) to examine the behavior of distractors across gender and (b) to investigate whether distractors have any impact on DIF effects. DIF results for the Art Section of the General Aptitude Test (GAT-ART) based on both the 3PL-NLR and 4PL-NLR methods indicate that only 10 DIF items had moderate to large DIF effect sizes. According to the MLR differential distractor functioning (DDF) results, all items exhibited DDF across gender except for one item. An interesting finding of this study is that DIF items related to verbal analogy and context analysis favored female students, while all DIF items related to the reading comprehension subdomain favored male students, which may signal the existence of content-specific DIF or a true ability difference across gender. DDF results show that distractors have a significant effect on DIF results. Therefore, DDF analysis is suggested alongside DIF analysis, since it signals the possible causes of DIF.

  • Research Article
  • Cited by 34
  • 10.1080/15305058.2002.9669493
Disentangling Sources of Differential Item Functioning in Multilanguage Assessments
  • Sep 1, 2002
  • International Journal of Testing
  • Kadriye Ercikan

This article describes and discusses strategies used in disentangling sources of differential item functioning (DIF) in multilanguage assessments where multiple factors are expected to be causing DIF. Three strategies are used for identifying adaptation and curricular differences as sources of DIF: (a) judgmental reviews by multiple bilingual translators of all items, (b) cross-validation of DIF in multiple groups, and (c) examination of the distribution of DIF items by topic. Twenty-seven percent of the mathematics DIF items and 37% of the science DIF items were interpreted to be due to adaptation-related differences based on judgmental reviews. Most of these interpretations were also supported by the cross-validation analyses. Clustering of DIF items by topic provided curricular differences as interpretation for DIF only for small portions of the DIF items, approximately 23% of the mathematics DIF items and 13% of the science DIF items.

  • Research Article
  • Cited by 5
  • 10.11607/ofph.3026
Differential Item Functioning of the Jaw Functional Limitation Scale
  • Jan 1, 2023
  • Journal of Oral and Facial Pain and Headache
  • Swaha Pattanaik + 2 more

This study assessed the differential item functioning (DIF) of the Jaw Functional Limitation Scale (JFLS) due to gender, age, and language (English vs Spanish). JFLS data were collected from a consecutive sample of 2,115 adult dental patients from HealthPartners dental clinics in Minnesota. Participants with missing data were excluded, and analyses were performed using data from 1,678 participants. We first examined whether the item response theory (IRT) model assumptions of essential unidimensionality and local independence held for the JFLS. Then, using Samejima's graded response model, the IRT log-likelihood ratio approach was used to detect DIF. The magnitude and impact of DIF were also assessed, based on Raju's noncompensatory DIF (NCDIF) cutoff value of 0.096, Cohen's effect sizes, and test (or scale) characteristic curves. Essential unidimensionality was confirmed, but locally dependent items were found on the JFLS. A few items were flagged with statistically significant DIF after adjustment for multiple comparisons. The NCDIF indices associated with all DIF items were < 0.096, and they had small effect sizes of ≤ 0.2. The differences between the expected scores shown in the test characteristic curves were little to none. The present results support the use of the JFLS summary score to obtain psychometrically robust score comparisons across English- and Spanish-speaking, male and female, and younger and older dental patients. Overall, the magnitude of DIF was relatively small, and the practical impact minimal.

  • Research Article
  • Cited by 76
  • 10.1111/j.1745-3984.2007.00029.x
DIF Detection and Effect Size Measures for Polytomously Scored Items
  • May 1, 2007
  • Journal of Educational Measurement
  • Seock‐Ho Kim + 3 more

Data from a large‐scale performance assessment (N = 105,731) were analyzed with five differential item functioning (DIF) detection methods for polytomous items to examine the congruence among the DIF detection methods. Two different versions of the item response theory (IRT) model‐based likelihood ratio test, the logistic regression likelihood ratio test, the Mantel test, and the generalized Mantel–Haenszel test were compared. Results indicated some agreement among the five DIF detection methods. Because statistical power is a function of the sample size, DIF detection results from extremely large data sets are not practically useful. As alternatives to the DIF detection methods, four IRT model‐based indices of standardized impact and four observed‐score indices of standardized impact for polytomous items were obtained and compared with the R2 measures of logistic regression.
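For readers unfamiliar with the Mantel–Haenszel family compared in this line of work, the dichotomous-item version that the generalized procedures extend fits in a few lines: stratify examinees by total score, accumulate a 2×2 table of group by response per stratum, and combine the tables into a chi-square statistic and a common odds ratio (here also rescaled to the ETS MH D-DIF delta metric). The sketch below, with synthetic data, is our own illustration under stated assumptions, not the procedure used in the study.

```python
import numpy as np

def mantel_haenszel_dif(item, total, group):
    """Classic Mantel-Haenszel DIF statistics for a dichotomous item.
    Returns (continuity-corrected MH chi-square, MH common odds ratio,
    MH D-DIF = -2.35 * ln(odds ratio), the ETS delta metric)."""
    a_sum = e_sum = var_sum = or_num = or_den = 0.0
    for k in np.unique(total):
        in_k = total == k
        a = float(np.sum((group[in_k] == 0) & (item[in_k] == 1)))  # reference correct
        b = float(np.sum((group[in_k] == 0) & (item[in_k] == 0)))  # reference incorrect
        c = float(np.sum((group[in_k] == 1) & (item[in_k] == 1)))  # focal correct
        d = float(np.sum((group[in_k] == 1) & (item[in_k] == 0)))  # focal incorrect
        n = a + b + c + d
        if n < 2 or (a + c) == 0 or (b + d) == 0 or (a + b) == 0 or (c + d) == 0:
            continue  # stratum carries no DIF information
        a_sum += a
        e_sum += (a + b) * (a + c) / n                         # E[a] under no DIF
        var_sum += (a + b) * (c + d) * (a + c) * (b + d) / (n * n * (n - 1))
        or_num += a * d / n
        or_den += b * c / n
    chi2 = (abs(a_sum - e_sum) - 0.5) ** 2 / var_sum           # continuity-corrected
    alpha = or_num / or_den                                    # MH common odds ratio
    return chi2, alpha, -2.35 * np.log(alpha)                  # ETS MH D-DIF

# Synthetic demo: 15 DIF-free anchor items plus a studied item that is one
# logit harder for the focal group at the same ability.
rng = np.random.default_rng(1)
n = 3000
group = rng.integers(0, 2, n)                # 0 = reference, 1 = focal
theta = rng.normal(0.0, 1.0, n)
b = np.linspace(-1.5, 1.5, 15)
anchors = (rng.random((n, 15)) < 1 / (1 + np.exp(-(theta[:, None] - b)))).astype(int)
studied = (rng.random(n) < 1 / (1 + np.exp(-(theta - 1.0 * group)))).astype(int)
total = anchors.sum(axis=1) + studied        # matching score includes the studied item

chi2, alpha, mh_d_dif = mantel_haenszel_dif(studied, total, group)
```

A common odds ratio above 1 indicates the reference group is advantaged at matched score levels, which the ETS rescaling turns into a negative MH D-DIF.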

  • Research Article
  • Cited by 3
  • 10.1186/s40536-024-00200-3
Investigating item complexity as a source of cross-national DIF in TIMSS math and science
  • Apr 22, 2024
  • Large-scale Assessments in Education
  • Qi Huang + 2 more

Background: Large-scale international assessments depend on invariance of measurement across countries. An important consideration when observing cross-national differential item functioning (DIF) is whether the DIF actually reflects a source of bias, or might instead be a methodological artifact reflecting item response theory (IRT) model misspecification. Determining the validity of the source of DIF has implications for how it is handled in practice.
Method: We demonstrate a form of sensitivity analysis that can point to model misspecification induced by item complexity as a possible cause of DIF, and show how such a cause of DIF might be accommodated through attempts to generalize the IRT model for the studied item(s) in psychometrically and psychologically plausible ways.
Results: In both simulated illustrations and empirical data from TIMSS 2011 and TIMSS 2019 4th- and 8th-Grade Math and Science, we found that the proposed form of IRT model generalization can substantially reduce DIF when IRT model misspecification is at least a partial cause of the observed DIF.
Conclusions: By demonstrating item complexity as a possible valid source of DIF and showing the effectiveness of the proposed approach, we recommend additional attention toward model generalizations as a means of addressing and/or understanding DIF.

  • Research Article
  • Cited by 8
  • 10.1007/s12564-009-9039-7
Examining type I error and power for detection of differential item and testlet functioning
  • Jun 10, 2009
  • Asia Pacific Education Review
  • Young-Sun Lee + 2 more

In this study, the effectiveness of detection of differential item functioning (DIF) and testlet DIF using SIBTEST and Poly-SIBTEST was examined in tests composed of testlets. An example using data from a reading comprehension test showed that results from SIBTEST and Poly-SIBTEST were not completely consistent in the detection of DIF and testlet DIF. Results from a simulation study indicated that SIBTEST appeared to maintain Type I error control for most conditions, except in some instances in which the magnitude of simulated DIF tended to increase. This same pattern was present for the Poly-SIBTEST results, although Poly-SIBTEST demonstrated markedly less control of Type I errors. Type I error control with Poly-SIBTEST was lower for those conditions in which the ability was unmatched to test difficulty. The power results for SIBTEST were not adversely affected when the size and percent of simulated DIF increased. Although Poly-SIBTEST failed to control Type I errors in over 85% of the conditions simulated, in those conditions for which Type I error control was maintained, Poly-SIBTEST demonstrated higher power than SIBTEST.
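At the core of SIBTEST, before its regression correction for measurement error in the matching score, is a focal-weighted difference in item means between matched groups. The sketch below computes only that uncorrected statistic on synthetic data; it is our own simplified illustration, not SIBTEST or Poly-SIBTEST as implemented in the study.

```python
import numpy as np

def sibtest_beta_uncorrected(item, matching, group):
    """Uncorrected SIBTEST-style statistic: the difference in mean item score
    between reference (group 0) and focal (group 1) examinees at each level
    of the matching score, averaged with focal-group weights. Operational
    SIBTEST additionally applies a regression correction for measurement
    error in the matching score, omitted here for brevity."""
    diffs, weights = [], []
    for k in np.unique(matching):
        at_k = matching == k
        ref = item[at_k & (group == 0)]
        foc = item[at_k & (group == 1)]
        if ref.size == 0 or foc.size == 0:
            continue  # a stratum must contain both groups to contribute
        diffs.append(ref.mean() - foc.mean())
        weights.append(foc.size)  # weight by focal-group count at score k
    return float(np.average(diffs, weights=weights))

# Synthetic demo: 15 DIF-free anchor items form the matching subtest; the
# studied item is one logit harder for the focal group at equal ability.
rng = np.random.default_rng(3)
n = 3000
group = rng.integers(0, 2, n)
theta = rng.normal(0.0, 1.0, n)
b = np.linspace(-1.5, 1.5, 15)
anchors = (rng.random((n, 15)) < 1 / (1 + np.exp(-(theta[:, None] - b)))).astype(int)
dif_item = (rng.random(n) < 1 / (1 + np.exp(-(theta - 1.0 * group)))).astype(int)
clean_item = (rng.random(n) < 1 / (1 + np.exp(-theta))).astype(int)
matching = anchors.sum(axis=1)  # matched on the anchor subtest score

beta_dif = sibtest_beta_uncorrected(dif_item, matching, group)
beta_clean = sibtest_beta_uncorrected(clean_item, matching, group)
```

A positive beta means the reference group outperforms matched focal examinees on the item; values near zero are expected for DIF-free items.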

  • Research Article
  • Cited by 66
  • 10.1007/s11136-007-9186-4
Evaluating measurement equivalence using the item response theory log-likelihood ratio (IRTLR) method to assess differential item functioning (DIF): applications (with illustrations) to measures of physical functioning ability and general distress
  • May 5, 2007
  • Quality of Life Research
  • Jeanne A Teresi + 8 more

Methods based on item response theory (IRT) that can be used to examine differential item functioning (DIF) are illustrated. An IRT-based approach to the detection of DIF was applied to physical function and general distress item sets. DIF was examined with respect to gender, age and race. The method used for DIF detection was the item response theory log-likelihood ratio (IRTLR) approach. DIF magnitude was measured using the differences in the expected item scores, expressed as the unsigned probability differences, and calculated using the non-compensatory DIF index (NCDIF). Finally, impact was assessed using expected scale scores, expressed as group differences in the total test (measure) response functions. The example for the illustration of the methods came from a study of 1,714 patients with cancer or HIV/AIDS. The measure contained 23 items measuring physical functioning ability and 15 items addressing general distress, scored in the positive direction. The substantive findings were of relatively small magnitude DIF. In total, six items showed relatively larger magnitude (expected item score differences greater than the cutoff) of DIF with respect to physical function across the three comparisons: "trouble with a long walk" (race), "vigorous activities" (race, age), "bending, kneeling stooping" (age), "lifting or carrying groceries" (race), "limited in hobbies, leisure" (age), "lack of energy" (race). None of the general distress items evidenced high magnitude DIF; although "worrying about dying" showed some DIF with respect to both age and race, after adjustment. The fact that many physical function items showed DIF with respect to age, even after adjustment for multiple comparisons, indicates that the instrument may be performing differently for these groups. 
While the magnitude and impact of DIF at the item and scale level was minimal, caution should be exercised in the use of subsets of these items, as might occur with selection for clinical decisions or computerized adaptive testing. The issues of selection of anchor items, and of criteria for DIF detection, including the integration of significance and magnitude measures remain as issues requiring investigation. Further research is needed regarding the criteria and guidelines appropriate for DIF detection in the context of health-related items.
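The NCDIF index used above has a simple form for a dichotomous item: the mean squared gap between the item response functions implied by the reference- and focal-group calibrations, averaged over the focal group's ability distribution. A sketch for a 2PL item follows; the study itself worked with polytomous items, and the parameter values below are invented purely for illustration.

```python
import numpy as np

def irf_2pl(theta, a, b):
    """Item response function (probability of endorsement) under a 2PL model."""
    return 1.0 / (1.0 + np.exp(-a * (theta - b)))

def ncdif_2pl(theta_focal, a_ref, b_ref, a_foc, b_foc):
    """Raju's noncompensatory DIF index for a dichotomous 2PL item: the mean
    squared difference between the reference- and focal-group item response
    functions, averaged over the focal group's ability distribution."""
    gap = irf_2pl(theta_focal, a_ref, b_ref) - irf_2pl(theta_focal, a_foc, b_foc)
    return float(np.mean(gap ** 2))

# Invented calibrations: identical parameters give NCDIF = 0 exactly, while a
# half-logit difficulty shift gives a small but clearly nonzero NCDIF.
rng = np.random.default_rng(2)
theta_focal = rng.normal(0.0, 1.0, 10_000)
ncdif_none = ncdif_2pl(theta_focal, 1.2, 0.0, 1.2, 0.0)
ncdif_shift = ncdif_2pl(theta_focal, 1.2, 0.0, 1.2, 0.5)
```

Because the index squares the gap, it flags group differences in either direction; flagged items are then compared against a cutoff chosen for the item type.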

  • Research Article
  • Cited by 1
  • 10.1080/10627197.2024.2374298
An Analysis of DIF and Sources of DIF in Achievement Motivation Items Using Anchoring Vignettes
  • Jul 8, 2024
  • Educational Assessment
  • J A Bialo + 1 more

This study evaluated differential item functioning (DIF) in achievement motivation items before and after using anchoring vignettes as a statistical tool to account for group differences in response styles across gender and ethnicity. We applied the nonparametric scoring of the vignettes to motivation items from the 2015 Programme for International Student Assessment (PISA) and examined changes to DIF characteristics between the raw self-report item scores and the vignette-adjusted item scores. Overall, applying the anchoring vignettes changed DIF classification, magnitude, direction, and form for some items. Group-specific response patterns by gender and by ethnicity group were also seen. Our findings contribute to the research literature on observed response styles in PISA motivation items, DIF in PISA items, and changes to DIF characteristics after adjusting self-report item scores with anchoring vignettes.
