Anchor Detection Strategy in Moderated Non-Linear Factor Analysis for Differential Item Functioning (DIF).
Ensuring measurement invariance is crucial for fair psychological and educational assessments, particularly in detecting Differential Item Functioning (DIF). Moderated Non-linear Factor Analysis (MNLFA) provides a flexible framework for detecting DIF by modeling item parameters as functions of observed covariates. However, a significant challenge in MNLFA-based DIF detection is anchor item selection, as improperly chosen anchors can bias results. This study proposes a refined constrained-baseline anchor detection approach utilizing information criteria (IC) for model selection. The proposed three-step procedure sequentially identifies potential DIF items through the Bayesian Information Criterion (BIC) and Weighted Information Criterion (WIC), followed by DIF-free anchor items using the Akaike Information Criterion (AIC). The method's effectiveness in controlling Type I error rates while maintaining statistical power is evaluated through simulation studies and empirical data analysis. Comparisons with regularization approaches demonstrate the proposed method's accuracy and computational efficiency.
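The information-criterion comparisons that drive each step of such a constrained-baseline procedure can be sketched in a few lines. This is a minimal illustration, not the study's implementation: the log-likelihood values and parameter counts below are invented, and in practice each value would come from fitting a full MNLFA model.

```python
import math

def aic(loglik, n_params):
    # AIC = -2 ln L + 2k
    return -2.0 * loglik + 2.0 * n_params

def bic(loglik, n_params, n_obs):
    # BIC = -2 ln L + k ln n
    return -2.0 * loglik + n_params * math.log(n_obs)

# Compare a constrained baseline (candidate item's parameters invariant)
# against a model freeing that item's parameters across covariates.
# Log-likelihoods and parameter counts here are hypothetical.
n_obs = 500
ll_baseline, k_baseline = -3120.4, 20   # item held DIF-free
ll_freed, k_freed = -3111.8, 22         # intercept + loading moderated

# Flag the item as a potential DIF item when freeing it improves (lowers) BIC.
flag_dif = bic(ll_freed, k_freed, n_obs) < bic(ll_baseline, k_baseline, n_obs)
```

The WIC and AIC steps follow the same comparison pattern, differing only in the penalty applied to the parameter count.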
- Research Article
- 10.1111/emip.12669
- May 20, 2025
- Educational Measurement: Issues and Practice
When investigating potential bias in educational test items via differential item functioning (DIF) analysis, researchers have historically been limited to comparing two groups of students at a time. The recent introduction of Moderated Nonlinear Factor Analysis (MNLFA) generalizes Item Response Theory models to extend the assessment of DIF to an arbitrary number of background variables. This facilitates more complex analyses such as DIF across more than two groups (e.g., low/middle/high socioeconomic status), across more than one background variable (e.g., DIF by race/ethnicity and gender), across non‐categorical background variables (e.g., DIF by parental income), and more. Framing MNLFA as a generalization of the two‐parameter logistic IRT model, we introduce the model with an emphasis on the parameters representing DIF versus impact; describe the current state of the art for estimating MNLFA models; and illustrate the application of MNLFA in a scenario where one wants to test for DIF across two background variables at once.
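Framed as a generalization of the two-parameter logistic model, one common MNLFA parameterization (the notation here is assumed for illustration, not taken from the article) lets both item parameters and the latent trait distribution vary with covariates $x_i$:

```latex
% Moderated 2PL: discrimination and difficulty vary with covariates x_i
P(y_{ij} = 1 \mid \eta_i, x_i)
  = \operatorname{logit}^{-1}\!\bigl[\, a_j(x_i)\,\bigl(\eta_i - b_j(x_i)\bigr) \bigr],
\qquad
a_j(x_i) = a_{0j} + a_{1j}' x_i, \quad
b_j(x_i) = b_{0j} + b_{1j}' x_i.

% Impact: the latent trait's mean and variance are moderated as well
\eta_i \sim N\!\bigl(\alpha' x_i,\; \exp(\beta' x_i)\bigr).
```

Nonzero moderation coefficients $a_{1j}$ or $b_{1j}$ correspond to DIF in item $j$, while nonzero $\alpha$ or $\beta$ correspond to impact, i.e., true differences in the latent trait across covariate values.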
- Research Article
- 10.1177/0013164414526881
- Mar 20, 2014
- Educational and Psychological Measurement
Conventional differential item functioning (DIF) detection methods (e.g., the Mantel–Haenszel test) can be used to detect DIF only across observed groups, such as gender or ethnicity. However, research has found that DIF is not typically fully explained by an observed variable. True sources of DIF may include unobserved, latent variables, such as personality or response patterns. The factor mixture model (FMM) is designed to detect unobserved sources of heterogeneity in factor models. The current study investigated use of the FMM for detecting between-class latent DIF and class-specific observed DIF. Factors that were manipulated included the DIF effect size and the latent class probabilities. The performance of model fit indices (Akaike information criterion [AIC], Bayesian information criterion [BIC], sample size–adjusted BIC, and consistent AIC) was assessed for their detection of the correct DIF model. The recovery of DIF parameters was also assessed. Results indicated that use of FMMs with binary outcomes performed well in terms of the DIF detection and for recovery of large DIF effects. When class probabilities were unequal with small DIF effects, performance decreased for fit indices, power, and the recovery of DIF effects compared with equal class probability conditions. Inflated Type I errors were found for non-DIF items across simulation conditions. Results and future research directions for applied and methodological researchers are discussed.
- Research Article
- 10.1080/15305058.2012.692415
- Jul 1, 2013
- International Journal of Testing
We evaluate the item response theory with covariates (IRT-C) procedure for assessing differential item functioning (DIF) without preknowledge of anchor items (Tay, Newman, & Vermunt, 2011). This procedure begins with a fully constrained baseline model, and candidate items are tested for uniform and/or nonuniform DIF using the Wald statistic. Candidate items are selected in turn based on high unconditional bivariate residual (UBVR) values. This iterative process continues until no further DIF is detected or the Bayes information criterion (BIC) increases. We expanded on the procedure and examined the use of conditional bivariate residuals (CBVR) to flag for DIF; aside from the BIC, alternative stopping criteria were also considered. Simulation results showed that the IRT-C approach for assessing DIF performed well, with the use of CBVR yielding slightly better power and Type I error rates than UBVR. Additionally, using no information criterion yielded higher power than using the BIC, although Type I error rates were generally well controlled in both cases. Across the simulation conditions, the IRT-C procedure produced results similar to the Mantel-Haenszel and MIMIC procedures.
- Research Article
- 10.1177/01466216211066606
- Feb 10, 2022
- Applied Psychological Measurement
Differential item functioning (DIF) analysis is one of the most important applications of item response theory (IRT) in psychological assessment. This study examined the performance of two Bayesian DIF methods, Bayes factor (BF) and deviance information criterion (DIC), with the generalized graded unfolding model (GGUM). The Type I error and power were investigated in a Monte Carlo simulation that manipulated sample size, DIF source, DIF size, DIF location, subpopulation trait distribution, and type of baseline model. We also examined the performance of two likelihood-based methods, the likelihood ratio (LR) test and Akaike information criterion (AIC), using marginal maximum likelihood (MML) estimation for comparison with past DIF research. The results indicated that the proposed BF and DIC methods provided well-controlled Type I error and high power using a free-baseline model implementation, and their performance was superior to that of LR and AIC in terms of Type I error rates when the reference and focal group trait distributions differed. The implications and recommendations for applied research are discussed.
- Research Article
- 10.1093/sleep/zsaf090.0507
- May 19, 2025
- SLEEP
Introduction Insomnia is common among veterans, particularly those with mental health conditions like depression and anxiety, and can lead to significant health complications. Routine screening in healthcare settings is crucial to prevent chronic insomnia. The Insomnia Severity Index (ISI), a widely used and validated tool, has been adapted for diverse populations, but its differential item functioning (DIF) remains underexplored. This study uses moderated nonlinear factor analysis (MNLFA) to address this gap. This flexible approach allows for simultaneous modeling of multiple sources of bias based on individual characteristics, which can improve the accuracy of insomnia severity ratings. Methods Veterans (N = 620) from the Miami VA sleep center completed a baseline psychosocial assessment and a home sleep apnea test (HSAT; mean AHI = 18), and medical/psychiatric diagnoses were extracted from medical records. MNLFA was used to model nighttime (ISI items 1a, 1b, and 1c) and daytime symptoms (items 2–5) separately, examining the effects of age, gender, race/ethnicity, depression, anxiety, PTSD, and chronic pain on DIF. DIF-adjusted factor scores, confirmatory factor analysis (CFA) factor scores, and sum scores were compared. Results The veteran sample (N = 620) was middle-aged (M = 52, SD = 14.5), predominantly male (83.5%), and White (57.3%), with 50% diagnosed with chronic pain and 51% with clinical depression. DIF analysis showed ISI1 had intercept bias for age, Hispanic/White identity, chronic pain, and depression, as well as factor loading bias for age. ISI3 had intercept bias for depression. ISI4 exhibited intercept and factor loading bias for male gender. ISI5 and ISI6 showed intercept bias for age, and ISI7 showed both intercept and factor-loading bias for PTSD. No DIF was found for AHI. Factor scores derived from MNLFA, CFA, and sum scores were highly correlated across both factors.
Conclusion This study examined the DIF of the ISI by investigating how an array of psychosocial factors influences insomnia severity ratings. Six of the seven ISI items demonstrated bias based on age, gender, race, depression, PTSD, and chronic pain. Differences observed between groups with these characteristics may therefore partly reflect measurement bias rather than true differences in insomnia severity. MNLFA demonstrated methodological advantages by allowing simultaneous modeling of multiple sources of DIF. Although difficult to implement in primary care, MNLFA-based factor scores hold promise for secondary predictive models.
- Supplementary Content
- 10.3200/jexe.72.3.221-261
- Apr 1, 2004
- The Journal of Experimental Education
Scale indeterminacy in analysis of differential item functioning (DIF) within the framework of item response theory can be resolved by one of three anchor item methods: the equal-mean-difficulty method, the all-other anchor item method, and the constant anchor item method. In this article, the applicability and limitations of these three methods are discussed and their performance in DIF detection is compared using Monte Carlo simulations within the family of Rasch models (Rasch, 1960). The results show that when the test contained multiple DIF items, the equal-mean-difficulty method and the all-other method functioned appropriately only when the difference in the mean item difficulties between the reference and focal groups approached zero. In contrast, the constant method yielded unbiased parameter estimates, well-controlled Type I error, and high power of DIF detection, regardless of large differences in the mean item difficulties between groups and high percentages of DIF items in the tests. In addition, the more anchor items in the constant method, the higher the power of detecting DIF. Therefore, the constant anchor item method is recommended when conducting DIF analysis. Methods of locating anchor items for implementing the constant method are also discussed.
- Research Article
- 10.1177/0013164413506222
- Oct 16, 2013
- Educational and Psychological Measurement
Invariant relationships in the internal mechanisms of estimating achievement scores on educational tests serve as the basis for concluding that a particular test is fair with respect to statistical bias concerns. Equating invariance and differential item functioning are both concerned with invariant relationships yet are treated separately in the psychometric literature. Connecting these two facets of statistical invariance is critical for developing a holistic definition of fairness in educational measurement, for fostering a deeper understanding of the nature and causes of equating invariance and a lack thereof, and for providing practitioners with guidelines for addressing reported score-level equity concerns. This study hypothesizes that differential item functioning manifested in anchor items of an assessment will have an effect on equating dependence. Findings show that when anchor item differential item functioning varies across forms in a differential manner across subpopulations, population invariance of equating can be compromised.
- Research Article
- 10.1037/met0000077
- Sep 1, 2017
- Psychological methods
The evaluation of measurement invariance is an important step in establishing the validity and comparability of measurements across individuals. Most commonly, measurement invariance has been examined using one of two primary latent variable modeling approaches: the multiple groups model or the multiple-indicator multiple-cause (MIMIC) model. Both approaches offer opportunities to detect differential item functioning within multi-item scales, and thereby to test measurement invariance, but both approaches also have significant limitations. The multiple groups model allows one to examine the invariance of all model parameters but only across levels of a single categorical individual difference variable (e.g., ethnicity). In contrast, the MIMIC model permits both categorical and continuous individual difference variables (e.g., sex and age) but permits only a subset of the model parameters to vary as a function of these characteristics. The current article argues that moderated nonlinear factor analysis (MNLFA) constitutes an alternative, more flexible model for evaluating measurement invariance and differential item functioning. We show that the MNLFA subsumes and combines the strengths of the multiple group and MIMIC models, allowing for a full and simultaneous assessment of measurement invariance and differential item functioning across multiple categorical and/or continuous individual difference variables. The relationships between the MNLFA model and the multiple groups and MIMIC models are shown mathematically and via an empirical demonstration.
- Research Article
- 10.1016/j.addbeh.2021.107088
- Aug 17, 2021
- Addictive Behaviors
Comprehensive measurement invariance of alcohol outcome expectancies among adolescents using regularized moderated nonlinear factor analysis
- Research Article
- 10.1016/j.drugalcdep.2021.109068
- Sep 24, 2021
- Drug and Alcohol Dependence
An application of moderated nonlinear factor analysis to develop a commensurate measure of alcohol problems across four alcohol treatment studies
- Research Article
- 10.1097/mlr.0b013e318207edb5
- May 1, 2011
- Medical Care
To propose a permutation-based approach to anchor item detection and evaluate differential item functioning (DIF) related to language of administration (English vs. Spanish) for 9 questions assessing patients' perceptions of their providers from the Consumer Assessment of Healthcare Providers and Systems (CAHPS) Medicare 2.0 survey. METHOD AND STUDY DESIGN: CAHPS 2.0 health plan survey data collected from 703 Hispanics who completed the survey in Spanish were matched on personal characteristics to 703 Hispanics who completed the survey in English. Steps for detecting anchor items using permutation tests are proposed, and these tests, in conjunction with item response theory, were used to identify anchor items and detect DIF. Of the questions studied, 4 were selected as anchor items and 3 of the remaining questions were found to have DIF (P < 0.05). The 3 questions with DIF asked about seeing the doctor within 15 minutes of the appointment time, respect for what patients had to say, and the provider spending enough time with patients. Failure to account for language differences in CAHPS survey items may result in misleading conclusions about disparities in health care experiences between Spanish and English speakers. Statistical adjustments are needed when using the items with DIF.
- Research Article
- 10.1111/j.1745-3984.2001.tb01121.x
- Jun 1, 2001
- Journal of Educational Measurement
Increasingly, tests are being translated and adapted into different languages. Differential item functioning (DIF) analyses are often used to identify non‐equivalent items across language groups. However, few studies have focused on understanding why some translated items produce DIF. The purpose of the current study is to identify sources of differential item and bundle functioning on translated achievement tests using substantive and statistical analyses. A substantive analysis of existing DIF items was conducted by an 11‐member committee of testing specialists. In their review, four sources of translation DIF were identified. Two certified translators used these four sources to categorize a new set of DIF items from Grade 6 and 9 Mathematics and Social Studies Achievement Tests. Each item was associated with a specific source of translation DIF and each item was anticipated to favor a specific group of examinees. Then, a statistical analysis was conducted on the items in each category using SIBTEST. The translators sorted the mathematics DIF items into three sources, and they correctly predicted the group that would be favored for seven of the eight items or bundles of items across two grade levels. The translators sorted the social studies DIF items into four sources, and they correctly predicted the group that would be favored for eight of the 13 items or bundles of items across two grade levels. The majority of items in mathematics and social studies were associated with differences in the words, expressions, or sentence structure of items that are not inherent to the language and/or culture. By combining substantive and statistical DIF analyses, researchers can study the sources of DIF and create a body of confirmed DIF hypotheses that may be used to develop guidelines and test construction principles for reducing DIF on translated tests.
- Research Article
- 10.3102/10769986221109208
- Jul 18, 2022
- Journal of Educational and Behavioral Statistics
Differential item functioning (DIF) occurs when the probability of endorsing an item differs across groups for individuals with the same latent trait level. The presence of DIF items may jeopardize the validity of an instrument; therefore, it is crucial to identify DIF items in routine operations of educational assessment. While DIF detection procedures based on item response theory (IRT) have been widely used, a majority of IRT-based DIF tests assume predefined anchor (i.e., DIF-free) items. Not only is this assumption strong, but violations to it may also lead to erroneous inferences, for example, an inflated Type I error rate. We propose a general framework to define the effect sizes of DIF without a priori knowledge of anchor items. In particular, we quantify DIF by item-specific residuals from a regression model fitted to the true item parameters in respective groups. Moreover, the null distribution of the proposed test statistic using a robust estimator can be derived analytically or approximated numerically even when there is a mix of DIF and non-DIF items, which yields asymptotically justified statistical inference. The Type I error rate and the power performance of the proposed procedure are evaluated and compared with the conventional likelihood-ratio DIF tests in a Monte Carlo experiment. Our simulation study has shown promising results in controlling the Type I error rate while maintaining power to detect DIF items. Even when there is a mix of DIF and non-DIF items, the true and false alarm rates can be well controlled when a robust regression estimator is used.
- Research Article
- 10.3102/10769986231226439
- Feb 5, 2024
- Journal of Educational and Behavioral Statistics
Testing for differential item functioning (DIF) has undergone rapid statistical developments recently. Moderated nonlinear factor analysis (MNLFA) allows for simultaneous testing of DIF among multiple categorical and continuous covariates (e.g., sex, age, ethnicity), and regularization has shown promising results for identifying DIF among many covariates. However, computationally inefficient estimation methods have hampered practical use of the regularized MNLFA method. We develop a penalized expectation–maximization (EM) algorithm with soft- and firm-thresholding to more efficiently estimate regularized MNLFA parameters. Simulation and empirical results show that, compared to previous implementations of regularized MNLFA, the penalized EM algorithm is faster, more flexible, and more statistically principled. This method also yields similar recovery of DIF relative to previous implementations, suggesting that regularized DIF detection remains a preferred approach over traditional methods of identifying DIF.
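The soft- and firm-thresholding updates inside such a penalized EM step are simple scalar operators applied to each DIF coefficient. A minimal sketch, assuming a lasso-style soft threshold and an MCP-style firm threshold; the function names and the specific firm-thresholding rule are illustrative assumptions, not the paper's exact implementation:

```python
import math

def soft_threshold(z, lam):
    # S(z, lam) = sign(z) * max(|z| - lam, 0): shrinks a coefficient
    # toward zero and sets small DIF effects exactly to zero.
    return math.copysign(max(abs(z) - lam, 0.0), z)

def firm_threshold(z, lam, gamma=3.0):
    # MCP-style firm thresholding (gamma > 1 assumed): shrinks small
    # coefficients like the soft rule but leaves large ones untouched,
    # reducing the bias that the lasso penalty induces on strong DIF.
    if abs(z) <= gamma * lam:
        return gamma / (gamma - 1.0) * soft_threshold(z, lam)
    return z
```

In a penalized EM iteration, each covariate-moderation coefficient from the unpenalized M-step update would be passed through one of these operators; coefficients thresholded to zero correspond to items declared DIF-free for that covariate.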
- Research Article
- 10.5897/jdae.9000032
- Jan 31, 2010
- Journal of Development and Agricultural Economics
Information criteria provide an attractive basis for model selection. However, little is understood about their relative performance in the asymmetric price transmission modelling framework. To explore this issue, this research evaluated the performance of the two commonly used model selection criteria, the Akaike information criterion (AIC) and the Bayesian information criterion (BIC), in discriminating between asymmetric price transmission models under various conditions. Monte Carlo experimentation indicated that the performance of the different model selection criteria is affected by the size of the data, the level of asymmetry, and the amount of noise in the model used in the application. The Bayesian information criterion is consistent and outperforms AIC in selecting the suitable asymmetric price relationship in large samples. Key words: model selection, Akaike information criterion (AIC), Bayesian information criterion (BIC), asymmetry, Monte Carlo.