Using Firth’s Penalized Maximum Likelihood Estimation for Logistic Regression to Detect Polytomous Differential Item Functioning

TL;DR

This study compares Firth’s penalized maximum likelihood (PML) with traditional maximum likelihood (ML) estimation in logistic regression for polytomous differential item functioning (DIF) detection. PML offers higher power in unbalanced samples, especially under non-uniform DIF, at the cost of higher Type I error rates. The authors give practical guidance across sample sizes and item difficulties.

Abstract

Logistic regression is one of the common methods for differential item functioning (DIF) detection and is typically estimated by maximum likelihood estimation (ML), despite the fact that ML may produce biased estimates in situations of rare event data and small sample sizes. Firth’s penalization method (PML) may address this issue. In the current study, we compared PML with ML estimation in logistic regression for polytomous DIF detection, focusing on rare event, small, and unbalanced data. The manipulated factors included item average difficulties, sample sizes, group impact, and DIF types and magnitudes. PML demonstrated higher power than ML in unbalanced samples, especially in non-uniform DIF conditions, but also demonstrated a trade-off of having higher Type I error rates than ML. We provide practitioners with DIF estimation suggestions across various sample sizes and item difficulties with respect to concerns of power and Type I error rates.
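
As a rough illustration of the estimation idea in the abstract, the sketch below fits a binary logistic regression either by plain ML or with Firth's penalty, which maximizes the log-likelihood plus half the log-determinant of the Fisher information. It is not the authors' code: the study concerns polytomous items, while this sketch handles only a binary outcome, and the function names (`invert`, `fit_logistic`) and the toy data are assumptions made for illustration.

```python
import math

def invert(a):
    """Invert a small square matrix by Gauss-Jordan elimination with partial pivoting."""
    n = len(a)
    m = [list(row) + [1.0 if i == j else 0.0 for j in range(n)]
         for i, row in enumerate(a)]
    for c in range(n):
        piv = max(range(c, n), key=lambda r: abs(m[r][c]))
        m[c], m[piv] = m[piv], m[c]
        d = m[c][c]
        m[c] = [v / d for v in m[c]]
        for r in range(n):
            if r != c:
                f = m[r][c]
                m[r] = [vr - f * vc for vr, vc in zip(m[r], m[c])]
    return [row[n:] for row in m]

def fit_logistic(X, y, firth=False, max_iter=50, tol=1e-8):
    """Fisher-scoring fit of a binary logistic model.

    With firth=True the score is replaced by Firth's modified score,
    sum_i x_i * (y_i - p_i + h_i * (0.5 - p_i)), where h_i is the i-th
    diagonal element of the weighted hat matrix.
    """
    k = len(X[0])
    beta = [0.0] * k
    for _ in range(max_iter):
        p = [1.0 / (1.0 + math.exp(-sum(b * x for b, x in zip(beta, row))))
             for row in X]
        w = [pi * (1.0 - pi) for pi in p]
        # Fisher information X' W X and its inverse
        info = [[sum(wi * row[a] * row[b] for wi, row in zip(w, X))
                 for b in range(k)] for a in range(k)]
        cov = invert(info)
        resid = [yi - pi for yi, pi in zip(y, p)]
        if firth:
            for i, row in enumerate(X):
                cx = [sum(cov[a][b] * row[b] for b in range(k)) for a in range(k)]
                h = w[i] * sum(row[a] * cx[a] for a in range(k))
                resid[i] += h * (0.5 - p[i])
        score = [sum(r * row[a] for r, row in zip(resid, X)) for a in range(k)]
        step = [sum(cov[a][b] * score[b] for b in range(k)) for a in range(k)]
        beta = [b + s for b, s in zip(beta, step)]
        if max(abs(s) for s in step) < tol:
            break
    return beta

# Toy data (hypothetical): intercept plus a group indicator.
# Group 0 has 3 successes in 10; group 1 has 0 in 10 (a rare-event cell).
X = [[1.0, 0.0]] * 10 + [[1.0, 1.0]] * 10
y = [1, 1, 1] + [0] * 7 + [0] * 10
beta_firth = fit_logistic(X, y, firth=True)
```

In this toy design the ML slope has no finite value because one group has no successes, so only the penalized fit is run. For a saturated two-group model like this one, the Firth estimate coincides with adding 0.5 to every cell count, which gives a simple check on the implementation.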

Similar Papers
  • Research Article
  • Cite Count Icon 3
  • 10.7939/r3bv7b69d
The effect of large ability differences on type I error and power rates using SIBTEST and TESTGRAF DIF detection procedures
  • Apr 1, 2002
  • University of Alberta Library
  • Andrea Gotzmann

A simulation study was conducted to examine the effect of large ability differences using two differential item functioning (DIF) detection procedures, SIBTEST and TESTGRAF. DIF items are hard to identify when group ability differences are large (Gotzmann, Vandenberghe, & Gierl, 2000; Hambleton & Rogers, 1989). This problem was investigated in the current study for the SIBTEST and TESTGRAF DIF detection procedures. Four ability differences (0.0, -1.0, -1.5, -2.0) and eight sample sizes (500/500, 750/1000, 1000/1000, 750/1500, 1000/1500, 1500/1500, 1000/2000, 2000/2000) were manipulated in a simulation study. Type I error and power rates were computed. The SIBTEST Type I error rates were inflated at the larger ability differences. Conversely, the TESTGRAF Type I error rates remained low for most ability differences and sample sizes. The SIBTEST power rates remained high, even with larger ability differences. The TESTGRAF power rates dropped as ability differences were introduced. Educational practitioners and test developers often find large test score differences when comparing examinees with diverse ethnic backgrounds (Berends & Koretz, 1996; Cameron, 1990; Freedle & Kostin, 1990; Scheuneman & Grima, 1997; Schmitt & Dorans, 1990). Reducing these differences is one goal in the educational reform movement (Barron & Koretz, 1996). These large test score differences are particularly noteworthy when Native and non-Native examinees are compared (Alberta Education, 1996; Gotzmann, Vandenberghe, & Gierl, 2000; Hambleton & Rogers, 1989; Vandenberghe & Gierl, 2001). Socioeconomic and cultural differences may contribute to these performance differences (Common & Frost, 1989; Hull, 1990; Trent & Gilman, 1985; Wood & Clay, 1996).
However, few researchers have studied item-level outcomes which may explain why Native examinees score lower than non-Native examinees (Gotzmann et al., 2000; Hambleton & Rogers, 1989). Native examinee scores may be biased due to factors in test development. For example, Janzen (2000) and Krywaniuk and Das (1976) found that Native children are more likely to use simultaneous processing skills and non-Native children are more likely to use successive processing skills. If exams have a small number of items that elicit simultaneous processing skills, then these exams may put Native examinees at a disadvantage. Therefore, assessment of bias at the item level, and its contribution to the total test score differences, should be studied. Item bias can be estimated with different methods. Traditionally, item-level differences between groups have been assessed by comparing the proportion correct for each group (Lord, 1980). However, this method has one major flaw. The proportion correct method compares all examinees, regardless of ability level. Thus, the proportion correct is dependent upon the sample of examinees (see Camilli & Shepard, 1994). To overcome this problem, statistical methods can be used to determine whether differential item functioning (DIF) is present. DIF occurs when examinees from different groups have a different probability of answering the item correctly, after controlling for overall ability. In these comparisons, the majority group is called the reference group and the minority group is called the focal group. DIF methods are used to estimate bias by matching examinees on an internal measure of ability or overall test score performance and comparing these examinees at the item level. This approach removes total test score differences in the estimation process, which provides a stronger measure of the actual group differences on the item.
There are many statistical procedures to estimate DIF including Item Response Theory (IRT) area measures (Lord, 1980; Thissen, Steinberg, & Wainer, 1988), Mantel-Haenszel (Holland & Thayer, 1988), Logistic Regression (Swaminathan & Rogers, 1990), Simultaneous Item Bias Test (SIBTEST; Shealy & Stout, 1993), and TESTGRAF (Ramsay, 1991, 2000). Most of these procedures have been used to identify DIF between ethnic groups. However, the SIBTEST and TESTGRAF procedures may be suitable when large ability differences are found. Further, both of these DIF detection procedures can be used with small sample sizes and both yield comparable DIF measures (Ramsay, 1991, 2000; Shealy & Stout, 1993). However, these procedures also have a noteworthy difference. SIBTEST uses a regression correction to estimate

  • Research Article
  • Cite Count Icon 8
  • 10.1155/2020/1632350
A Comparative Study of the Bias Correction Methods for Differential Item Functioning Analysis in Logistic Regression with Rare Events Data
  • Jan 1, 2020
  • BioMed Research International
  • Marjan Faghih + 4 more

The logistic regression (LR) model for assessing differential item functioning (DIF) is highly dependent on the asymptotic sampling distributions. However, for rare events data, the maximum likelihood estimation method may be biased and the asymptotic distributions may not be reliable. In this study, the performance of the regular maximum likelihood (ML) estimation is compared with two bias correction methods including weighted logistic regression (WLR) and Firth's penalized maximum likelihood (PML) to assess DIF for imbalanced or rare events data. The power and type I error rate of the LR model for detecting DIF were investigated under different combinations of sample size, moderate and severe magnitudes of uniform DIF (DIF = 0.4 and 0.8), sample size ratio, number of items, and the imbalanced degree (τ). Indeed, as compared with WLR and for severe imbalanced degree (τ = 0.069), there were reductions of approximately 30% and 24% under DIF = 0.4 and 27% and 23% under DIF = 0.8 in the power of the PML and ML, respectively. The present study revealed that the WLR outperforms both the ML and PML estimation methods when logistic regression is used to evaluate DIF for imbalanced or rare events data.

  • Research Article
  • Cite Count Icon 23
  • 10.1177/00131649921970251
A Comparison of Logistic Regression and Analysis of Variance Differential Item Functioning Detection Methods
  • Dec 1, 1999
  • Educational and Psychological Measurement
  • Marjorie L Whitmore + 1 more

Differential item functioning (DIF) detection rates were compared between logistic regression and analysis of variance for dichotomously scored items. These two DIF methods were compared using simulated binary item response data sets of varying test length (20, 40, and 60 items), sample size (200, 400, and 600 examinees), discrimination type (fixed and varying), and relative underlying ability (equal and unequal) between groups under conditions of uniform DIF, nonuniform DIF, combination DIF, and false positive errors. These test conditions were replicated 100 times. For both DIF detection methods, a test length of 20 items was sufficient for satisfactory DIF detection with detection rate increasing as sample size increased. With the exception of uniform DIF, the logistic regression method had higher mean detection rates than the analysis of variance method. Because the type of DIF present in real data is rarely known, the logistic regression method is recommended for most practical applications.

  • Research Article
  • Cite Count Icon 8
  • 10.1007/s12564-009-9039-7
Examining type I error and power for detection of differential item and testlet functioning
  • Jun 10, 2009
  • Asia Pacific Education Review
  • Young-Sun Lee + 2 more

In this study, the effectiveness of detection of differential item functioning (DIF) and testlet DIF using SIBTEST and Poly-SIBTEST were examined in tests composed of testlets. An example using data from a reading comprehension test showed that results from SIBTEST and Poly-SIBTEST were not completely consistent in the detection of DIF and testlet DIF. Results from a simulation study indicated that SIBTEST appeared to maintain type I error control for most conditions, except in some instances in which the magnitude of simulated DIF tended to increase. This same pattern was present for the Poly-SIBTEST results, although Poly-SIBTEST demonstrated markedly less control of type I errors. Type I error control with Poly-SIBTEST was lower for those conditions for which the ability was unmatched to test difficulty. The power results for SIBTEST were not adversely affected, when the size and percent of simulated DIF increased. Although Poly-SIBTEST failed to control type I errors in over 85% of the conditions simulated, in those conditions for which type I error control was maintained, Poly-SIBTEST demonstrated higher power than SIBTEST.

  • Research Article
  • Cite Count Icon 113
  • 10.1177/0013164406294781
Iterative Purification and Effect Size Use With Logistic Regression for Differential Item Functioning Detection
  • Jun 1, 2007
  • Educational and Psychological Measurement
  • Brian F French + 1 more

Two unresolved implementation issues with logistic regression (LR) for differential item functioning (DIF) detection include ability purification and effect size use. Purification is suggested to control inaccuracies in DIF detection as a result of DIF items in the ability estimate. Additionally, effect size use may be beneficial in controlling Type I error rates. The effectiveness of such controls, especially used in combination, requires evaluation. Detection errors were evaluated through simulation across iterative purification and no purification procedures with and without the use of an effect size criterion. Sample size, DIF magnitude and percentage, and ability differences were manipulated. Purification was beneficial under certain conditions, although overall power and Type I error rates did not substantially improve. The LR statistical test without purification performed as well as other classification criteria and may be the practical choice for many situations. Continued evaluation of the effect size guidelines and purification are discussed.

  • Research Article
  • Cite Count Icon 474
  • 10.1207/s15324818ame1404_2
Evaluating Type I Error and Power Rates Using an Effect Size Measure With the Logistic Regression Procedure for DIF Detection
  • Oct 1, 2001
  • Applied Measurement in Education
  • Michael G Jodoin + 1 more

The logistic regression (LR) procedure for differential item functioning (DIF) detection is a model-based approach designed to identify both uniform and nonuniform DIF. However, this procedure tends to produce inflated Type I errors. This outcome is problematic because it can result in the inefficient use of testing resources, and it may interfere with the study of the underlying causes of DIF. Recently, an effect size measure was developed for the LR DIF procedure and a classification method was proposed. However, the effect size measure and classification method have not been systematically investigated. In this study, we developed a new classification method based on those established for the Simultaneous Item Bias Test. A simulation study also was conducted to determine if the effect size measure affects the Type I error and power rates for the LR DIF procedure across sample sizes, ability distributions, and percentage of DIF items included on a test. The results indicate that the inclusion of the effect size measure can substantially reduce Type I error rates when large sample sizes are used, although there is also a reduction in power.
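
For orientation, the LR DIF procedure for a dichotomous item is commonly written as a sequence of nested models, in the form usually attributed to Swaminathan and Rogers (1990). The symbols below are notation chosen for this sketch, not taken from the abstract: X is the matching ability score, G is the group indicator, and ℓ_k is the maximized log-likelihood of model M_k.

```latex
\begin{align}
M_0 &: \operatorname{logit} P(Y = 1) = \beta_0 + \beta_1 X \\
M_1 &: \operatorname{logit} P(Y = 1) = \beta_0 + \beta_1 X + \beta_2 G \\
M_2 &: \operatorname{logit} P(Y = 1) = \beta_0 + \beta_1 X + \beta_2 G + \beta_3 X G
\end{align}

% Uniform DIF:     \beta_2 \neq 0 with \beta_3 = 0
% Nonuniform DIF:  \beta_3 \neq 0
% Joint test of both DIF terms (2 degrees of freedom):
\[
G^2 = -2\,(\ell_0 - \ell_2) \;\sim\; \chi^2_{2} \quad \text{under } H_0:\ \beta_2 = \beta_3 = 0
\]
```

Because this significance test gains power with sample size, an effect size criterion such as the change in R² between M_0 and M_2 is often paired with it, which is the combination the abstract evaluates.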

  • Research Article
  • Cite Count Icon 18
  • 10.1177/0146621611420559
Accuracy of DIF Estimates and Power in Unbalanced Designs Using the Mantel–Haenszel DIF Detection Procedure
  • Oct 1, 2011
  • Applied Psychological Measurement
  • Insu Paek + 1 more

This study examined how much improvement was attainable with respect to accuracy of differential item functioning (DIF) measures and DIF detection rates in the Mantel–Haenszel procedure when employing focal and reference groups with notably unbalanced sample sizes where the focal group has a fixed small sample which does not satisfy the minimum DIF sample size requirement specified by the testing programs, while the reference group sample size far exceeds the minimum requirement. Results showed equivalent or better results with such unbalanced but large samples than with some of the currently used minimum DIF sample size conditions. DIF investigation, therefore, does not necessarily need to cease when the focal group does not meet the minimum sample size requirement. Some analytic explanations and guidelines for DIF investigations with unbalanced sample sizes are also provided.
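
For readers unfamiliar with the Mantel–Haenszel DIF measure, the following minimal sketch computes the common odds ratio across matched score strata and converts it to the ETS delta scale. The function names and the example counts are hypothetical, and the sketch omits the chi-square test and any continuity correction.

```python
import math

def mantel_haenszel_odds_ratio(strata):
    """Mantel-Haenszel common odds ratio.

    Each stratum is a tuple (a, b, c, d):
      a = reference group correct,  b = reference group incorrect,
      c = focal group correct,      d = focal group incorrect.
    """
    num = 0.0
    den = 0.0
    for a, b, c, d in strata:
        n = a + b + c + d
        if n == 0:
            continue  # skip empty score levels
        num += a * d / n
        den += b * c / n
    return num / den

def mh_delta(alpha):
    """ETS delta metric: negative values indicate an item favoring the reference group."""
    return -2.35 * math.log(alpha)

# Hypothetical counts at two matched score levels, with an unbalanced design
# in mind one would simply have much smaller focal-group counts (c, d).
tables = [(10, 5, 5, 10), (20, 10, 10, 20)]
alpha_mh = mantel_haenszel_odds_ratio(tables)
delta_mh = mh_delta(alpha_mh)
```

Each stratum is weighted by its total count, so strata where the focal group is sparse contribute little to the estimate.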

  • Research Article
  • Cite Count Icon 23
  • 10.3102/1076998616659371
Detection of Uniform and Nonuniform Differential Item Functioning by Item-Focused Trees
  • Jul 28, 2016
  • Journal of Educational and Behavioral Statistics
  • Moritz Berger + 1 more

Detection of differential item functioning (DIF) by use of the logistic modeling approach has a long tradition. One big advantage of the approach is that it can be used to investigate nonuniform (NUDIF) as well as uniform DIF (UDIF). The classical approach allows one to detect DIF by distinguishing between multiple groups. We propose an alternative method that is a combination of recursive partitioning methods (or trees) and logistic regression methodology to detect UDIF and NUDIF in a nonparametric way. The output of the method are trees that visualize in a simple way the structure of DIF in an item showing which variables are interacting in which way when generating DIF. In addition, we consider a logistic regression method, in which DIF can be induced by a vector of covariates, which may include categorical but also continuous covariates. The methods are investigated in simulation studies and illustrated by two applications.

  • Research Article
  • Cite Count Icon 4
  • 10.1007/s11136-022-03129-8
A comparison of methods to address item non-response when testing for differential item functioning in multidimensional patient-reported outcome measures.
  • Apr 7, 2022
  • Quality of life research : an international journal of quality of life aspects of treatment, care and rehabilitation
  • Olawale F Ayilara + 5 more

Item non-response (i.e., missing data) may mask the detection of differential item functioning (DIF) in patient-reported outcome measures or result in biased DIF estimates. Non-response can be challenging to address in ordinal data. We investigated an unsupervised machine-learning method for ordinal item-level imputation and compared it with commonly-used item non-response methods when testing for DIF. Computer simulation and real-world data were used to assess several item non-response methods using the item response theory likelihood ratio test for DIF. The methods included: (a) list-wise deletion (LD), (b) half-mean imputation (HMI), (c) full information maximum likelihood (FIML), and (d) non-negative matrix factorization (NNMF), which adopts a machine-learning approach to impute missing values. Control of Type I error rates was evaluated using a liberal robustness criterion for α = 0.05 (i.e., 0.025-0.075). Statistical power was assessed with and without adoption of an item non-response method; differences > 10% were considered substantial. Type I error rates for detecting DIF using LD, FIML and NNMF methods were controlled within the bounds of the robustness criterion for > 95% of simulation conditions, although the NNMF occasionally resulted in inflated rates. The HMI method always resulted in inflated error rates with 50% missing data. Differences in power to detect moderate DIF effects for LD, FIML and NNMF methods were substantial with 50% missing data and otherwise insubstantial. The NNMF method demonstrated comparable performance to commonly-used non-response methods. This computationally-efficient method represents a promising approach to address item-level non-response when testing for DIF.

  • Research Article
  • Cite Count Icon 3
  • 10.1177/00131644211028995
DIF Detection With Zero-Inflation Under the Factor Mixture Modeling Framework.
  • Jul 26, 2021
  • Educational and psychological measurement
  • Sooyong Lee + 2 more

Response data containing an excessive number of zeros are referred to as zero-inflated data. When differential item functioning (DIF) detection is of interest, zero-inflation can attenuate DIF effects in the total sample and lead to underdetection of DIF items. The current study presents a DIF detection procedure for response data with excess zeros due to the existence of unobserved heterogeneous subgroups. The suggested procedure utilizes the factor mixture modeling (FMM) with MIMIC (multiple-indicator multiple-cause) to address the compromised DIF detection power via the estimation of latent classes. A Monte Carlo simulation was conducted to evaluate the suggested procedure in comparison to the well-known likelihood ratio (LR) DIF test. Our simulation study results indicated the superiority of FMM over the LR DIF test in terms of detection power and illustrated the importance of accounting for latent heterogeneity in zero-inflated data. The empirical data analysis results further supported the use of FMM by flagging additional DIF items over and above the LR test.

  • Research Article
  • Cite Count Icon 35
  • 10.1027/1614-2241.5.1.18
Efficacy of Effect Size Measures in Logistic Regression
  • Jan 1, 2009
  • Methodology
  • Juana Gómez-Benito + 2 more

Statistical techniques based on logistic regression (LR) are adequate for the detection of differential item functioning (DIF) in dichotomous items. Nevertheless, they return more false positives (FPs) than do other DIF detection techniques. This paper compares the efficacy of DIF detection using the LR significance test and the estimation of the effect size that these procedures provide using Nagelkerke's R². The variables manipulated were different conditions of sample size, focal and reference group sample size ratio, amount of DIF, test length and percentage of test items with DIF. In addition, examinee responses were generated to simulate both uniform and nonuniform DIF (symmetric and asymmetric). In all cases, dichotomous response tests were used. The results show that the use of R² as a strategy for detecting DIF obtained lower correct detection percentages than those obtained from significance tests. Moreover, the LR significance test showed adequate control of FP rates, close to the nominal 5%, although the rate was slightly higher than the nominal 5% when the sample size was smaller. However, when the effect size measure was used to detect DIF, the FP rates were lower and <1% for a wide number of conditions. In addition, a statistically significant main effect of the sample size variable was obtained. Thus, the FP percentages were higher when the sample size was small (100/100). The results obtained indicate that the use of R² as a measure of effect size together with the statistical significance test reduces the rate of FP.
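
The effect size discussed above can be computed from model log-likelihoods alone. The sketch below shows the standard Cox-Snell and Nagelkerke R² formulas and the difference in Nagelkerke R² between a model with and without the DIF terms. The function names and the log-likelihood values in the example are hypothetical, and no classification thresholds are implied.

```python
import math

def cox_snell_r2(ll_null, ll_model, n):
    """Cox-Snell R^2 from the intercept-only and fitted log-likelihoods."""
    return 1.0 - math.exp(2.0 * (ll_null - ll_model) / n)

def nagelkerke_r2(ll_null, ll_model, n):
    """Nagelkerke R^2: Cox-Snell R^2 rescaled by its maximum attainable value."""
    max_r2 = 1.0 - math.exp(2.0 * ll_null / n)
    return cox_snell_r2(ll_null, ll_model, n) / max_r2

def delta_r2(ll_null, ll_base, ll_dif, n):
    """Change in Nagelkerke R^2 when group (and interaction) terms are added
    to the ability-only model; used as a DIF effect size."""
    return nagelkerke_r2(ll_null, ll_dif, n) - nagelkerke_r2(ll_null, ll_base, n)

# Hypothetical log-likelihoods for one item with n = 100 examinees.
n = 100
ll_null = n * math.log(0.5)   # intercept-only model on a 50/50 item
effect = delta_r2(ll_null, ll_base=-60.0, ll_dif=-55.0, n=n)
```

Flagging an item only when the significance test rejects and this difference exceeds a chosen cutoff is the combined strategy that the abstract reports as reducing false positives.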

  • Research Article
  • Cite Count Icon 26
  • 10.1007/s11336-021-09775-0
Differential Item Functioning Analyses of the Patient-Reported Outcomes Measurement Information System (PROMIS®) Measures: Methods, Challenges, Advances, and Future Directions.
  • Sep 1, 2021
  • Psychometrika
  • Jeanne A Teresi + 4 more

Several methods used to examine differential item functioning (DIF) in Patient-Reported Outcomes Measurement Information System (PROMIS®) measures are presented, including effect size estimation. A summary of factors that may affect DIF detection and challenges encountered in PROMIS DIF analyses, e.g., anchor item selection, is provided. An issue in PROMIS was the potential for inadequately modeled multidimensionality to result in false DIF detection. Section 1 is a presentation of the unidimensional models used by most PROMIS investigators for DIF detection, as well as their multidimensional expansions. Section 2 is an illustration that builds on previous unidimensional analyses of depression and anxiety short-forms to examine DIF detection using a multidimensional item response theory (MIRT) model. The Item Response Theory-Log-likelihood Ratio Test (IRT-LRT) method was used for a real data illustration with gender as the grouping variable. The IRT-LRT DIF detection method is a flexible approach to handle group differences in trait distributions, known as impact in the DIF literature, and was studied with both real data and in simulations to compare the performance of the IRT-LRT method within the unidimensional IRT (UIRT) and MIRT contexts. Additionally, different effect size measures were compared for the data presented in Section 2. A finding from the real data illustration was that using the IRT-LRT method within a MIRT context resulted in more flagged items as compared to using the IRT-LRT method within a UIRT context. The simulations provided some evidence that while unidimensional and multidimensional approaches were similar in terms of Type I error rates, power for DIF detection was greater for the multidimensional approach. Effect size measures presented in Section 1 and applied in Section 2 varied in terms of estimation methods, choice of density function, methods of equating, and anchor item selection. Despite these differences, there was considerable consistency in results, especially for the items showing the largest values. Future work is needed to examine DIF detection in the context of polytomous, multidimensional data. PROMIS standards included incorporation of effect size measures in determining salient DIF. Integrated methods for examining effect size measures in the context of IRT-based DIF detection procedures are still in early stages of development.

  • Research Article
  • Cite Count Icon 9
  • 10.1155/2021/6854477
A Machine Learning Approach to Assess Differential Item Functioning in Psychometric Questionnaires Using the Elastic Net Regularized Ordinal Logistic Regression in Small Sample Size Groups
  • Jan 1, 2021
  • BioMed Research International
  • Vahid Ebrahimi + 3 more

Assessing differential item functioning (DIF) using the ordinal logistic regression (OLR) model highly depends on the asymptotic sampling distribution of the maximum likelihood (ML) estimators. The ML estimation method, which is often used to estimate the parameters of the OLR model for DIF detection, may be substantially biased with small samples. This study is aimed at proposing a new application of the elastic net regularized OLR model, as a special type of machine learning method, for assessing DIF between two groups with small samples. Accordingly, a simulation study was conducted to compare the powers and type I error rates of the regularized and nonregularized OLR models in detecting DIF under various conditions including moderate and severe magnitudes of DIF (DIF = 0.4 and 0.8), sample size (N), sample size ratio (R), scale length (I), and weighting parameter (w). The simulation results revealed that for I = 5 and regardless of R, the elastic net regularized OLR model with w = 0.1, as compared with the nonregularized OLR model, increased the power of detecting moderate uniform DIF (DIF = 0.4) approximately 35% and 21% for N = 100 and 150, respectively. Moreover, for I = 10 and severe uniform DIF (DIF = 0.8), the average power of the elastic net regularized OLR model with 0.03 ≤ w ≤ 0.06, as compared with the nonregularized OLR model, increased approximately 29.3% and 11.2% for N = 100 and 150, respectively. In these cases, the type I error rates of the regularized and nonregularized OLR models were below or close to the nominal level of 0.05. In general, this simulation study showed that the elastic net regularized OLR model outperformed the nonregularized OLR model especially in extremely small sample size groups. Furthermore, the present research provided a guideline and some recommendations for researchers who conduct DIF studies with small sample sizes.

  • Research Article
  • Cite Count Icon 36
  • 10.1207/s15324818ame1703_2
Performance of SIBTEST When the Percentage of DIF Items is Large
  • Jul 1, 2004
  • Applied Measurement in Education
  • Mark J Gierl + 2 more

Differential item functioning (DIF) analyses are used to identify items that operate differently between two groups, after controlling for ability. The Simultaneous Item Bias Test (SIBTEST) is a popular DIF detection method that matches examinees on a true score estimate of ability. However in some testing situations, like test translation and adaptation, the percentage of DIF items can be large. In these situations, the effectiveness of SIBTEST has not been thoroughly evaluated. The problem is addressed in this study. Four variables were manipulated in a simulation study: The amount of DIF on a 40-item test (20%, 40%, and 60% of the items on the test had moderate and large DIF), the direction of DIF (balanced and unbalanced DIF items), sample size (500, 1,000, 1,500, and 2,000 examinees in each group), and ability distribution differences between groups (equal and unequal). Each condition was replicated 100 times to facilitate the computation of the DIF detection rates. The results from the simulation study indicated that SIBTEST yielded adequate DIF detection rates, even when 60% of the items contained DIF, providing DIF was balanced between the reference and focal groups and sample sizes were at least 1,000 examinees per group. SIBTEST also had adequate detection rates in the 20% unbalanced DIF conditions with samples of 1,000 examinees per group. However, SIBTEST had poor detection rates across all 40% and 60% unbalanced DIF conditions. Implications for practice and future directions for research are discussed.

  • Research Article
  • Cite Count Icon 8
  • 10.1111/j.1745-3984.1991.tb00343.x
Influence of Prior Distributions on Detection of DIF
  • Mar 1, 1991
  • Journal of Educational Measurement
  • Allan S Cohen + 2 more

Detection of differential item functioning (DIF) on items intentionally constructed to favor one group over another was investigated on item parameter estimates obtained from two item response theory‐based computer programs, LOGIST and BILOG. Signed‐ and unsigned‐area measures based on joint maximum likelihood estimation, marginal maximum likelihood estimation, and two marginal maximum a posteriori estimation procedures were compared with each other to determine whether detection of DIF could be improved using prior distributions. Results indicated that item parameter estimates obtained using either prior condition were less deviant than when priors were not used. Differences in detection of DIF appeared to be related to item parameter estimation condition and to some extent to sample size.
