This article investigates the comparison of two groups based on the two-parameter logistic item response model. It is assumed that there is random differential item functioning (DIF) in item difficulties and item discriminations. The group difference is estimated using separate calibration with subsequent linking, as well as concurrent calibration. The following linking methods are compared: mean-mean linking, log-mean-mean linking, invariance alignment, Haberman linking, asymmetric and symmetric Haebara linking, different recalibration linking methods, anchored item parameters, and concurrent calibration. It is analytically shown that log-mean-mean linking and mean-mean linking provide consistent estimates if random DIF effects have zero means. The performance of the linking methods was evaluated through a simulation study. It turned out that (log-)mean-mean and Haberman linking performed best, followed by symmetric Haebara linking and a newly proposed recalibration linking method. Interestingly, linking methods frequently found in applications (i.e., asymmetric Haebara linking, a variant of recalibration linking used in current large-scale assessment studies, anchored item parameters, and concurrent calibration) performed worse in the presence of random DIF. In line with the previous literature, differences between linking methods turned out to be negligible in the absence of random DIF. The different linking methods were also applied in an empirical example that linked PISA 2006 to PISA 2009 for Austrian students. This application showed that estimated trends in the means and standard deviations depended on the chosen linking method and the employed item response model.
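To make the two simplest methods concrete, the following is a minimal sketch of mean-mean and log-mean-mean linking. It is an illustration under stated assumptions, not the article's own implementation: it assumes the parametrization P(X = 1 | θ) = logistic(a(θ − b)), that each group was calibrated separately with its ability distribution standardized to mean 0 and standard deviation 1, and that the first group serves as the reference. The function names and item parameter values are hypothetical.

```python
import numpy as np

def mean_mean_linking(a1, b1, a2, b2):
    """Mean-mean linking: estimate the mean and SD of group 2 on the
    scale of group 1 from separately calibrated 2PL item parameters."""
    sigma = np.mean(a2) / np.mean(a1)        # ratio of mean discriminations
    mu = np.mean(b1) - sigma * np.mean(b2)   # shift that aligns mean difficulties
    return mu, sigma

def log_mean_mean_linking(a1, b1, a2, b2):
    """Log-mean-mean linking: same idea, but the SD is estimated from
    the difference in mean log discriminations."""
    sigma = np.exp(np.mean(np.log(a2)) - np.mean(np.log(a1)))
    mu = np.mean(b1) - sigma * np.mean(b2)
    return mu, sigma

# Hypothetical common item parameters (no DIF) and a hypothetical true
# group difference, chosen only for illustration.
a = np.array([0.8, 1.0, 1.2, 1.5])
b = np.array([-1.0, -0.3, 0.4, 1.1])
mu_true, sigma_true = 0.3, 1.2

# Parameters as they would appear after separate calibration, assuming
# each group's ability distribution is standardized to N(0, 1).
a1, b1 = a, b
a2, b2 = a * sigma_true, (b - mu_true) / sigma_true

mm = mean_mean_linking(a1, b1, a2, b2)
lmm = log_mean_mean_linking(a1, b1, a2, b2)
print(mm, lmm)
```

In this DIF-free toy setting both functions recover the assumed group mean and standard deviation, and they differ only in how the discriminations are averaged. With random DIF added to `a2` and `b2`, the estimates would vary around the true values from one item set to another.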