Abstract
Background
Computerized adaptive testing (CAT) is being applied to health outcome measures developed as paper-and-pencil (P&P) instruments. Differences in how respondents answer items administered by CAT vs. P&P can increase error in CAT-estimated measures if not identified and corrected.
Method
Two methods for detecting item-level mode effects are proposed, both based on Bayesian estimation of posterior distributions of item parameters: (1) a modified robust Z (RZ) test, and (2) 95% credible intervals (CrI) for the CAT-P&P difference in item difficulty. A simulation study was conducted under the following conditions: (1) data-generating model (one- vs. two-parameter IRT model); (2) moderate vs. large DIF sizes; (3) percentage of DIF items (10% vs. 30%); and (4) mean difference in θ estimates across modes of 0 vs. 1 logits. This yielded a total of 16 conditions with 10 generated datasets per condition.
Results
Both methods evidenced good to excellent false positive control, with RZ providing better control of false positives and CrI providing slightly higher power, irrespective of measurement model. False positives increased when items were very easy to endorse and when there were mode differences in mean trait level. True positives were predicted by CAT item usage, absolute item difficulty, and item discrimination. Overall, RZ outperformed CrI, owing to its better control of false positive DIF.
Conclusions
Whereas false positives were well controlled, particularly for RZ, power to detect DIF was suboptimal. Research is needed to examine the robustness of these methods under varying prior assumptions concerning the distribution of item and person parameters and when data fail to conform to prior assumptions. False identification of DIF when items were very easy to endorse is a problem warranting additional investigation.
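To make the two detection methods concrete, the sketch below shows one plausible implementation in Python. It is a minimal illustration under stated assumptions, not the paper's exact procedure: `robust_z` applies the conventional robust Z form, (d − median(d)) / (0.74 × IQR(d)), to posterior mean item difficulties, whereas the paper's modified RZ test operates on full Bayesian posterior distributions; `cri_flag` checks whether the 95% credible interval for the CAT-P&P difficulty difference excludes zero. All function names, input shapes, and the 1.96 flagging criterion are assumptions for illustration.

```python
import numpy as np

def robust_z(b_cat, b_pp, crit=1.96):
    """Flag items whose CAT-P&P difficulty shift is an outlier.

    b_cat, b_pp: posterior mean item difficulties under each mode
    (hypothetical inputs; the paper's modified test uses full
    posterior distributions rather than point estimates).
    Uses the conventional robust Z: (d - median(d)) / (0.74 * IQR(d)).
    """
    d = np.asarray(b_cat) - np.asarray(b_pp)
    q75, q25 = np.percentile(d, [75, 25])
    rz = (d - np.median(d)) / (0.74 * (q75 - q25))
    return np.abs(rz) > crit  # True = flagged as showing DIF

def cri_flag(b_cat_draws, b_pp_draws, level=0.95):
    """Flag items whose credible interval for the CAT-P&P
    difficulty difference excludes zero.

    b_*_draws: posterior draws, shape (n_draws, n_items).
    """
    diff = np.asarray(b_cat_draws) - np.asarray(b_pp_draws)
    lo, hi = np.percentile(
        diff, [100 * (1 - level) / 2, 100 * (1 + level) / 2], axis=0)
    return (lo > 0) | (hi < 0)  # interval excludes zero
```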
Highlights
Computerized adaptive testing (CAT) is being applied to health outcome measures developed as paper-and-pencil (P&P) instruments
Both methods evidenced good to excellent false positive control, with robust Z (RZ) providing better control of false positives and credible intervals (CrI) providing slightly higher power, irrespective of measurement model
True positives were predicted by CAT item usage, absolute item difficulty and item discrimination
Summary
Computerized adaptive testing (CAT) is widely used in education and has gained acceptance as a mode for administering health outcomes measures [1,2]. CAT offers several potential advantages over conventional (e.g., paper-and-pencil) administration, including automated scoring and storage of questionnaire data and reduction of respondent burden. However, instruments developed for paper-and-pencil administration frequently exhibit mode effects; in other words, items may function differently when administered by CAT than under other assessment modalities. The shift in item parameters resulting from changes in administration mode reflects the presence of differential item functioning (DIF), which can be defined as differential performance (e.g., differences in level of endorsement) of an item between two or more groups matched on the total score or measure [7,8]. This paper focuses on the detection of DIF between CAT and paper-and-pencil administrations of a measure.
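As a worked illustration of how a mode effect manifests as DIF, the short example below uses the two-parameter logistic IRT model, in which a shift in an item's difficulty parameter between administration modes changes the endorsement probability for respondents at the same trait level. The specific parameter values (discrimination 1.2, a 0.5-logit difficulty shift) are hypothetical, chosen only to show the mechanism.

```python
import numpy as np

def p_endorse(theta, a, b):
    """Two-parameter logistic IRT model: probability that a
    respondent at trait level theta endorses an item with
    discrimination a and difficulty b."""
    return 1.0 / (1.0 + np.exp(-a * (theta - b)))

# Hypothetical item: same discrimination under both modes, but
# difficulty shifts by 0.5 logits under CAT relative to P&P --
# a uniform mode effect (DIF).
theta = 0.0            # respondent at the mean trait level
a = 1.2                # discrimination (assumed)
b_pp, b_cat = 0.0, 0.5 # P&P vs. CAT difficulty (assumed)

print(p_endorse(theta, a, b_pp))   # ~0.50 under P&P
print(p_endorse(theta, a, b_cat))  # ~0.35 under CAT
```

For the same respondent, the item is noticeably harder to endorse under CAT; methods such as RZ and CrI aim to detect exactly this kind of parameter shift.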