Abstract

Theory: When test developers have a limited number of test questions available or when the equating design requires some item overlap across forms, psychometricians worry that examinees who encounter previously seen questions on subsequent test forms may be able to inflate their test score due to their familiarity with the repeated test questions.
Hypotheses: Prior exposure to test questions may lead to contamination and inflated scores. This research seeks to detect if examinees' scores were inflated due to prior exposure to test questions and, if so, whether those increases were significant.
Method: The sample for this study consisted of candidates who took the American Board of Family Medicine's certification examination twice in a single year (n = 988). Examinees were randomly assigned one of two forms for their first attempt and received the other form for their repeat test. There were 99 questions in common across both forms. The Rasch model was used to estimate examinee ability. Performance changes on the common questions and unique questions were compared and repeated measures t tests were performed to establish whether score changes were likely to have occurred by chance.
Results: On average, the examinees increased their overall ability estimate by .187 logits on the repeat attempt. The repeated measures t tests indicate this difference was statistically significant, t(987) = −25.298, p < .001, α = .05. The mean difference between the examinees' ability estimate on common and unique items for their first attempt was not statistically significant, t(987) = .264, p = .792, α = .05; however, the mean difference between common and unique items on the second attempt (0.029 logits) was statistically significant, t(987) = 3.28, p = .001, α = .05.
Conclusions: Some of the increase in the examinees' overall ability estimate may be attributed to a general increase in the latent trait; however, there was a small but detectable increase that could be attributed to prior exposure to the questions. On average, about 15% of the repeated questions were changed from wrong to right, but about 11% of questions were changed from right to wrong, suggesting that examinees may occasionally be using prior exposure to their benefit but general guessing accounts for more of the changes. The impact of the mean difference between the common and unique item scores (0.029 logits) is trivial at the individual level; however, such a bias among the population of repeat testers could be problematic if a small subset of examinees were using a “remember–research–retest” strategy to obtain nontrivial score increases.
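
The repeated measures t test named in the Method can be sketched as follows. This is a minimal illustration, not the authors' analysis code: the function name `paired_t` and the five pairs of ability estimates are made up for the example, and the sign convention (first attempt minus second attempt) is an assumption, chosen so that a gain on the repeat attempt yields a negative t, in line with the sign of the reported overall result.

```python
import math
import statistics


def paired_t(first, second):
    """Return (t statistic, degrees of freedom) for a paired t test.

    Differences are taken as first minus second (an assumed convention).
    """
    if len(first) != len(second) or len(first) < 2:
        raise ValueError("need two equal-length samples with at least 2 pairs")
    diffs = [a - b for a, b in zip(first, second)]
    n = len(diffs)
    mean_diff = statistics.fmean(diffs)
    sd_diff = statistics.stdev(diffs)  # sample SD, n - 1 in the denominator
    if sd_diff == 0:
        raise ValueError("all differences are identical; t is undefined")
    t_stat = mean_diff / (sd_diff / math.sqrt(n))
    return t_stat, n - 1


# Made-up ability estimates (in logits) for five hypothetical repeat examinees.
first_attempt = [0.10, 0.45, -0.20, 0.80, 0.30]
second_attempt = [0.35, 0.60, 0.05, 0.90, 0.55]

t_stat, df = paired_t(first_attempt, second_attempt)
print(f"t({df}) = {t_stat:.3f}")
```

With n paired observations the test has n − 1 degrees of freedom, which is why the 988 repeat examinees in the study give t(987).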
