Abstract

Previous research has shown that a single factor (factor in the sense of factor analysis) gave a very good account of item covariances within TOEFL® sections. This result is consistent with the assumption that the probability that a person passes an item can be modeled as the product of a person parameter and an item parameter. That assumption forms a simple item response model. A subsequent cross‐validation study supported the efficacy of the assumption: the simple model predicted item success as accurately as the three‐parameter logistic (3PL) model and a modified Rasch model.

The purpose of the current study was to extend the comparison of models to an equating context. "Equating" is a statistical process that identifies comparable scores from parallel tests administered to different populations. In an operational context, equating serves to facilitate comparison of scores generated on different forms of a test.

The present study consisted of simulation trials designed to "equate the test to itself"; that is, equating sample data were generated from administrations of identical item sets. This is a useful test of model validity: if the same item sets are used to equate, an accurate equating would identify equal scores as comparable, so any discrepancy between comparable scores signifies error, whether from model misfit or from random error.

Equatings based on each model were carried out under several conditions, and the results were compared. The conditions varied by sample size, anchor test difficulty, and the TOEFL section equated. To compound the difficulty of the equating task, results were based on equating samples that were mismatched in performance on a correlated measure.

Most discrepancies between comparable scores were largest at the extremes. The largest discrepancies between scores identified as comparable occurred for the 3PL and modified Rasch models at the lower extreme scores, and for the simple model at the upper extreme scores. For the 1,000‐case sample, most discrepancies amounted to fractions of a score point. As expected, the 3PL equatings exhibited the largest discrepancies for the 100‐case sample. The simple item response model yielded the most discrepancies in excess of the standard error of measurement, in part because with that model the maximum discrepancies occurred at the top of the score range, where the standard errors of measurement approach zero. Imposing an upper bound on the probability of a correct response in the simple model markedly reduced its errors.

TOEFL scores are used for educational decisions. If it is true that most institutions' cut scores fall in the mid‐score ranges, the present study suggests that the 3PL model should not be used if equating samples are substantially smaller than the present size. The other models are promising for small‐sample equating, with the one‐parameter logistic models being the most promising.
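
For concreteness, the response functions being compared can be sketched as follows. This is an illustrative rendering, not the report's code: the symbols (theta for the person parameter; a, b, c for item parameters), the 1.7 scaling constant, and the upper‐bound cap are conventional assumptions made for exposition.

    import math

    def p_simple(theta, b, upper=1.0):
        """Simple model: the probability of a correct response is the
        product of a person parameter (theta) and an item parameter (b),
        both assumed scaled so the product lies in [0, 1]. The optional
        cap 'upper' mimics the bounded variant that the study found
        markedly reduced errors."""
        return min(theta * b, upper)

    def p_3pl(theta, a, b, c):
        """Three-parameter logistic (3PL): discrimination a, difficulty b,
        and lower asymptote (guessing) c, with the usual 1.7 scaling."""
        return c + (1.0 - c) / (1.0 + math.exp(-1.7 * a * (theta - b)))

    def p_rasch(theta, b):
        """Rasch (one-parameter logistic): item difficulty b only."""
        return 1.0 / (1.0 + math.exp(-(theta - b)))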
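
The logic of the self‐equating check can be sketched in the same spirit. The toy simulation below substitutes a plain equipercentile equating of matched samples for the study's model‐based, anchor‐test equatings of mismatched samples, so it illustrates only the validity argument: when both samples take the identical item set, an accurate equating maps each score to itself, and any residual gap between "comparable" scores is error.

    import math
    import random

    def p_rasch(theta, b):
        return 1.0 / (1.0 + math.exp(-(theta - b)))

    def simulate_scores(n_people, items, mean_theta=0.0):
        """Sorted total scores for one sample answering the given item
        set, with responses generated under the Rasch model."""
        scores = []
        for _ in range(n_people):
            theta = random.gauss(mean_theta, 1.0)
            scores.append(sum(random.random() < p_rasch(theta, b)
                              for b in items))
        return sorted(scores)

    def equipercentile(x_scores, y_scores, x):
        """Score in sample Y deemed comparable to score x in sample X,
        by matching percentile ranks (a simplified stand-in for the
        study's model-based equating procedures)."""
        rank = sum(s <= x for s in x_scores) / len(x_scores)
        index = min(max(int(rank * len(y_scores)) - 1, 0),
                    len(y_scores) - 1)
        return y_scores[index]

    items = [random.gauss(0.0, 1.0) for _ in range(40)]  # one identical item set
    sample_x = simulate_scores(1000, items)              # first equating sample
    sample_y = simulate_scores(1000, items)              # second equating sample
    # With identical items, an accurate equating returns x itself, so any
    # nonzero discrepancy printed here reflects random error.
    for x in (5, 15, 25, 35):
        print(x, equipercentile(sample_x, sample_y, x) - x)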
