ABSTRACT

To meet the needs of the Test of English as a Foreign Language (TOEFL®) constituencies, the TOEFL program is sponsoring a development project known as TOEFL 2000. Because the project draws on current linguistic theory and models of communicative competence, the new test or test battery it develops will likely be designed to test all four language skills (reading, writing, listening, and speaking) in an integrated fashion. One compromise position on the integration of skills, however, is to test reading and writing together and listening and speaking together. It is also assumed that the test will be largely performance‐based: a substantial portion of the items will likely be constructed‐response items, and an examinee's score on each such item will fall into one of multiple ordered categories.

Two groups of item response theory (IRT) models have been developed to calibrate items with multiple ordered categories (i.e., polytomously scored items): (a) the partial credit model (Masters, 1982) and the generalized partial credit model (Muraki, 1992); and (b) the graded response model (Samejima, 1969, 1972). These models have been used jointly with the dichotomous three‐parameter logistic (3PL) IRT model to concurrently calibrate dichotomously and polytomously scored items for the National Assessment of Educational Progress (NAEP). However, neither the performance of these polytomous IRT models nor the concurrent calibration of dichotomously and polytomously scored items has been investigated with data from the TOEFL examinee population.

The purpose of this study was to understand how well a combination of dichotomous and polytomous IRT models performs with TOEFL data.
TOEFL Vocabulary and Reading Comprehension items and Test of Written English (TWE®) items were concurrently calibrated, as were TOEFL Listening Comprehension items and Test of Spoken English (TSE®) items, using a combination of the generalized partial credit model and the 3PL IRT model. The two sets of combined items were also concurrently calibrated using a combination of the graded response model and the 3PL IRT model.

The results of this study indicate that data from a reading/writing combination made up of the TOEFL Vocabulary and Reading Comprehension section and the TWE were reasonably well fit by a combination of the 3PL and generalized partial credit models or of the 3PL and graded response models. Similarly, data for a listening/speaking combination made up of the TOEFL Listening Comprehension section and selected tasks from the TSE were reasonably well fit by the 3PL/generalized partial credit and 3PL/graded response model combinations.

A variety of comparisons across the generalized partial credit and graded response models seem to indicate some preference for the generalized partial credit model when PARSCALE is used as the calibration program. The results of this study provide useful information about test construction and item calibration procedures that might later be used for the TOEFL 2000 project.
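
For readers unfamiliar with the three model families named in this abstract, the following is a minimal Python sketch of their response functions in a common textbook parameterization. It is an illustration only, not the study's calibration procedure: the function names (`p_3pl`, `gpcm_probs`, `grm_probs`), the parameter names, and the scaling constant `D = 1.7` are assumptions introduced here, and the exact parameterization used by PARSCALE may differ.

```python
import math

# Scaling constant commonly used to approximate the normal ogive.
# Assumption for this sketch; the abstract does not state a value.
D = 1.7


def p_3pl(theta, a, b, c):
    """Probability of a correct response under the 3PL model.

    a: discrimination, b: difficulty, c: lower asymptote (guessing).
    """
    return c + (1.0 - c) / (1.0 + math.exp(-D * a * (theta - b)))


def gpcm_probs(theta, a, steps):
    """Category probabilities (0..m) under the generalized partial credit model.

    steps: the m step parameters b_1..b_m (hypothetical name).
    """
    # Cumulative sums of D*a*(theta - b_v); category 0 is fixed at 0 by convention.
    z = [0.0]
    for b in steps:
        z.append(z[-1] + D * a * (theta - b))
    # Subtract the maximum before exponentiating for numerical stability.
    z_max = max(z)
    numerators = [math.exp(v - z_max) for v in z]
    total = sum(numerators)
    return [n / total for n in numerators]


def grm_probs(theta, a, thresholds):
    """Category probabilities (0..m) under the graded response model.

    thresholds: the m category boundaries, in increasing order.
    """
    # Cumulative probabilities of scoring in category k or higher,
    # padded with 1 (category 0 or higher) and 0 (above the top category).
    cumulative = [1.0]
    cumulative += [1.0 / (1.0 + math.exp(-D * a * (theta - b))) for b in thresholds]
    cumulative.append(0.0)
    # Each category probability is the difference of adjacent cumulative curves.
    return [cumulative[k] - cumulative[k + 1] for k in range(len(thresholds) + 1)]
```

In a concurrent calibration of the kind described above, dichotomous items would contribute likelihood terms of the first form and polytomous items terms of the second or third form, all on a shared ability scale `theta`.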