Abstract
To meet the needs of the Test of English as a Foreign Language (TOEFL®) constituencies, the TOEFL program is sponsoring a development project known as TOEFL 2000. Drawing on current linguistic theory and models of communicative competence, the new test or test battery developed by the TOEFL 2000 project will likely be designed to test all four language skills (reading, writing, listening, and speaking) in an integrated fashion. One compromise position on the integration of skills, however, is to test reading and writing together and listening and speaking together. It is also assumed that the test will be largely performance-based, meaning that a substantial portion of the items will likely be constructed-response items and that an examinee's score on such items will fall into one of multiple ordered categories.

Two groups of item response theory (IRT) models have been developed to calibrate items with multiple ordered categories (i.e., polytomously scored items): (a) the partial credit model (Masters, 1982) and the generalized partial credit model (Muraki, 1992); and (b) the graded response model (Samejima, 1969, 1972). These models have been used jointly with the dichotomous three-parameter logistic (3PL) IRT model to concurrently calibrate dichotomously and polytomously scored items for the National Assessment of Educational Progress (NAEP). However, the performance of these polytomous IRT models, and of the concurrent calibration of dichotomously and polytomously scored items, has not been investigated with data from the TOEFL examinee population.

The purpose of this study was to examine how well a combination of dichotomous and polytomous IRT models performs with TOEFL data.
TOEFL Vocabulary and Reading Comprehension and Test of Written English (TWE®) items, and TOEFL Listening Comprehension and Test of Spoken English (TSE®) items were concurrently calibrated using a combination of the generalized partial credit model and the 3PL IRT model. The two sets of combined items were also concurrently calibrated using a combination of the graded response model and the 3PL IRT model.

The results of this study indicate that data from a reading/writing combination made up of the TOEFL Vocabulary and Reading Comprehension section and the TWE were reasonably well fit by a combination of the 3PL and generalized partial credit models or 3PL and graded response models. In a similar fashion, data for a listening/speaking combination made up of the TOEFL Listening Comprehension section and selected tasks from the TSE were also reasonably well fit by the 3PL/generalized partial credit and 3PL/graded response model combinations.

A variety of comparisons across the generalized partial credit and graded response models seem to indicate some preference for using the generalized partial credit model when PARSCALE is used as the calibration program. The results of this study provide useful information about test construction and item calibration procedures that might later be used for the TOEFL 2000 project.
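
For readers unfamiliar with the three models named in this abstract, the sketch below shows the response probabilities each one assigns, using common textbook forms of the 3PL, generalized partial credit, and graded response models. It is an illustration only, not the calibration procedure used in the study: the function names, the scaling constant D = 1.7, and all parameter values are assumptions, and PARSCALE's own parameterization may differ in detail.

```python
import math

D = 1.7  # assumed logistic scaling constant; not stated in the abstract


def p_3pl(theta, a, b, c, d=D):
    """3PL: probability of a correct response to a dichotomous item.

    a = discrimination, b = difficulty, c = lower asymptote (guessing).
    """
    return c + (1.0 - c) / (1.0 + math.exp(-d * a * (theta - b)))


def gpcm_probs(theta, a, steps, d=D):
    """Generalized partial credit model: probabilities for categories 0..m.

    steps holds the m step parameters b_1..b_m; category 0 has exponent 0.
    """
    exponents = [0.0]
    for b_v in steps:
        exponents.append(exponents[-1] + d * a * (theta - b_v))
    peak = max(exponents)  # subtract the maximum for numerical stability
    weights = [math.exp(z - peak) for z in exponents]
    total = sum(weights)
    return [w / total for w in weights]


def grm_probs(theta, a, thresholds, d=D):
    """Graded response model: probabilities for categories 0..m.

    thresholds holds ordered boundaries b_1 < ... < b_m. Each category
    probability is the difference of adjacent cumulative probabilities.
    """
    cumulative = [1.0]
    for b_k in thresholds:
        cumulative.append(1.0 / (1.0 + math.exp(-d * a * (theta - b_k))))
    cumulative.append(0.0)
    return [cumulative[k] - cumulative[k + 1] for k in range(len(thresholds) + 1)]


if __name__ == "__main__":
    theta = 0.5  # hypothetical examinee ability
    print("3PL :", p_3pl(theta, a=1.0, b=0.0, c=0.2))
    print("GPCM:", gpcm_probs(theta, a=1.0, steps=[-1.0, 0.0, 1.0]))
    print("GRM :", grm_probs(theta, a=1.0, thresholds=[-1.0, 0.0, 1.0]))
```

In a concurrent calibration of the kind described above, dichotomous items would contribute likelihood terms of the first form and polytomous items terms of the second or third form, all sharing the same ability scale.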