Abstract

Academic achievements are often assessed in written exams and tests using selection-type (e.g., multiple-choice, MC) and supply-type (e.g., constructed-response, CR) item response formats. The present article examines how MC items and CR items differ with regard to reliability and criterion validity in two educational large-scale assessments with 4th-graders. The reading items of PIRLS 2006 were compiled into MC scales, CR scales, and mixed scales. Scale reliabilities were estimated according to item response theory (international PIRLS sample; n = 119,413). MC showed smaller standard errors than CR around the reading proficiency mean, whereas CR was more reliable at low and high proficiency levels. In the German sample (n = 7,581), there was no format-specific differential validity (criterion: German grades; r ≈ .5, Δr = 0.01). The mathematics items of TIMSS 2007 (n = 160,922) showed similar reliability patterns. MC validity was slightly larger than CR validity (criterion: mathematics grades; n = 5,111; r ≈ .5, Δr = −0.02). Effects of format-specific test extensions were very small in both studies. It seems that in PIRLS and TIMSS, reliability and validity do not depend substantially on response formats. Consequently, other characteristics of the response formats (such as the costs of development, administration, and scoring) should be considered when choosing between MC and CR.
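
The proficiency-dependent standard errors described above follow from the usual IRT relation between the conditional standard error of a proficiency estimate and the test information function. As a minimal sketch, assuming a simple two-parameter logistic (2PL) model with k items, discriminations a_i, and difficulties b_i purely for illustration (the operational PIRLS/TIMSS scaling models are more elaborate):

\[
\mathrm{SE}(\hat{\theta}) = \frac{1}{\sqrt{I(\theta)}}, \qquad I(\theta) = \sum_{i=1}^{k} I_i(\theta),
\]
\[
I_i(\theta) = a_i^{2}\, P_i(\theta)\bigl(1 - P_i(\theta)\bigr), \qquad P_i(\theta) = \frac{1}{1 + \exp\bigl(-a_i(\theta - b_i)\bigr)}.
\]

Each item contributes most information near its own difficulty b_i, so a scale whose informative items cluster around the proficiency mean yields small standard errors in that region, whereas items targeted toward the extremes improve precision for low- and high-proficiency students; this is consistent with the MC/CR pattern reported above.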
