Abstract

In international large-scale assessments, comparisons of student performance across educational systems are frequently made to assess the state and development of different domains. These results often have a large impact on educational policy and on perceptions of an educational system’s performance. Early assessments, such as the First and Second International Science Studies (FISS and SISS), have been used alongside recent studies to construct common scales for investigating changes in constructs over time. System comparisons implicitly assume that the measures are valid, reliable, and comparable; however, these assumptions have not always been investigated thoroughly. This study investigates the validity and cross-system comparability of scores from FISS and SISS, conducted by the International Association for the Evaluation of Educational Achievement in 1970–1971 and 1983–1984. Findings based on item response theory (IRT) modeling indicate that scores in most educational systems can be viewed as reliable measures of a single science construct, supporting the validity of test score interpretations within each system individually. In a robust assessment of measurement invariance using standard IRT methods, an alignment-based method, and the root mean square difference (RMSD) fit statistic, we demonstrate that measurement invariance is violated across systems. The alignment-based method identified a well-fitting model with complex restrictions, but no item exhibited invariance across all systems, a result supported by the RMSD statistics. These results call into question the appropriateness of score comparisons across systems in FISS and SISS. We discuss the implications of these results and outline consequences for score comparisons across time.