Abstract

Background
Stakeholders’ interpretations of the findings of large-scale educational assessments can influence important decisions. In the context of educational assessment, standard-setting remains an especially critical element because it is complex and largely unstandardized. Instruments established by means of standard-setting procedures, such as proficiency levels (PLs), therefore appear to be arbitrary to some degree. Owing to the significance such results take on when they are communicated to stakeholders or the public, a thorough validation of this process seems crucial. In our study, ministry stakeholders intended to use PLs established in an assessment of science abilities to obtain information about students’ strengths and weaknesses in science in general and, more specifically, about the extent to which students were prepared for future science studies. The aim of our study was to investigate the validity arguments regarding these two intended interpretations.

Methods
Based on a university science test administered to 3641 upper secondary students (Grade 13), a panel of nine experts set four cut scores using two variations of the Angoff method: the Yes/No Angoff method (multiple-choice items) and the extended Angoff method (complex multiple-choice items). We carried out t-tests, repeated-measures ANOVA, G-studies, and regression analyses to examine the procedural, internal, external, and consequential validity elements regarding the aforementioned interpretations of the cut scores.

Results
Our t-tests and G-studies showed that the intended use of the cut scores was valid regarding procedural and internal aspects of validity. These findings were called into question by the experts’ lack of confidence in the established cut scores. Regression analyses including the number of lessons taught and intended and pursued science-related studies showed good external but poor consequential validity.

Conclusion
The cut scores can be used as an indicator of 13th graders’ strengths and weaknesses in science. They should not be used as an indicator of preparedness for science university studies. Since assessment formats are continually evolving and consequently leading to more complex designs, further research needs to be conducted on the application of new standard-setting methods to meet the challenges arising from this development.
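To make the standard-setting procedure described above more concrete, the sketch below shows one common way Yes/No Angoff and extended Angoff panel judgments are aggregated into cut scores. The panel size matches the study (nine experts), but the item counts, point scales, and ratings are hypothetical placeholders; this is a minimal illustration of the general technique, not the authors' actual analysis code.

  import numpy as np

  # Hypothetical Yes/No Angoff ratings: 9 panelists x 20 multiple-choice items.
  # Each entry is 1 if the panelist judges that a minimally competent student
  # would answer the item correctly, otherwise 0.
  rng = np.random.default_rng(seed=0)
  yes_no_ratings = rng.integers(0, 2, size=(9, 20))

  # A panelist's individual cut score is the number of "yes" judgments;
  # the panel cut score is typically the mean (or median) across panelists.
  panelist_cuts_mc = yes_no_ratings.sum(axis=1)
  panel_cut_mc = panelist_cuts_mc.mean()

  # Extended Angoff for complex multiple-choice (polytomous) items: panelists
  # estimate the points a minimally competent student would earn on each item
  # (here 0-3 points on 5 hypothetical items); estimates are summed per
  # panelist and averaged across the panel.
  extended_ratings = rng.integers(0, 4, size=(9, 5))
  panel_cut_cmc = extended_ratings.sum(axis=1).mean()

  print(f"Yes/No Angoff cut score (MC items): {panel_cut_mc:.1f}")
  print(f"Extended Angoff cut score (complex MC items): {panel_cut_cmc:.1f}")

In practice, a separate aggregation of this kind would be carried out for each of the four cut scores, and the variability of the panelist-level cut scores feeds into the G-study and internal-validity evidence mentioned in the Results.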

Highlights

  • Most large-scale assessments (LSA) aim to provide findings on the effectiveness of a school system to different stakeholders

  • The cut scores can be used as an indicator of 13th graders’ strengths and weaknesses in science. They should not be used as an indicator for preparedness for science university studies

  • Since assessment formats are continually evolving and leading to more complex designs, further research needs to be conducted on the application of new standard-setting methods to meet the challenges arising from this development



Introduction

Most large-scale assessments (LSA) aim to provide findings on the effectiveness of a school system to different stakeholders (e.g., ministry personnel, teachers, or schools). Within these accountability programs, results are almost always reported in a standard-based way (Haertel 2002). The tests, methods, and analyses for these accountability programs are becoming more and more complex, which poses new challenges for standard-setting procedures. This article describes a validity study that investigates validity arguments for the interpretation of cut scores derived from a standard-setting procedure applied to a science ability scale of 13th graders. The aim of our study was to investigate the validity arguments regarding the two intended interpretations of these cut scores.

