Abstract
To validly assess teachers’ pedagogical content knowledge (PCK), performance-based tasks with open-response formats are required. Automated scoring is considered an appropriate approach to reduce the resource intensity of human scoring and to achieve more consistent scoring results than human raters. This study focuses on the comparability of human and automated scoring of PCK for economics teachers. The answers of (prospective) teachers (N = 852) to six open-response tasks from a standardized and validated test were scored by two trained human raters and the engine "Educational SCoRIng TOolkit" (ESCRITO). The average agreement between human and computer ratings, κw = .66, suggests convergent validity of the scoring results. The results of the single-factor analysis of variance show a significant influence of the answers from each homogeneous subgroup (students: n = 460; trainees: n = 230; in-service teachers: n = 162) on the automated scoring. Findings are discussed in terms of implications for the use of automated scoring in educational assessment and its potentials and limitations.
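For readers who want to see how the two statistics above are computed in principle, the following is a minimal sketch, not the authors' pipeline: it computes a weighted Cohen's kappa between a human rater and an engine, and a single-factor ANOVA of automated scores across the three subgroups. The toy data, the group means, the 0-3 score range, and the quadratic weighting scheme are all assumptions; the abstract does not specify them.

```python
# Minimal sketch (invented toy data, not the study's data or pipeline).
# Assumptions: scores are ordinal integers 0-3 and kappa is quadratically
# weighted; the abstract does not state the weighting scheme.
import numpy as np
from scipy.stats import f_oneway
from sklearn.metrics import cohen_kappa_score

rng = np.random.default_rng(0)

# Toy ratings of the same 100 responses by a human rater and the engine.
human = rng.integers(0, 4, size=100)
engine = np.clip(human + rng.integers(-1, 2, size=100), 0, 3)

# Weighted kappa: the human-machine agreement statistic reported as kappa_w.
kappa_w = cohen_kappa_score(human, engine, weights="quadratic")
print(f"weighted kappa: {kappa_w:.2f}")

# Single-factor ANOVA: does the automated score differ across subgroups?
# Group sizes mirror the abstract (460 students, 230 trainees, 162 teachers);
# the group means here are invented purely for illustration.
students, trainees, teachers = (rng.normal(m, 1.0, n)
                                for m, n in [(2.0, 460), (2.3, 230), (2.5, 162)])
F, p = f_oneway(students, trainees, teachers)
print(f"F = {F:.2f}, p = {p:.4f}")
```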
Highlights
Teaching a subject requires teachers to make the structure and meaning of the learning content accessible to learners, taking into account their individual learning prerequisites and needs (Kersting et al., 2014; Wilson et al., 2018).
Automated scoring is subdivided into the scoring of essays by “automated essay scoring” (AES) and the scoring of short-response texts by “automated short answer scoring” (ASAS) (Riordan et al., 2017).
With regard to RQ1, the results show almost perfect agreement between the two human raters across all samples (2011: κw = .87; 2018: κw = .91; 2011/2018: κw = .89) (Table 3).
Summary
Teaching a subject requires teachers to make the structure and meaning of the learning content accessible to learners, taking into account their individual learning prerequisites and needs (Kersting et al., 2014; Wilson et al., 2018). To validly assess PCK, performance-based tasks with open-response formats are required (Alonzo et al., 2012; Zlatkin-Troitschanskaia et al., 2019), where test takers can describe their instructional approaches to teaching situations (Shavelson, 2009; Liu et al., 2016). The scoring of open responses by human raters is a resource-intensive process (Dolan and Burling, 2012; Zhang, 2013) and can lead to inconsistencies in the test scores due to personal rater biases, which limits objective, reliable and valid measurement (Bejar, 2012; Liu et al., 2014). Automated scoring is considered an approach to reduce the resource intensity of scoring and achieve more consistent scoring results (Shermis et al., 2013; Zhang, 2013; Almond, 2014; Burrows et al., 2015). Differences between human and computer-based scoring may exist due to personal and dataset-related influences, for instance, gender or response length, or because of limitations of computer-based modeling (Bridgeman et al., 2012; Ramineni et al., 2012a, 2012b; Perelman, 2014; Zehner et al., 2018).
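As a hedged illustration of how such dataset-related influences can be checked, the sketch below correlates response length with the engine-minus-human score difference on invented toy data; a substantial nonzero correlation would indicate a length-related scoring bias. The variable names, the simulated length effect, and the use of a Pearson correlation are assumptions for illustration only, not the study's actual analysis.

```python
# Minimal sketch of a length-bias check (toy data, not the study's analysis):
# correlate response length with the engine-minus-human score difference.
import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(1)
length = rng.integers(20, 300, size=200)   # response length in words (toy)
human = rng.integers(0, 4, size=200)       # human rating on a 0-3 scale (toy)
# Toy engine scores that drift slightly upward for longer answers,
# simulating the kind of length effect discussed above.
engine = np.clip(human + (length > 200) * rng.integers(0, 2, size=200), 0, 3)

r, p = pearsonr(length, engine - human)
print(f"length vs. score difference: r = {r:.2f}, p = {p:.4f}")
```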