Abstract
Multiple-choice questions have become ubiquitous in educational measurement because the format allows for efficient and accurate scoring. Nonetheless, interest in constructed-response formats persists. This interest has driven efforts to develop computer-based scoring procedures that can score these items accurately and efficiently. Early procedures were typically based on surface features of the responses or simple matching procedures, but recent developments in natural language processing have allowed for much more sophisticated approaches. This paper reports on a state-of-the-art methodology for scoring short-answer questions supported by a large language model. More than 35,000 responses to 71 studied items were collected in the context of a high-stakes test for medical students. Aggregated across all responses, the proportion of agreement with human scores ranged from .97 to .99 (depending on specifics such as training sample size). In addition to reporting detailed results, the paper discusses practical issues that require consideration when adopting this type of scoring system.