How Reliable Is It to Automatically Score Open-Ended Items? An Application in the Turkish Language

İbrahim Uysal, Nuri Doğan

doi:10.21031/epod.817396

Abstract

The use of open-ended items, especially in large-scale tests, makes scoring difficult. This problem can be addressed by scoring open-ended items automatically. The aim of this study was to examine the reliability of the data obtained by scoring open-ended items automatically. One objective was to compare machine-learning algorithms for automated scoring: support vector machines (SVM), logistic regression (LR), multinomial Naive Bayes (MNB), long short-term memory (LSTM), and bidirectional long short-term memory (BLSTM). The other objective was to examine how the reliability of automated scoring changes with the proportion of data used to test the automated scoring system (33%, 20%, and 10%). The reliability of automated scoring was compared with the reliability of the data obtained from human raters. This study, the first attempt at automated scoring of open-ended items in Turkish, used Turkish test data from the Academic Skills Monitoring and Evaluation (ABIDE) program administered by the Ministry of National Education. Cross-validation was used to test the system. Three coefficients of agreement served as evidence of reliability: the percentage of agreement; the quadratic-weighted kappa, which is widely used in automated scoring studies; and Gwet's AC1 coefficient, which is not affected by the prevalence problem that arises when ratings are unevenly distributed across categories. The results showed that automated scoring algorithms can be used. Bidirectional long short-term memory was the best-performing algorithm for automated scoring. Long short-term memory and multinomial Naive Bayes performed worse than support vector machines, logistic regression, and bidirectional long short-term memory.
In automated scoring, the coefficients of agreement at the 33% test data rate were slightly lower than those at the 10% and 20% rates, but remained within the desired range.
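The three coefficients of agreement named above can be illustrated with a short Python sketch. It is a minimal illustration, not the study's code: it assumes two raters (for example, a human rater and an automated scorer) assign integer scores 0 to K − 1 to the same responses, and the function names and toy score lists are hypothetical. Quadratic-weighted kappa uses disagreement weights (i − j)²/(K − 1)², and Gwet's AC1 uses the two-rater chance-agreement term p_e = Σ_k π_k(1 − π_k)/(K − 1), where π_k is the share of all ratings (from both raters) that fall in category k.

```python
from collections import Counter
from typing import Sequence


def _check(a: Sequence[int], b: Sequence[int], num_categories: int = 2) -> None:
    # Assumption: scores are integers 0..K-1 and both raters scored the same responses.
    if len(a) != len(b) or not a:
        raise ValueError("score lists must be non-empty and of equal length")
    if num_categories < 2:
        raise ValueError("at least two score categories are required")


def percent_agreement(a: Sequence[int], b: Sequence[int]) -> float:
    """Share of responses that received the same score from both raters."""
    _check(a, b)
    return sum(x == y for x, y in zip(a, b)) / len(a)


def quadratic_weighted_kappa(a: Sequence[int], b: Sequence[int], num_categories: int) -> float:
    """Cohen's kappa with quadratic disagreement weights (i - j)^2 / (K - 1)^2."""
    _check(a, b, num_categories)
    n, k = len(a), num_categories
    observed = [[0] * k for _ in range(k)]
    for x, y in zip(a, b):
        observed[x][y] += 1
    hist_a, hist_b = Counter(a), Counter(b)
    weighted_observed = weighted_expected = 0.0
    for i in range(k):
        for j in range(k):
            w = (i - j) ** 2 / (k - 1) ** 2
            weighted_observed += w * observed[i][j]
            # Expected count under independence of the two raters' marginals.
            weighted_expected += w * hist_a[i] * hist_b[j] / n
    return 1.0 - weighted_observed / weighted_expected


def gwet_ac1(a: Sequence[int], b: Sequence[int], num_categories: int) -> float:
    """Gwet's AC1 for two raters: (p_a - p_e) / (1 - p_e)."""
    _check(a, b, num_categories)
    n, k = len(a), num_categories
    p_a = percent_agreement(a, b)
    pooled = Counter(a) + Counter(b)  # category counts over both raters (2n ratings)
    p_e = sum((pooled[c] / (2 * n)) * (1 - pooled[c] / (2 * n)) for c in range(k)) / (k - 1)
    return (p_a - p_e) / (1 - p_e)


if __name__ == "__main__":
    human = [0, 1, 2, 2, 1, 0]    # hypothetical human scores
    machine = [0, 1, 2, 1, 1, 0]  # hypothetical automated scores
    print(round(percent_agreement(human, machine), 3))            # 0.833
    print(round(quadratic_weighted_kappa(human, machine, 3), 3))  # 0.857
    print(round(gwet_ac1(human, machine, 3), 3))                  # 0.753
```

Because AC1 estimates chance agreement from how often each category is used overall, rather than from the product of each rater's separate marginals, it is generally regarded as less sensitive than kappa when most responses fall into a single score category.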

Highlights

Individuals experience numerous tests throughout their lives
The research compared automated scoring algorithms across different proportions of data used to test the system
SVM, LR, MNB, LSTM, and BLSTM algorithms were compared with each other according to 10%, 20%, and 33% test data rates
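The abstract states that cross-validation was used but does not describe the setup. One natural reading, offered here only as an assumption, is that a 33%, 20%, or 10% test data rate corresponds to 3-, 5-, or 10-fold cross-validation, where each fold serves once as the test set. The hypothetical helper `k_fold_indices` below sketches that split in plain Python; it is not the authors' procedure, and it does not stratify by score category.

```python
import random
from typing import List, Tuple


def k_fold_indices(n_samples: int, test_rate: float, seed: int = 0) -> List[Tuple[List[int], List[int]]]:
    """Hypothetical helper: split indices into k = round(1 / test_rate) folds.

    Returns one (train, test) pair per fold; every index appears in exactly
    one test set, so each fold holds roughly test_rate of the data.
    """
    if not 0 < test_rate < 1:
        raise ValueError("test_rate must be between 0 and 1")
    k = round(1 / test_rate)  # 0.33 -> 3, 0.20 -> 5, 0.10 -> 10
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # fixed seed for reproducibility
    folds = [indices[i::k] for i in range(k)]
    splits = []
    for i in range(k):
        test = sorted(folds[i])
        train = sorted(idx for j, fold in enumerate(folds) if j != i for idx in fold)
        splits.append((train, test))
    return splits


if __name__ == "__main__":
    for rate in (0.33, 0.20, 0.10):
        splits = k_fold_indices(300, rate)
        print(f"{rate:.0%} test rate -> {len(splits)} folds, {len(splits[0][1])} test items per fold")
```

With 300 responses this yields 3 folds of 100, 5 folds of 60, and 10 folds of 30 test items; a scoring model would be trained on each `train` list and its scores compared with human scores on the matching `test` list.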

Summary

Introduction

Individuals experience numerous tests throughout their lives. Tests reveal differences in individuals' knowledge, skills, and abilities. Using more than one item format in a test has become increasingly popular. In this approach, referred to as a mixed-format test, open-ended items with or without restricted responses are used alongside multiple-choice items. In a multiple-choice item, individuals are presented with one correct answer and several incorrect answers to a problem. Using only multiple-choice items in tests affects the teaching and learning process and leads individuals to study for multiple-choice tests. This can restrict original, critical, and higher-order thinking skills. The use of open-ended items can overcome this limitation.



Journal: Eğitimde ve Psikolojide Ölçme ve Değerlendirme Dergisi	Publication Date: Mar 31, 2021
Citations: 1	License type: cc-by

Similar Papers

Sentiment Analysis of Public Acceptance of Covid-19 Vaccines Types in Indonesia using Naïve Bayes, Support Vector Machine, and Long Short-Term Memory (LSTM)
Dinar Ajeng Kristiyanti ... Sri Hardani
Jurnal RESTI (Rekayasa Sistem dan Teknologi Informasi) | VOL. 7
05 Jun 2023

Performance comparison of solar radiation forecasting between WRF and LSTM in Gifu, Japan
Jose Manuel Soares De Araujo
Environmental Research Communications | VOL. 2
01 Apr 2020

Research on machine learning algorithms and feature extraction for time series
Lei Li ... Yabin Wu
01 Oct 2017

Empowering Digital Civility with an NLP Approach for Detecting Twitter Cyberbullying through Boosted Ensembles
Senthil Prabakaran ... Nagarajan Jeyaraman
ACM Transactions on Asian and Low-Resource Language Information Processing
07 Oct 2024
