Cross-Prompt Based Automatic Short Answer Grading System

Abstract

Research on Automatic Short Answer Grading (ASAG) has shown promising results in recent years. However, several important research gaps remain. Based on the literature review, the researchers identify two critical issues. First, the majority of ASAG models are trained and tested on responses to the same prompt, which raises concerns about their robustness across different prompts. Second, many existing approaches treat the grading task as a binary classification problem. This research aims to bridge these gaps by developing an ASAG system that closely reflects real-world assessment scenarios through a multiclass classification approach and cross-prompt evaluation. This is implemented by training the proposed models on 1,505 responses across 9 prompts and testing on 175 responses from 3 distinct prompts. The grading task is addressed using regression and classification techniques, including Linear Regression, Logistic Regression, Extreme Gradient Boosting (XGBoost), Adaptive Boosting (AdaBoost), and K-Nearest Neighbors (as a baseline). The grades are categorized into five classes, represented by grades A to E. Both manual and algorithmic data augmentation techniques, including the Synthetic Minority Oversampling Technique (SMOTE), are employed to address class imbalance in the sample data. Across multiple testing scenarios, all five models demonstrate consistent performance, with Linear Regression outperforming the others. During validation, it achieves a high accuracy of 0.93, indicating its ability to correctly classify the responses. In the testing phase, it achieves a weighted F1-score of 0.79, a macro-averaged F1-score of 0.75, and an RMSE of 0.45. The results suggest relatively low prediction error.
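The evaluation metrics named above (weighted and macro-averaged F1-score over five grade classes, plus RMSE) can be sketched in plain Python. The numeric grade mapping A=4 down to E=0 is an illustrative assumption, not taken from the paper.

```python
from collections import Counter

GRADES = ["A", "B", "C", "D", "E"]
GRADE_VALUE = {g: 4 - i for i, g in enumerate(GRADES)}  # assumed mapping: A=4 ... E=0

def f1_per_class(y_true, y_pred, cls):
    """One-vs-rest F1 for a single grade class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p == cls)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != cls and p == cls)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p != cls)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores."""
    return sum(f1_per_class(y_true, y_pred, c) for c in GRADES) / len(GRADES)

def weighted_f1(y_true, y_pred):
    """Per-class F1 weighted by the class's share of the true labels."""
    counts = Counter(y_true)
    n = len(y_true)
    return sum(counts[c] / n * f1_per_class(y_true, y_pred, c) for c in GRADES)

def rmse(y_true, y_pred):
    """Root mean squared error over the assumed numeric grade values."""
    sq = [(GRADE_VALUE[t] - GRADE_VALUE[p]) ** 2 for t, p in zip(y_true, y_pred)]
    return (sum(sq) / len(sq)) ** 0.5
```

Because the weighted variant scales each class by its support, it tracks performance on frequent grades, while the macro variant exposes weak minority classes, which is why the abstract reports both.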

Similar Papers
  • Research Article
  • 10.52783/jisem.v10i51s.10392
Performance Analysis of Transformer Based Models for Automatic Short Answer Grading
  • May 30, 2025
  • Journal of Information Systems Engineering and Management
  • Rupal Chaudhari

Automatic Short Answer Grading (ASAG) has gained increasing importance in educational technology, where accurate and scalable assessment solutions are needed. Recent advances in Natural Language Processing (NLP) have introduced powerful Transformer-based models, such as Bidirectional Encoder Representations from Transformers (BERT), Text-to-Text Transfer Transformer (T5), and Generative Pre-trained Transformer 3 (GPT-3), which have demonstrated state-of-the-art performance across various text-based tasks. This paper presents a comparative study of these three models in the context of ASAG, evaluating their effectiveness, accuracy, and efficiency. BERT’s bidirectional encoding, T5’s text-to-text framework, and GPT-3’s autoregressive generation are explored in depth to assess their ability to understand, grade, and generate feedback on short answers. We utilize standard ASAG datasets and multiple evaluation metrics, including accuracy, precision, recall, and F1-score, to measure their performance. The comparative analysis reveals that while all three models exhibit strong capabilities, they vary in handling complex language and ambiguous student responses, with trade-offs in computational cost and scalability. This study highlights the strengths and weaknesses of each model in ASAG and offers insights into their practical applications in educational settings. Introduction: The automation of grading has become a focal point in modern education systems, driven by the increasing demand for scalable and efficient assessment solutions (Sahu & Bhowmick, 2015). With the proliferation of online learning platforms, digital classrooms, and remote education, the ability to automatically grade short-answer questions has gained significant importance (Gomaa & Fahmy, 2020). 
Automatic Short Answer Grading (ASAG) seeks to evaluate student responses by comparing them to model answers, often assessing the content’s correctness, relevance, and linguistic features—critical components for evaluating students’ understanding and knowledge retention (Busatta & Brancher, 2018). Traditional ASAG approaches typically employed rule-based systems, statistical models, and early machine learning algorithms that relied heavily on predefined keywords, templates, or handcrafted features (Tulu et al., 2021). While effective for straightforward, fact-based questions, these systems struggled to capture the complexity and variability of natural language, resulting in reduced grading accuracy—especially for creative or ambiguous responses (Sychev et al., 2019). Consequently, such methods often required significant manual intervention, limiting their scalability and applicability in dynamic educational settings (Muftah & Aziz, 2013). The advent of deep learning, particularly in the field of Natural Language Processing (NLP), has marked a transformative shift in ASAG (Gaddipati et al., 2020). Neural network-based models have demonstrated a remarkable capacity to learn and generalize from large datasets, enabling a more nuanced understanding of language (Wang et al., 2019). This has led to the development of more robust ASAG systems capable of handling a broader spectrum of student responses, ranging from factual answers to complex explanations (Roy et al., 2016). A pivotal advancement in NLP is the introduction of the Transformer architecture, which has revolutionized how language models are designed and trained (Vaswani et al., 2017). Transformers excel in processing sequential data through self-attention mechanisms that capture long-range dependencies and contextual relationships within text. 
This architectural innovation has significantly enhanced performance across a variety of NLP tasks, such as machine translation, sentiment analysis, and question answering (Peters et al., 2018), making Transformer-based models particularly suitable for enhancing ASAG systems (Raffel et al., 2020). In this paper, we focus on three prominent Transformer-based models—BERT, T5, and GPT-3—each representing a distinct approach to language understanding and processing. These models have set new benchmarks across numerous NLP tasks, and their potential application in ASAG is substantial. Objectives: The goal of this study is to conduct a comparative analysis of these three Transformer models—BERT, T5, and GPT-3—in the context of ASAG. We evaluate their performance on standard ASAG datasets using multiple evaluation metrics, such as accuracy, precision, recall, and F1-score. Additionally, we analyze the computational efficiency and scalability of these models to determine their practicality for deployment in large-scale educational environments. Methods: By providing a comprehensive comparison, this study seeks to shed light on the strengths and weaknesses of each model and their suitability for different types of ASAG tasks. Moreover, we aim to offer insights that can guide future research and development in this area, ultimately contributing to the creation of more effective and reliable automated grading systems. Results: The results of our comparative analysis of BERT, T5, and GPT-3 in the context of Automatic Short Answer Grading (ASAG) reveal important insights into the strengths and limitations of these Transformer models. This section discusses the implications of our findings, the practical considerations for deploying these models in educational settings, and identifies potential avenues for future research. 
Conclusions: In conclusion, this study provides a comprehensive comparative analysis of BERT, T5, and GPT-3 for ASAG, highlighting their strengths, limitations, and practical considerations. The insights gained from this research contribute to the ongoing development and refinement of automated grading systems, with the potential to enhance educational assessment and support in diverse learning environments.

  • Research Article
  • Cited by 3
  • 10.1007/s12572-017-0202-9
Selection of automatic short answer grading techniques using contextual bandits for different evaluation measures
  • Feb 20, 2018
  • International Journal of Advances in Engineering Sciences and Applied Mathematics
  • Shourya Roy + 2 more

Automatic short answer grading (ASAG) systems are designed to automatically assess short answers in natural language, ranging in length from a few words to a few sentences. Many ASAG techniques have been proposed in the literature. In this paper, we critically analyse the role of evaluation measures used for assessing the quality of ASAG techniques. In real-world settings, multiple factors, such as difficulty level and diversity of student answers, vary significantly across questions, leading to different ASAG techniques emerging as superior for different evaluation measures. Building upon this observation, we propose to automatically learn a mapping from questions to ASAG techniques using minimal human (expert/crowd) feedback. We do this by formulating the learning task as a contextual bandits problem and providing a rigorous regret minimization algorithm that handles key practical considerations, such as noisy experts and similarity between questions. Our approach offers the flexibility to include new ASAG systems on the fly and does not require the human expert to have knowledge of the working details of the system while providing feedback. With extensive simulations on a standard dataset, we demonstrate that our approach yields outcomes that are remarkably consistent with human evaluations.
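The question-to-technique mapping the paper learns can be illustrated with a much simpler epsilon-greedy bandit. The class below is a hypothetical sketch, not the authors' regret-minimization algorithm: it treats each question (or question cluster) as a context and keeps a running mean reward per (context, technique) pair.

```python
import random

class TechniqueSelector:
    """Epsilon-greedy selection of an ASAG technique per question context.
    A simplified stand-in for the paper's contextual-bandit algorithm."""

    def __init__(self, techniques, epsilon=0.1, seed=0):
        self.techniques = list(techniques)
        self.epsilon = epsilon
        self.rng = random.Random(seed)
        self.counts = {}   # pulls per (context, technique)
        self.values = {}   # running mean reward per (context, technique)

    def select(self, context):
        # Explore with probability epsilon, otherwise exploit the best mean.
        if self.rng.random() < self.epsilon:
            return self.rng.choice(self.techniques)
        return max(self.techniques,
                   key=lambda t: self.values.get((context, t), 0.0))

    def update(self, context, technique, reward):
        # Incremental mean update from the latest human feedback signal.
        key = (context, technique)
        n = self.counts.get(key, 0) + 1
        self.counts[key] = n
        old = self.values.get(key, 0.0)
        self.values[key] = old + (reward - old) / n
```

In this toy version the reward would be the observed agreement between a technique's grade and the expert's grade; the paper additionally handles noisy experts and similarity between questions, which this sketch omits.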

  • Research Article
  • 10.11591/ijece.v14i1.pp841-853
Deep learning based Arabic short answer grading in serious games
  • Feb 1, 2024
  • International Journal of Electrical and Computer Engineering (IJECE)
  • Younes Alaoui Soulimani + 2 more

Automatic short answer grading (ASAG) has become one of the problems addressed by natural language processing. Modern ASAG systems start with natural language preprocessing and end with grading. Researchers have experimented with machine learning in the preprocessing stage and deep learning techniques in automatic grading for English. However, little research is available on automatic grading for Arabic, and the datasets that ASAG depends on are limited in Arabic. In this research, we have collected a set of questions, answers, and associated grades in Arabic, and we have made this dataset publicly available. We have extended to Arabic the solutions used for English ASAG and tested how automatic grading works on answers in Arabic provided by schoolchildren in 6th grade in the context of serious games. We found that the schoolchildren provided answers that are 5.6 words long on average. On such answers, deep learning-based grading has achieved high accuracy even with limited training data. We have tested three different recurrent neural networks for grading, and with a transformer, we have achieved an accuracy of 95.67%. ASAG for schoolchildren will help detect children with learning problems early; when detected early, learning problems are easier for teachers to address. This is the main purpose of this research.

  • Research Article
  • Cited by 37
  • 10.1109/tlt.2019.2897997
Feature Engineering and Ensemble-Based Approach for Improving Automatic Short-Answer Grading Performance
  • Jan 1, 2020
  • IEEE Transactions on Learning Technologies
  • Archana Sahu + 1 more

In this paper, we studied different automatic short answer grading (ASAG) systems to provide a comprehensive view of the feature spaces explored by previous works. While the performance reported in previous works has been encouraging, a systematic study of the features is lacking. Apart from providing systematic feature space exploration, we also present ensemble methods that have been experimentally validated to exhibit significantly higher grading performance than existing approaches on almost all datasets in the ASAG domain. A comparative study over different features and regression models for short-answer grading has been performed with respect to the evaluation metrics used in ASAG. Apart from traditional text-similarity-based features like WordNet similarity and latent semantic analysis, we have introduced novel features such as topic models suited for short text and relevance-feedback-based features. An ensemble-based model has been built using a combination of different regression models with an approach based on stacked regression. The proposed ASAG system has been tested on the University of North Texas dataset for the regression task, whereas for the classification task, the student response analysis (SRA) based SciEntsBank and Beetle corpora have been used for evaluation. The grading performance of the ensemble-based ASAG system is substantially higher than that of any individual regression model. Extensive experimentation has revealed that feature selection, the introduction of novel features, and regressor stacking have been instrumental in achieving considerable improvement in performance over existing methods in the ASAG domain.

  • Research Article
  • Cited by 12
  • 10.1109/tlt.2023.3253071
Embeddings for Automatic Short Answer Grading: A Scoping Review
  • Apr 1, 2023
  • IEEE Transactions on Learning Technologies
  • Marko Putnikovic + 1 more

Automatic grading of short answers is an important task in computer-assisted assessment (CAA). Recently, embeddings, as semantic-rich textual representations, have been increasingly used to represent short answers and predict the grade. Despite the recent trend of applying embeddings in automatic short answer grading (ASAG), there are no systematic reviews of literature reporting on their usage. Therefore, following the PRISMA-ScR guidelines, this scoping review summarises relevant literature on the use of embeddings in ASAG, and reports on the current state of the art in that research area and on the identified knowledge gaps. We searched seven research databases for the relevant journal, conference, and workshop papers published from 2016 to July 2021. The inclusion criteria were based on the type of publication, its venue ranking, study focus, and evaluation methods. Upon the full-text screening, 17 articles were included in the scoping review. Among these, most of the articles used word embeddings, mainly to estimate the similarity of student and model answers using the cosine similarity measure or to initialise a neural network-based classification model. The contribution of embeddings to the performance of ASAG models compared to non-embedding features is inconclusive. Models employing embeddings were mostly evaluated on four public ASAG datasets using earlier ASAG methods as baselines. We summarise the reported evaluation results and draw conclusions on the performance of the state-of-the-art ASAG models.
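The most common usage reported in the review, scoring by cosine similarity between a student-answer embedding and a model-answer embedding, can be sketched as follows. The grade thresholds are illustrative assumptions, not values from any of the surveyed papers.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def grade_by_similarity(student_vec, model_vec, thresholds=(0.9, 0.75, 0.6, 0.4)):
    """Map similarity to a grade band A-E; the cut-offs are hypothetical."""
    sim = cosine_similarity(student_vec, model_vec)
    for grade, cut in zip("ABCD", thresholds):
        if sim >= cut:
            return grade
    return "E"
```

In the surveyed systems the vectors would come from word or sentence embedding models; the alternative usage noted in the review is feeding the embeddings into a neural classifier instead of thresholding a similarity score.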

  • Research Article
  • Cited by 10
  • 10.1007/s40593-023-00361-2
Cheating Automatic Short Answer Grading with the Adversarial Usage of Adjectives and Adverbs
  • Jul 26, 2023
  • International Journal of Artificial Intelligence in Education
  • Anna Filighera + 3 more

Automatic grading models are valued for the time and effort saved during the instruction of large student bodies. Especially with the increasing digitization of education and interest in large-scale standardized testing, the popularity of automatic grading has risen to the point where commercial solutions are widely available and used. However, for short answer formats, automatic grading is challenging due to natural language ambiguity and versatility. While automatic short answer grading models are beginning to compare to human performance on some datasets, their robustness, especially to adversarially manipulated data, is questionable. Exploitable vulnerabilities in grading models can have far-reaching consequences ranging from cheating students receiving undeserved credit to undermining automatic grading altogether—even when most predictions are valid. In this paper, we devise a black-box adversarial attack tailored to the educational short answer grading scenario to investigate the grading models’ robustness. In our attack, we insert adjectives and adverbs into natural places of incorrect student answers, fooling the model into predicting them as correct. We observed a loss of prediction accuracy between 10 and 22 percentage points using the state-of-the-art models BERT and T5. While our attack made answers appear less natural to humans in our experiments, it did not significantly increase the graders’ suspicions of cheating. Based on our experiments, we provide recommendations for utilizing automatic grading systems more safely in practice.

  • Conference Article
  • Cited by 47
  • 10.24963/ijcai.2017/284
Earth Mover's Distance Pooling over Siamese LSTMs for Automatic Short Answer Grading
  • Aug 1, 2017
  • Sachin Kumar + 2 more

Automatic short answer grading (ASAG) can reduce tedium for instructors, but is complicated by free-form student inputs. An important ASAG task is to assign ordinal scores to student answers, given some “model” or ideal answers. Here we introduce a novel framework for ASAG by cascading three neural building blocks: Siamese bidirectional LSTMs applied to a model and a student answer, a novel pooling layer based on earth-mover distance (EMD) across all hidden states from both LSTMs, and a flexible final regression layer to output scores. On standard ASAG data sets, our system shows substantial reduction in grade estimation error compared to competitive baselines. We demonstrate that EMD pooling results in substantial accuracy gains, and that a support vector ordinal regression (SVOR) output layer helps outperform softmax. Our system also outperforms recent attention mechanisms on LSTM states.
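The earth mover's distance at the heart of the pooling layer has a simple closed form in one dimension: for two histograms of equal total mass, it is the sum of absolute differences of their running totals. The sketch below shows only that 1-D distance, not the paper's pooling over Siamese LSTM hidden states.

```python
def emd_1d(p, q):
    """Earth mover's distance between two equal-mass 1-D histograms,
    computed as the L1 distance between their cumulative sums."""
    assert abs(sum(p) - sum(q)) < 1e-9, "histograms must have equal mass"
    dist, cp, cq = 0.0, 0.0, 0.0
    for a, b in zip(p, q):
        cp += a
        cq += b
        dist += abs(cp - cq)
    return dist
```

Intuitively, each term measures how much mass must still be moved past that bin boundary, which is why moving one unit of mass one bin to the right costs exactly 1.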

  • Research Article
  • Cited by 300
  • 10.1007/s40593-014-0026-8
The Eras and Trends of Automatic Short Answer Grading
  • Oct 23, 2014
  • International Journal of Artificial Intelligence in Education
  • Steven Burrows + 2 more

Automatic short answer grading (ASAG) is the task of assessing short natural language responses to objective questions using computational methods. The active research in this field has increased enormously of late with over 80 papers fitting a definition of ASAG. However, the past efforts have generally been ad-hoc and non-comparable until recently, hence the need for a unified view of the whole field. The goal of this paper is to address this aim with a comprehensive review of ASAG research and systems according to history and components. Our historical analysis identifies 35 ASAG systems within 5 temporal themes that mark advancement in methodology or evaluation. In contrast, our component analysis reviews 6 common dimensions from preprocessing to effectiveness. A key conclusion is that an era of evaluation is the newest trend in ASAG research, which is paving the way for the consolidation of the field.

  • Conference Article
  • Cited by 19
  • 10.1109/tale48000.2019.9226026
Automatic Short Answer Grading using Siamese Bidirectional LSTM Based Regression
  • Dec 1, 2019
  • Arya Prabhudesai + 1 more

Automatic student assessment plays an important role in education - it provides instant feedback to learners, and at the same time reduces tedious grading workload for instructors. In this paper, we investigate new machine learning techniques for automatic short answer grading (ASAG). The ASAG problem mainly involves assessing short, natural language responses to given questions automatically. While current research in the field has focused either on feature engineering or deep learning, we propose a new approach which combines the advantages of both. More specifically, we propose a Siamese Bidirectional LSTM Neural Network based Regressor in conjunction with handcrafted features for ASAG. Extensive experiments using the popular Mohler ASAG dataset which contains training samples from Computer Science courses, have demonstrated that our system, despite being simpler, provides similar or better overall performance in terms of grading accuracy (measured with Pearson r, mean absolute error and root mean squared error) compared to state-of-the-art results.

  • Book Chapter
  • Cited by 2
  • 10.1007/978-3-030-82153-1_11
Incorporating Question Information to Enhance the Performance of Automatic Short Answer Grading
  • Jan 1, 2021
  • Shuang Chen + 1 more

Automatic short answer grading (ASAG) is focusing on tackling the problem of automatically assessing students’ constructed responses to open-ended questions. ASAG is still far from being a reality in NLP. Previous work mainly concentrates on exploiting feature extraction from the textual information between the student answer and the model answer. A grade will be assigned to the student based on the similarity of his/her answers and the model answer. However, ASAG models trained by the same type of features lack the capacity to deal with a diversity of conceptual representations in students’ responses. To capture multiple types of features, prior knowledge is utilized in our work to enrich the obtained features. The whole model is based on the Transformer. More specifically, a novel training approach is proposed. Forward propagation is added in the training step randomly to exploit the textual information between the provided questions and student answers in a training step. A feature fusion layer followed by an output layer is introduced accordingly for fine-tuning purposes. We evaluate the proposed model on two datasets (the University of North Texas dataset and student response analysis (SRA) dataset). A comparison is conducted on the ASAG task between the proposed model and the baselines. The performance results show that our model is superior to the recent state-of-the-art models.

  • Conference Article
  • Cited by 1
  • 10.5753/sbie.2024.242424
Prompt Engineering for Automatic Short Answer Grading in Brazilian Portuguese
  • Nov 4, 2024
  • Rafael Ferreira Mello + 6 more

Automatic Short Answer Grading (ASAG) is a prominent area of Artificial Intelligence in Education (AIED). Despite much research, developing ASAG systems is challenging, even when focused on a single subject, mostly due to the variability in length and content of students' answers. While recent research has explored Large Language Models (LLMs) to enhance the efficiency of ASAG, the LLM performance is highly dependent on the prompt design. In that context, prompt engineering plays a crucial role. However, to the best of our knowledge, no research has systematically investigated prompt engineering in ASAG. Thus, this study compares over 128 prompt combinations for a Portuguese dataset based on GPT-3.5-Turbo and GPT-4-Turbo. Our findings indicate the crucial role of specific prompt components in improving GPT results and shows that GPT-4 consistently outperformed GPT-3.5 in this domain. These insights guide prompt design for ASAG in the context of Brazilian Portuguese. Therefore, we recommend students, educators, and developers leverage these findings to optimize prompt design and benefit from the advancements offered by state-of-the-art LLMs whenever possible.

  • Book Chapter
  • Cited by 6
  • 10.1007/978-981-19-8040-4_5
Explainability in Automatic Short Answer Grading
  • Jan 1, 2023
  • Tim Schlippe + 3 more

Massive open online courses and other online study opportunities are providing easier access to education for more and more people around the world. To cope with the large number of exams to be assessed in these courses, AI-driven automatic short answer grading can recommend points for free-text answers to teaching staff, leading to faster and fairer grading. But what would be the best way to work with the AI? In this paper, we investigate and evaluate different methods for explainability in automatic short answer grading. Our survey of over 70 professors, lecturers, and teachers with grading experience showed that displaying the predicted points together with matches between the student answer and the model answer is rated better than the other tested explainable AI (XAI) methods in the aspects of trust, informative content, speed, consistency and fairness, fun, comprehensibility, applicability, use in exam preparation, and in general. Keywords: Explainability, Explainable AI, XAI, Automatic short answer grading, AI in education

  • Research Article
  • Cited by 14
  • 10.12928/telkomnika.v17i2.11785
A scoring rubric for automatic short answer grading system
  • Apr 1, 2019
  • TELKOMNIKA (Telecommunication Computing Electronics and Control)
  • Uswatun Hasanah + 3 more

During the past decades, research on automatic grading has become an interesting issue. These studies focus on how machines can help humans assess students' learning outcomes. Automatic grading enables teachers to assess students' answers more objectively, consistently, and quickly. The essay format, in particular, has two different types: the long essay and the short answer. Most previous research developed automatic essay grading (AEG) rather than automatic short answer grading (ASAG). This study aims to assess the sentence similarity between short answers and the questions and reference answers in Indonesian without any semantic language tools. The pre-processing steps consist of case folding, tokenization, stemming, and stopword removal. The proposed approach is a scoring rubric obtained by measuring the similarity of sentences using string-based similarity methods and a keyword matching process. The dataset used in this study consists of 7 questions, 34 alternative reference answers, and 224 student answers. The experiment results show that the proposed approach achieves a Pearson correlation value between 0.65419 and 0.66383, with Mean Absolute Error (MAE) values between 0.94994 and 1.24295. The proposed approach also improves the correlation value and decreases the error value for each method.
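A rubric of this kind, combining a string-based similarity against alternative reference answers with a keyword matching step, might be sketched as below. The Jaccard token-overlap measure and the equal weighting of the two signals are illustrative assumptions, not the paper's exact formulas.

```python
def tokenize(text):
    """Minimal stand-in for the paper's pre-processing pipeline
    (case folding and tokenization only; no stemming or stopword removal)."""
    return text.lower().split()

def jaccard(a, b):
    """String-based similarity as token-set overlap."""
    sa, sb = set(tokenize(a)), set(tokenize(b))
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0

def rubric_score(student_answer, reference_answers, keywords, max_score=4.0):
    """Best similarity over alternative references, blended with keyword
    coverage; the 50/50 weighting is a hypothetical choice."""
    sim = max(jaccard(student_answer, ref) for ref in reference_answers)
    tokens = set(tokenize(student_answer))
    coverage = sum(1 for k in keywords if k.lower() in tokens) / len(keywords)
    return max_score * (0.5 * sim + 0.5 * coverage)
```

Taking the maximum over the 34 alternative reference answers mirrors the dataset design above: a student only needs to match one acceptable phrasing to score well.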

  • Research Article
  • 10.4197/comp.11-2.2
Automatic Short Answer Grading Using Paragraph Vectors and Transfer Learning Models
  • Dec 1, 2022
  • Journal of King Abdulaziz University: Computing and Information Technology Sciences
  • Abrar Alreheli + 1 more

Grading questions is one of the duties that most consumes instructors' time and effort. Among different question formats, short answers measure students' depth of understanding. Considerable research has been done on grading short answers automatically. Recent approaches attempt to solve this problem using semantic similarity and deep learning models. Correspondingly, paragraph embedding models and transfer learning models have shown promising results in text-similarity tasks that consider the context of the text. This study investigates distributional semantics and deep learning models applied to the task of short answer grading by computing the semantic similarity between students' submitted answers and a key answer. We analyze the effect of training two different models on a domain-specific corpus, and we suggest training the models on a domain-specific corpus instead of using pre-trained models. In the first experiment, paragraph vectors are trained on a domain-specific corpus, and in the second, selected transfer learning models are fine-tuned on the domain-specific corpus. The best results achieved by fine-tuning the roberta-large masked language model on the domain-specific corpus are 0.620 for the correlation coefficient and 0.777 for RMSE. We compare the achieved results against baseline models and the results of former studies. We conclude that pre-trained paragraph vectors achieve better semantic similarity than paragraph vectors trained on a domain-specific corpus. In contrast, fine-tuning transfer learning models on a domain-specific corpus improves performance over using the pre-trained masked language models directly.
