Fine-grained multi-prompt essay scoring with multi-level disentanglement

  • Abstract
  • Literature Map
  • Similar Papers
Abstract

The application of language models in essay scoring has gained significant attention in recent years, typically by evaluating a single model across multiple prompts. In such a multi-prompt setup, it is crucial to understand the varying aspects of different prompts: notable variations exist even in a trait that carries the same name across prompts. This semantic variation in same-named traits underscores the need to treat them differently, at a fine-grained level, according to each prompt. In this study, we propose a multi-level disentanglement framework for multi-prompt essay scoring, designed to achieve fine-grained disentanglement of the semantic differences across such traits. Our method not only improves the quality of essay scoring but also reduces memory usage and latency. Experimental results highlight that our framework surpasses seven state-of-the-art essay scoring methods as well as large language model (LLM)-based zero-shot and few-shot approaches, achieving the highest agreement with human essay ratings.

Similar Papers
  • Research Article
  • Cited by 4
  • 10.1038/s41598-025-92586-5
Efficient self-attention with smart pruning for sustainable large language models
  • Mar 24, 2025
  • Scientific Reports
  • Samir Brahim Belhaouari + 1 more

Large Language Models (LLMs) have revolutionized artificial intelligence by enabling multitasking across diverse fields. However, their high computational demands result in significant environmental impacts, particularly in terms of energy and water consumption. This paper addresses these issues by proposing an innovative compression approach for reducing LLM sizes. We focus on compressing the internal transformer layers, which are critical contributors to LLMs’ computational complexity. Our approach combines new mathematical and structural methods for model compression. We begin by applying Forward Propagation Pruning (FPP) to compress the embedding and feed-forward layers, utilizing a weight freezing and zeroing technique for suspected unused parameters. This reduces the number of trainable parameters, accelerating the overall training process and enabling faster convergence. Second, the Weight Matrix Folding method is introduced to prune the self-attention layer matrices using a simple and efficient mathematical model. This method integrates Identical Row Compression (IRC) to optimize the compression of the Query and Key matrices, alongside Diagonal Weight Compression (DWC), which reformulates the Value matrix into a diagonal structure. Consequently, this technique significantly diminishes parameter variability across the three matrices, enhancing consistency and performance while simplifying complexity. The compression approach is evaluated on three language modeling datasets and eight widely used classification datasets, comparing it to various pruning methods. Our method successfully compresses transformer layers by 99% and linear layers by 70%, resulting in an overall model compression of around 70%, while maintaining nearly the same accuracy. Notably, with moderate compression rates of 20% to 40%, model performance not only remained stable but even improved. 
This leads to substantial reductions in memory usage and computational demands, making LLMs more resource-efficient and highlighting the potential to optimize them for a more sustainable AI future.

  • Research Article
  • Cited by 18
  • 10.1177/20531680241231468
Do AIs know what the most important issue is? Using language models to code open-text social survey responses at scale
  • Jan 1, 2024
  • Research & Politics
  • Jonathan Mellon + 5 more

Can artificial intelligence accurately label open-text survey responses? We compare the accuracy of six large language models (LLMs) using a few-shot approach, three supervised learning algorithms (SVM, DistilRoBERTa, and a neural network trained on BERT embeddings), and a second human coder on the task of categorizing “most important issue” responses from the British Election Study Internet Panel into 50 categories. For the scenario where a researcher lacks existing training data, the accuracy of the highest-performing LLM (Claude-1.3: 93.9%) neared human performance (94.7%) and exceeded the highest-performing supervised approach trained on 1000 randomly sampled cases (neural network: 93.5%). In a scenario where previous data has been labeled but a researcher wants to label novel text, the best LLM’s (Claude-1.3: 80.9%) few-shot performance is only slightly behind the human (88.6%) and exceeds the best supervised model trained on 576,000 cases (DistilRoBERTa: 77.8%). PaLM-2, Llama-2, and the SVM all performed substantially worse than the best LLMs and supervised models across all metrics and scenarios. Our results suggest that LLMs may allow for greater use of open-ended survey questions in the future.

  • Discussion
  • 10.1111/cogs.13430
Large Language Models: A Historical and Sociocultural Perspective.
  • Mar 1, 2024
  • Cognitive science
  • Eugene Yu Ji

This letter explores the intricate historical and contemporary links between large language models (LLMs) and cognitive science through the lens of information theory, statistical language models, and socioanthropological linguistic theories. The emergence of LLMs highlights the enduring significance of information-based and statistical learning theories in understanding human communication. These theories, initially proposed in the mid-20th century, offered a visionary framework for integrating computational science, social sciences, and humanities, which nonetheless was not fully fulfilled at that time. The subsequent development of sociolinguistics and linguistic anthropology, especially since the 1970s, provided critical perspectives and empirical methods that both challenged and enriched this framework. This letter proposes that two pivotal concepts derived from this development, metapragmatic function and indexicality, offer a fruitful theoretical perspective for integrating the semantic, textual, and pragmatic, contextual dimensions of communication, an amalgamation that contemporary LLMs have yet to fully achieve. The author believes that contemporary cognitive science is at a crucial crossroads, where fostering interdisciplinary dialogues among computational linguistics, social linguistics and linguistic anthropology, and cognitive and social psychology is in particular imperative. Such collaboration is vital to bridge the computational, cognitive, and sociocultural aspects of human communication and human-AI interaction, especially in the era of large language and multimodal models and human-centric Artificial Intelligence (AI).

  • Research Article
  • Cited by 8
  • 10.1287/ijds.2023.0007
How Can IJDS Authors, Reviewers, and Editors Use (and Misuse) Generative AI?
  • Apr 1, 2023
  • INFORMS Journal on Data Science
  • Galit Shmueli + 7 more

How Can IJDS Authors, Reviewers, and Editors Use (and Misuse) Generative AI?

  • Conference Article
  • Cited by 104
  • 10.1145/3510003.3510203
Jigsaw
  • May 21, 2022
  • Naman Jain + 6 more

Large pre-trained language models such as GPT-3 [10], Codex [11], and Google's language model [7] are now capable of generating code from natural language specifications of programmer intent. We view these developments with a mixture of optimism and caution. On the optimistic side, such large language models have the potential to improve productivity by providing an automated AI pair programmer for every programmer in the world. On the cautionary side, since these large language models do not understand program semantics, they offer no guarantees about quality of the suggested code. In this paper, we present an approach to augment these large language models with post-processing steps based on program analysis and synthesis techniques, that understand the syntax and semantics of programs. Further, we show that such techniques can make use of user feedback and improve with usage. We present our experiences from building and evaluating such a tool Jigsaw, targeted at synthesizing code for using Python Pandas API using multi-modal inputs. Our experience suggests that as these large language models evolve for synthesizing code from intent, Jigsaw has an important role to play in improving the accuracy of the systems.

  • Research Article
  • Cited by 8
  • 10.1016/j.procs.2023.09.086
A Large and Diverse Arabic Corpus for Language Modeling
  • Jan 1, 2023
  • Procedia Computer Science
  • Abbas Raza Ali + 3 more

A Large and Diverse Arabic Corpus for Language Modeling

  • Research Article
  • 10.1108/ir-02-2025-0074
Large language and vision-language models for robot: safety challenges, mitigation strategies and future directions
  • Jul 29, 2025
  • Industrial Robot: the international journal of robotics research and application
  • Xiangyu Hu + 1 more

Purpose: This study aims to explore the integration of large language models (LLMs) and vision-language models (VLMs) in robotics, highlighting their potential benefits and the safety challenges they introduce, including robustness issues, adversarial vulnerabilities, privacy concerns and ethical implications.
Design/methodology/approach: This survey conducts a comprehensive analysis of the safety risks associated with LLM- and VLM-powered robotic systems. The authors review existing literature, analyze key challenges, evaluate current mitigation strategies and propose future research directions.
Findings: The study identifies that ensuring the safety of LLM-/VLM-driven robots requires a multi-faceted approach. While current mitigation strategies address certain risks, gaps remain in real-time monitoring, adversarial robustness and ethical safeguards.
Originality/value: This study offers a structured and comprehensive overview of the safety challenges in LLM-/VLM-driven robotics. It contributes to ongoing discussions by integrating technical, ethical and regulatory perspectives to guide future advancements in safe and responsible artificial intelligence-driven robotics.

  • Research Article
  • 10.1038/s41698-025-00916-7
Evaluating the performance of large language & visual-language models in cervical cytology screening
  • May 23, 2025
  • npj Precision Oncology
  • Qi Hong + 15 more

Large language models (LLMs) and large visual-language models (LVLMs) have exhibited near-human levels of knowledge, image comprehension, and reasoning abilities, and their performance has undergone evaluation in some healthcare domains. However, a systematic evaluation of their capabilities in cervical cytology screening has yet to be conducted. Here, we constructed CCBench, a benchmark dataset dedicated to the evaluation of LLMs and LVLMs in cervical cytology screening, and developed a GPT-based semi-automatic evaluation pipeline to assess the performance of six LLMs (GPT-4, Bard, Claude-2.0, LLaMa-2, Qwen-Max, and ERNIE-Bot-4.0) and five LVLMs (GPT-4V, Gemini, LLaVA, Qwen-VL, and ViLT) on this dataset. CCBench comprises 773 question-answer (QA) pairs and 420 visual-question-answer (VQA) triplets, making it the first dataset in cervical cytology to include both QA and VQA data. We found that LLMs and LVLMs demonstrate promising accuracy and specialization in cervical cytology screening. GPT-4 achieved the best performance on the QA dataset, with an accuracy of 70.5% for close-ended questions and average expert evaluation score of 6.9/10 for open-ended questions. On the VQA dataset, Gemini achieved the highest accuracy for close-ended questions at 67.8%, while GPT-4V attained the highest expert evaluation score of 6.1/10 for open-ended questions. Besides, LLMs and LVLMs revealed varying abilities in answering questions across different topics and difficulty levels. However, their performance remains inferior to the expertise exhibited by cytopathology professionals, and the risk of generating misinformation could lead to potential harm. Therefore, substantial improvements are required before these models can be reliably deployed in clinical practice.

  • Research Article
  • Cited by 291
  • 10.1111/bjet.13370
Practical and ethical challenges of large language models in education: A systematic scoping review
  • Aug 6, 2023
  • British Journal of Educational Technology
  • Lixiang Yan + 8 more

Educational technology innovations leveraging large language models (LLMs) have shown the potential to automate the laborious process of generating and analysing textual content. While various innovations have been developed to automate a range of educational tasks (e.g., question generation, feedback provision, and essay grading), there are concerns regarding the practicality and ethicality of these innovations. Such concerns may hinder future research and the adoption of LLM-based innovations in authentic educational contexts. To address this, we conducted a systematic scoping review of 118 peer-reviewed papers published since 2017 to pinpoint the current state of research on using LLMs to automate and support educational tasks. The findings revealed 53 use cases for LLMs in automating education tasks, categorised into nine main categories: profiling/labelling, detection, grading, teaching support, prediction, knowledge representation, feedback, content generation, and recommendation. Additionally, we identified several practical and ethical challenges, including low technological readiness, lack of replicability and transparency, and insufficient privacy and beneficence considerations. The findings were summarised into three recommendations for future studies: updating existing innovations with state-of-the-art models (e.g., GPT-3/4), embracing the initiative of open-sourcing models/systems, and adopting a human-centred approach throughout the developmental process. As the intersection of AI and education is continuously evolving, the findings of this study can serve as an essential reference point for researchers, allowing them to leverage the strengths, learn from the limitations, and uncover potential research opportunities enabled by ChatGPT and other generative AI models.
Practitioner notes
What is currently known about this topic
  • Generating and analysing text-based content are time-consuming and laborious tasks.
  • Large language models are capable of efficiently analysing an unprecedented amount of textual content and completing complex natural language processing and generation tasks.
  • Large language models have been increasingly used to develop educational technologies that aim to automate the generation and analysis of textual content, such as automated question generation and essay scoring.
What this paper adds
  • A comprehensive list of different educational tasks that could potentially benefit from LLM-based innovations through automation.
  • A structured assessment of the practicality and ethicality of existing LLM-based innovations from seven important aspects using established frameworks.
  • Three recommendations that could potentially support future studies to develop LLM-based innovations that are practical and ethical to implement in authentic educational contexts.
Implications for practice and/or policy
  • Updating existing innovations with state-of-the-art models may further reduce the amount of manual effort required for adapting existing models to different educational tasks.
  • The reporting standards of empirical research that aims to develop educational technologies using large language models need to be improved.
  • Adopting a human-centred approach throughout the developmental process could contribute to resolving the practical and ethical challenges of large language models in education.

  • Research Article
  • 10.1080/13658816.2025.2577252
Extraction of geoprocessing modeling knowledge from crowdsourced Google Earth Engine scripts by coordinating large and small language models
  • Nov 1, 2025
  • International Journal of Geographical Information Science
  • Anqi Zhao + 7 more

The widespread use of online geoinformation platforms, such as Google Earth Engine (GEE), has produced numerous scripts. Extracting domain knowledge from these crowdsourced scripts supports understanding of geoprocessing workflows. Small Language Models (SLMs) are effective for semantic embedding but struggle with complex code; Large Language Models (LLMs) can summarize scripts, yet lack consistent geoscience terminology to express knowledge. In this paper, we propose Geo-CLASS, a knowledge extraction framework for geospatial analysis scripts that coordinates large and small language models. Specifically, we designed domain-specific schemas and a schema-aware prompt strategy to guide LLMs to generate and associate entity descriptions, and employed SLMs to standardize the outputs by mapping these descriptions to a constructed geoscience knowledge base. Experiments on 237 GEE scripts, selected from 295,943 scripts in total, demonstrated that our framework outperformed LLM baselines, including Llama-3, GPT-3.5 and GPT-4o. In comparison, the proposed framework improved accuracy in recognizing entities and relations by up to 31.9% and 12.0%, respectively. Ablation studies and performance analysis further confirmed the effectiveness of key components and the robustness of the framework. Geo-CLASS has the potential to enable the construction of geoprocessing modeling knowledge graphs, facilitate domain-specific reasoning and advance script generation via Retrieval-Augmented Generation (RAG).

  • Research Article
  • Cited by 3
  • 10.1109/embc53108.2024.10782119
High Throughput Phenotyping of Physician Notes with Large Language and Hybrid NLP Models.
  • Jul 15, 2024
  • Annual International Conference of the IEEE Engineering in Medicine and Biology Society. IEEE Engineering in Medicine and Biology Society. Annual International Conference
  • Syed I Munzir + 2 more

Deep phenotyping is the detailed description of patient signs and symptoms using concepts from an ontology. The deep phenotyping of the numerous physician notes in electronic health records requires high throughput methods. Over the past 30 years, progress has been made toward making high-throughput phenotyping feasible. In this study, we demonstrate that a large language model and a hybrid NLP model (combining word vectors with a machine learning classifier) can perform high throughput phenotyping on physician notes with high accuracy. Large language models will likely emerge as the preferred method for high throughput deep phenotyping of physician notes. Clinical relevance: Large language models will likely emerge as the dominant method for the high throughput phenotyping of signs and symptoms in physician notes.

  • Research Article
  • Cited by 55
  • 10.1038/s41746-024-01024-9
CancerGPT for few shot drug pair synergy prediction using large pretrained language models
  • Feb 19, 2024
  • NPJ Digital Medicine
  • Tianhao Li + 6 more

Large language models (LLMs) have been shown to have significant potential in few-shot learning across various fields, even with minimal training data. However, their ability to generalize to unseen tasks in more complex fields, such as biology and medicine, has yet to be fully evaluated. LLMs can offer a promising alternative approach for biological inference, particularly in cases where structured data and sample size are limited, by extracting prior knowledge from text corpora. Here we report our proposed few-shot learning approach, which uses LLMs to predict the synergy of drug pairs in rare tissues that lack structured data and features. Our experiments, which involved seven rare tissues from different cancer types, demonstrate that the LLM-based prediction model achieves significant accuracy with very few or zero samples. Our proposed model, CancerGPT (with ~124M parameters), is comparable to the larger fine-tuned GPT-3 model (with ~175B parameters). Our research contributes to tackling drug pair synergy prediction in rare tissues with limited data and to advancing the use of LLMs for biological and medical inference tasks.

  • Research Article
  • 10.25136/2409-8698.2024.4.70455
Optimization of traditional methods for determining the similarity of project names and purchases using large language models
  • Apr 1, 2024
  • Litera
  • Aleksei Aleksandrovich Golikov + 2 more

The subject of the study is the analysis and improvement of methods for determining the relevance of project names to the information content of purchases using large language models. The object of the study is a database containing the names of projects and purchases in the field of the electric power industry, collected from open sources. The authors examine in detail such aspects of the topic as the use of TF-IDF and cosine similarity metrics for primary data filtering, and also describe the integration and evaluation of the effectiveness of large language models such as GigaChat, GPT-3.5, and GPT-4 in text data matching tasks. Special attention is paid to methods for refining the similarity of names based on reflection introduced into the prompts of the large language models, which makes it possible to increase the accuracy of data comparison. The study uses TF-IDF and cosine similarity methods for primary data analysis, as well as the large language models GigaChat, GPT-3.5, and GPT-4 for detailed verification of the relevance of project names and purchases, including reflection in model prompts to improve the accuracy of results. The novelty of the research lies in the development of a combined approach to determining the relevance of project names and purchases, combining traditional methods of processing text information (TF-IDF, cosine similarity) with the capabilities of large language models. A special contribution of the authors is the proposed methodology for improving the accuracy of data comparison by refining the results of primary selection using the GPT-3.5 and GPT-4 models with optimized prompts, including reflection. 
The main conclusions of the study are confirmation of the prospects of using the developed approach in the tasks of information support for procurement processes and project implementation, as well as the possibility of using the results obtained for the development of text data mining systems in various sectors of the economy. The study showed that the use of language models makes it possible to improve the value of the F2 measure to 0.65, which indicates a significant improvement in the quality of data comparison compared with basic methods.
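The primary-filtering stage described above, TF-IDF vectors compared by cosine similarity, is a standard technique. A minimal pure-Python sketch follows; the whitespace tokenization and the plain log(N/df) IDF variant are our own simplifying assumptions, not necessarily the authors' exact pipeline.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Build sparse TF-IDF vectors (dicts) for a list of text documents."""
    tokenized = [d.lower().split() for d in docs]
    df = Counter()                       # document frequency per term
    for toks in tokenized:
        df.update(set(toks))
    n = len(docs)
    vecs = []
    for toks in tokenized:
        tf = Counter(toks)
        # TF (relative frequency) weighted by IDF = log(n / df)
        vecs.append({t: (c / len(toks)) * math.log(n / df[t])
                     for t, c in tf.items()})
    return vecs

def cosine(u, v):
    """Cosine similarity between two sparse vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0
```

In a pipeline like the one described, name pairs scoring above a cutoff on this cheap similarity would then be passed to the LLMs for the reflection-based relevance check.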

  • Research Article
  • 10.1145/3749840
AdaptiveLog: An Adaptive Log Analysis Framework with the Collaboration of Large and Small Language Model
  • Jul 22, 2025
  • ACM Transactions on Software Engineering and Methodology
  • Lipeng Ma + 8 more

Automated log analysis is crucial to ensure the high availability and reliability of complex systems. The advent of large language models (LLMs) in natural language processing (NLP) has ushered in a new era of language model-driven automated log analysis, garnering significant interest. Within this field, two primary paradigms based on language models for log analysis have become prominent. Small Language Models (SLMs), such as BERT, follow the pre-train and fine-tune paradigm, focusing on a specific log analysis task through fine-tuning on supervised datasets. LLMs, such as ChatGPT, follow the in-context learning paradigm, analyzing logs given a few examples in prompt contexts without updating parameters. Despite their respective strengths, both models exhibit inherent limitations. Comparing SLMs and LLMs, we notice that SLMs are more cost-effective but less powerful, whereas LLMs with large parameter counts are highly powerful but expensive and inefficient. To trade off between the performance and inference costs of both models in automated log analysis, this paper introduces an adaptive log analysis framework known as AdaptiveLog, which effectively reduces the costs associated with the LLM while ensuring superior results. This framework coordinates an LLM and a small language model, strategically allocating the LLM to tackle complex logs while delegating simpler logs to the SLM. Specifically, to query the LLM efficiently, we propose an adaptive selection strategy based on the uncertainty estimation of the SLM, where the LLM is invoked only when the SLM is uncertain. In addition, to enhance the reasoning ability of the LLM in log analysis tasks, we propose a novel prompt strategy that retrieves similar error-prone cases as references, enabling the model to leverage past error experiences and learn solutions from these cases. 
We evaluate AdaptiveLog on different log analysis tasks; extensive experiments demonstrate that AdaptiveLog achieves state-of-the-art results across these tasks, elevating the overall accuracy of log analysis while maintaining cost efficiency. Our source code and detailed experimental data are available at https://github.com/LeaperOvO/AdaptiveLog-review.
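The core routing idea, querying the LLM only when the SLM is uncertain, can be sketched in a few lines. The threshold value, function names, and confidence interface below are illustrative assumptions, not AdaptiveLog's actual API.

```python
def adaptive_route(log_line, slm_predict, llm_predict, threshold=0.8):
    """Route a log line between a cheap SLM and an expensive LLM.

    slm_predict: returns (label, confidence in [0, 1]).
    llm_predict: returns a label; only called when the SLM is uncertain.
    Returns (label, which_model_answered).
    """
    label, confidence = slm_predict(log_line)
    if confidence >= threshold:
        return label, "slm"          # confident: avoid the LLM call
    return llm_predict(log_line), "llm"  # uncertain: escalate to the LLM
```

In AdaptiveLog itself the uncertainty estimate comes from the fine-tuned SLM, and the escalated LLM prompt additionally retrieves similar error-prone cases as references.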

  • Research Article
  • Cited by 13
  • 10.1016/j.caeai.2024.100234
Large language models and automated essay scoring of English language learner writing: Insights into validity and reliability
  • May 10, 2024
  • Computers and Education: Artificial Intelligence
  • Austin Pack + 2 more

Large language models and automated essay scoring of English language learner writing: Insights into validity and reliability
