Automatic generation of physics items with Large Language Models (LLMs)
High-quality items are essential for producing reliable and valid assessments, offering valuable insights for decision-making. As demand grows for items with strong psychometric properties in both summative and formative assessments, automatic item generation (AIG) has gained prominence. Research highlights the potential of large language models (LLMs) in the AIG process, noting the positive impact of generative AI tools such as ChatGPT on educational assessment and their ability to generate various item types across different languages and subjects. This study addresses a research gap by exploring how well AI-generated items in secondary/high school physics align with an educational taxonomy. It draws on Bloom's taxonomy, a well-known framework for designing and categorizing assessment items across cognitive levels from low to high, and focuses on a preliminary assessment of LLMs' ability to generate physics items that match the Application level of Bloom's taxonomy. Two leading LLMs, ChatGPT (GPT-4) and Gemini, were chosen for their strong performance in creating high-quality educational content. The research used various prompts to generate items at different cognitive levels based on Bloom's taxonomy. These items were assessed against multiple criteria: clarity, accuracy, absence of misleading content, appropriate complexity, correct language use, alignment with the intended level of Bloom's taxonomy, solvability, and assurance of a single correct answer. The findings indicated that both ChatGPT and Gemini were skilled at generating physics assessment items, though their effectiveness varied with the prompting method used. Instructional prompts in particular produced excellent outputs from both models, yielding items that were clear, precise, and consistently aligned with the Application level of Bloom's taxonomy.
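To make the prompting approach concrete, the sketch below shows how an instructional prompt for an Application-level physics item might be sent to a model, assuming the OpenAI Python client; the prompt wording, model name, and temperature are illustrative choices, not the study's exact materials.

```python
# Minimal sketch: generating an Application-level physics item with an
# instructional prompt. The prompt wording and model choice are illustrative
# assumptions, not the exact prompts used in the study.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

INSTRUCTIONAL_PROMPT = (
    "You are a high school physics teacher. Write one multiple-choice item "
    "at the 'Apply' level of Bloom's taxonomy on Newton's second law. "
    "Requirements: a clear stem, four options, exactly one correct answer, "
    "no misleading content, and language suitable for secondary students. "
    "Mark the correct option."
)

response = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": INSTRUCTIONAL_PROMPT}],
    temperature=0.7,
)
print(response.choices[0].message.content)
```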
- Research Article
- 10.21449/ijate.1602294
- Jun 1, 2025
- International Journal of Assessment Tools in Education
This study reviews existing research on the use of large language models (LLMs) for automatic item generation (AIG). We performed a comprehensive literature search across seven research databases, selected studies based on predefined criteria, and summarized 60 relevant studies that employed LLMs in the AIG process. We identified the most commonly used LLMs in current AIG literature, their specific applications in the AIG process, and the characteristics of the generated items. We found that LLMs are flexible and effective in generating various types of items across different languages and subject domains. However, many studies have overlooked the quality of the generated items, indicating a lack of a solid educational foundation. Therefore, we share two suggestions to enhance the educational foundation for leveraging LLMs in AIG, advocating for interdisciplinary collaborations to exploit the utility and potential of LLMs.
- Research Article
- 10.1016/j.nepr.2025.104488
- Aug 1, 2025
- Nurse education in practice
AI or nay? Evaluating the potential use of ChatGPT (Open AI) and Perplexity AI in undergraduate nursing research: An exploratory case study.
- Research Article
- 10.1152/advan.00137.2024
- Dec 1, 2024
- Advances in physiology education
The advent of artificial intelligence (AI), particularly large language models (LLMs) like ChatGPT and Gemini, has significantly impacted the educational landscape, offering unique opportunities for learning and assessment. In the realm of written assessment grading, traditionally viewed as a laborious and subjective process, this study sought to evaluate the accuracy and reliability of these LLMs in assessing the achievement of learning outcomes across different cognitive domains in a scientific inquiry course on sports physiology. Human graders and three LLMs, GPT-3.5, GPT-4o, and Gemini, were tasked with scoring submitted student assignments according to a set of rubrics aligned with various cognitive domains, namely "Understand," "Analyze," and "Evaluate" from the revised Bloom's taxonomy and "Scientific Inquiry Competency." Our findings revealed that while the LLMs demonstrated some level of competency, they do not yet meet the assessment standards of human graders. Specifically, interrater reliability (percentage agreement and correlation analysis) between human graders was superior to that between two grading rounds of each LLM. Furthermore, concordance and correlation between human and LLM graders were mostly moderate to poor, both for overall scores and across the pre-specified cognitive domains. The results suggest a future where AI could complement human expertise in educational assessment but underscore the importance of adaptive learning by educators and continuous improvement in current AI technologies to fully realize this potential. NEW & NOTEWORTHY: The advent of large language models (LLMs) such as ChatGPT and Gemini has offered new learning and assessment opportunities to integrate artificial intelligence (AI) with education. This study evaluated the accuracy of LLMs in assessing an assignment from a course on sports physiology. Concordance and correlation between human graders and LLMs were mostly moderate to poor. The findings suggest AI's potential to complement human expertise in educational assessment alongside the need for adaptive learning by educators.
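For readers unfamiliar with the reliability measures named above, the following sketch computes percentage agreement and Pearson correlation between two grading rounds on made-up rubric scores; the data and scale are hypothetical.

```python
# Sketch of the two reliability measures named in the abstract: percentage
# agreement and Pearson correlation between two grading rounds. Scores are
# made-up; the study's actual rubric scales are not reproduced here.
import numpy as np
from scipy.stats import pearsonr

round_1 = np.array([4, 3, 5, 2, 4, 3, 5, 1])  # hypothetical rubric scores
round_2 = np.array([4, 2, 5, 2, 3, 3, 5, 2])

agreement = np.mean(round_1 == round_2) * 100  # exact-match agreement, %
r, p = pearsonr(round_1, round_2)

print(f"percentage agreement: {agreement:.1f}%")
print(f"Pearson r = {r:.2f} (p = {p:.3f})")
```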
- Research Article
- 10.6018/red.603001
- May 30, 2024
- Revista de Educación a Distancia (RED)
There is a significant gap in Computing Education Research (CER) concerning the impact of Large Language Models (LLMs) in advanced stages of degree programmes. This study aims to address this gap by investigating the effectiveness of LLMs in answering exam questions within an applied machine learning final-year undergraduate course. The research examines the performance of LLMs in responding to a range of exam questions, including proctored closed-book and open-book questions spanning various levels of Bloom's Taxonomy. Question formats encompassed open-ended, tabular data-based, and figure-based inquiries. To achieve this aim, the study has the following objectives: Comparative Analysis: To compare LLM-generated exam answers with actual student submissions to assess LLM performance. Detector Evaluation: To evaluate the efficacy of LLM detectors by directly inputting LLM-generated responses into these detectors. Additionally, assess detector performance on tampered LLM outputs designed to conceal their AI-generated origin. The research methodology used for this paper incorporates a staff-student partnership model involving eight academic staff and six students. Students play integral roles in shaping the project's direction, particularly in areas unfamiliar to academic staff, such as specific tools to avoid LLM detection. This study contributes to the understanding of LLMs' role in advanced education settings, with implications for future curriculum design and assessment methodologies.
- Research Article
- 10.2196/52113
- Jan 23, 2024
- Journal of medical Internet research
Large language models such as GPT-4 (Generative Pre-trained Transformer 4) are being increasingly used in medicine and medical education. However, these models are prone to "hallucinations" (ie, outputs that seem convincing while being factually incorrect). It is currently unknown how these errors by large language models relate to the different cognitive levels defined in Bloom's taxonomy. This study aims to explore how GPT-4 performs in terms of Bloom's taxonomy using psychosomatic medicine exam questions. We used a large data set of psychosomatic medicine multiple-choice questions (N=307) with real-world results derived from medical school exams. GPT-4 answered the multiple-choice questions using 2 distinct prompt versions: detailed and short. The answers were analyzed using a quantitative approach and a qualitative approach. Focusing on incorrectly answered questions, we categorized reasoning errors according to the hierarchical framework of Bloom's taxonomy. GPT-4's performance in answering exam questions yielded a high success rate: 93% (284/307) for the detailed prompt and 91% (278/307) for the short prompt. Questions answered correctly by GPT-4 had a statistically significant higher difficulty than questions answered incorrectly (P=.002 for the detailed prompt and P<.001 for the short prompt). Independent of the prompt, GPT-4's lowest exam performance was 78.9% (15/19), thereby always surpassing the "pass" threshold. Our qualitative analysis of incorrect answers, based on Bloom's taxonomy, showed that errors were primarily in the "remember" (29/68) and "understand" (23/68) cognitive levels; specific issues arose in recalling details, understanding conceptual relationships, and adhering to standardized guidelines. GPT-4 demonstrated a remarkable success rate when confronted with psychosomatic medicine multiple-choice exam questions, aligning with previous findings. When evaluated through Bloom's taxonomy, our data revealed that GPT-4 occasionally ignored specific facts (remember), provided illogical reasoning (understand), or failed to apply concepts to a new situation (apply). These errors, which were confidently presented, could be attributed to inherent model biases and the tendency to generate outputs that maximize likelihood.
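The abstract does not name the statistical test behind the difficulty comparison, so the sketch below assumes a Mann-Whitney U test on hypothetical item-difficulty indices simply to illustrate the shape of the analysis.

```python
# Sketch of the difficulty comparison the abstract reports. The test used in
# the paper is not named here, so a Mann-Whitney U test is assumed; the item
# difficulty indices below are made-up placeholders.
import numpy as np
from scipy.stats import mannwhitneyu

difficulty_correct = np.array([0.82, 0.75, 0.91, 0.68, 0.88])    # hypothetical
difficulty_incorrect = np.array([0.55, 0.61, 0.48, 0.70, 0.52])  # hypothetical

stat, p = mannwhitneyu(difficulty_correct, difficulty_incorrect,
                       alternative="two-sided")
print(f"U = {stat}, p = {p:.3f}")
```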
- Research Article
- 10.29121/shodhkosh.v5.i5.2024.6108
- May 31, 2024
- ShodhKosh: Journal of Visual and Performing Arts
The creation of assessment questions that align with Bloom's taxonomy levels and achieve Course Outcomes (COs) is a critical yet complex task in Outcome-Based Education (OBE). Traditional manual methods, reliant on subject experts, are time-consuming and prone to gaps in addressing all COs or Bloom's levels. While Large Language Models (LLMs) like ChatGPT can generate questions, they lack access to private data, including prescribed textbooks and syllabi, potentially leading to questions beyond the scope of the curriculum. This paper presents a novel system leveraging Retrieval-Augmented Generation (RAG) to automate the generation of Bloom's taxonomy-based questions within the syllabus scope, ensuring comprehensive CO attainment. The proposed system integrates a vector database to store private data, including scanned textbooks, syllabi, Bloom's taxonomy levels, and COs. The RAG model, trained on this curated dataset, generates questions that fulfill the cognitive, psychomotor, and affective domain requirements specified in the syllabus. This approach not only ensures alignment with educational objectives but also significantly reduces the manual effort involved in question preparation. The system's efficacy is demonstrated through its ability to produce high-quality, targeted questions that effectively support OBE evaluation and enhance educational quality. This innovation addresses a critical gap in automated question generation for modern education systems.
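A minimal sketch of the retrieval step such a system might use is shown below: embed syllabus chunks, retrieve the most relevant one, and prompt the model with it. The model names, example chunks, and prompt are assumptions for illustration, not the paper's implementation.

```python
# Minimal sketch of the retrieval step in a RAG question generator: embed
# syllabus chunks, retrieve the most relevant one for a topic, and prompt the
# model with it. Model names and prompt wording are illustrative assumptions.
import numpy as np
from openai import OpenAI

client = OpenAI()

syllabus_chunks = [
    "Unit 3: Normalization in relational databases (1NF, 2NF, 3NF).",
    "Unit 5: Transaction management and ACID properties.",
]

def embed(texts):
    resp = client.embeddings.create(model="text-embedding-3-small", input=texts)
    return np.array([d.embedding for d in resp.data])

chunk_vecs = embed(syllabus_chunks)
query_vec = embed(["Write an 'Apply'-level question on normalization (CO2)."])[0]

# cosine-similarity retrieval over the stored chunks
sims = chunk_vecs @ query_vec / (
    np.linalg.norm(chunk_vecs, axis=1) * np.linalg.norm(query_vec))
context = syllabus_chunks[int(np.argmax(sims))]

prompt = (f"Using only this syllabus excerpt:\n{context}\n"
          "Generate one 'Apply'-level question that assesses CO2.")
answer = client.chat.completions.create(
    model="gpt-4", messages=[{"role": "user", "content": prompt}])
print(answer.choices[0].message.content)
```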
- Research Article
- 10.1001/jamanetworkopen.2023.46721
- Dec 7, 2023
- JAMA network open
Recent advancements in large language models (LLMs) have shown potential in a wide array of applications, including health care. While LLMs have shown heterogeneous results across specialized medical board examinations, their performance on neurology board examinations remains unexplored. The objective of this cross-sectional study, conducted between May 17 and May 31, 2023, was to assess the performance of LLMs on neurology board-style examinations. The evaluation utilized a question bank approved by the American Board of Psychiatry and Neurology and was validated with a small question cohort by the European Board for Neurology. All questions were categorized into lower-order (recall, understanding) and higher-order (apply, analyze, synthesize) questions based on the Bloom taxonomy for learning and assessment. Performance by LLM ChatGPT versions 3.5 (LLM 1) and 4 (LLM 2) was assessed in relation to overall scores, question type, and topics, along with the confidence level and reproducibility of answers. The main outcome was the overall percentage score of the 2 LLMs. LLM 2 significantly outperformed LLM 1, correctly answering 1662 of 1956 questions (85.0%) vs 1306 questions (66.8%) for LLM 1. Notably, LLM 2's performance exceeded the mean human score of 73.8%, effectively achieving near-passing and passing grades in the neurology board examination. LLM 2 outperformed human users in behavioral, cognitive, and psychological-related questions and demonstrated superior performance to LLM 1 in 6 categories. Both LLMs performed better on lower-order than higher-order questions, with LLM 2 excelling in both lower-order and higher-order questions. Both models consistently used confident language, even when providing incorrect answers. Reproducible answers of both LLMs were associated with a higher percentage of correct answers than inconsistent answers. Despite the absence of neurology-specific training, LLM 2 demonstrated commendable performance, whereas LLM 1 performed slightly below the human average. While higher-order cognitive tasks were more challenging for both models, LLM 2's results were equivalent to passing grades in specialized neurology examinations. These findings suggest that LLMs could have significant applications in clinical neurology and health care with further refinements.
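As a quick check on the headline comparison, the reported counts (1662/1956 for LLM 2 vs 1306/1956 for LLM 1) can be tested with a chi-square on the 2x2 table; the choice of test here is an assumption, since the abstract does not restate the paper's exact method.

```python
# Sketch reproducing the headline comparison from the reported counts.
# The test choice (chi-square on the 2x2 table) is an assumption.
from scipy.stats import chi2_contingency

table = [[1662, 1956 - 1662],   # LLM 2: correct, incorrect
         [1306, 1956 - 1306]]   # LLM 1: correct, incorrect
chi2, p, dof, expected = chi2_contingency(table)
print(f"chi2 = {chi2:.1f}, dof = {dof}, p = {p:.2e}")
```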
- Book Chapter
- 10.1007/978-981-97-8588-9_23
- Jan 1, 2025
In the educational world, leveraging advanced technology, particularly for accreditation tasks, presents a promising avenue for enhancing efficiency and user experience. This study implements a web application integrating the GPT-4 model via OpenAI's Application Programming Interface (API) to streamline the National Commission for Academic Accreditation & Assessment (NCAAA) accreditation for Computer Science postgraduate programs at King Abdulaziz University (KAU), Saudi Arabia. Traditionally, fulfilling these requirements entailed a substantial workload, including crafting detailed course reports and updating assessment questions to align with Course Learning Outcomes (CLOs) and Bloom's Taxonomy levels, typically consuming about 5 hours per course and resulting in delayed submissions. Our solution employs a GPT-4 Large Language Model (LLM) with prompt engineering and OpenAI's API to automate the drafting of course reports and the generation of assessment questions, reducing task completion time by approximately 90% and encouraging timely submissions. The system's asynchronous design allows for automated background processing, and its modular architecture eases development and testing in line with software engineering practice. Preliminary user feedback attests to the system's capacity to significantly ease the burden of the accreditation process, attributed to its intuitive user interface, autocomplete functionalities, and the capability to upload draft questions for assessments. This research demonstrates the potential of Artificial Intelligence (AI), particularly LLM and prompt engineering techniques, not only to improve manual accreditation tasks but also to support wider adoption and further exploration of such technologies in academic settings, thereby making the accreditation process more efficient across university departments in the Kingdom.
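A minimal sketch of the kind of asynchronous API call the chapter describes follows, assuming the OpenAI Python client; the prompt text, model, and example CLO are illustrative.

```python
# Sketch of an asynchronous background call for drafting assessment questions
# aligned with a CLO via OpenAI's API. Prompt text, model, and the CLO
# example are illustrative assumptions, not the chapter's implementation.
import asyncio
from openai import AsyncOpenAI

client = AsyncOpenAI()

async def draft_questions(clo: str, bloom_level: str) -> str:
    resp = await client.chat.completions.create(
        model="gpt-4",
        messages=[{
            "role": "user",
            "content": (f"Draft two exam questions assessing this Course "
                        f"Learning Outcome at Bloom's '{bloom_level}' level: "
                        f"{clo}"),
        }],
    )
    return resp.choices[0].message.content

print(asyncio.run(draft_questions(
    "Design a normalized relational schema for a given scenario", "Create")))
```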
- Research Article
- 10.7759/cureus.65642
- Jul 29, 2024
- Cureus
Introduction Theory question papers form an important part of assessment in medical education. As per the Competency-Based Medical Education (CBME) guidelines 2019, questions should test higher levels of cognition. This pilot study analyzes 60 question papers from different universities in Gujarat for their construct and content validity. The aim was to analyze the quality of physiology question papers from various medical universities in Gujarat to gain insights into assessment quality and its alignment with the CBME guidelines. The objectives were twofold: to evaluate the "construct validity" and "content validity" of these physiology theory question papers over the past three years according to the CBME standards. Methods An observational study using a cross-sectional records-based approach was carried out, evaluating 60 summative exam question papers in physiology from eight different universities of Gujarat for their construct and content validity. Using Bloom's taxonomy, the cognitive-domain learning level of each question was assessed, and findings were compared across the sampled papers. Results A total of 1842 questions were analyzed from the 60 question papers of eight different universities of the Gujarat state. The numbers of questions at each level of cognition in Bloom's taxonomy (remember, understand, apply, analyze, evaluate, and create) were 560 (30.40%), 434 (23.26%), 222 (12.05%), 118 (6.41%), 94 (5.10%), and 0 (0.00%), respectively. A total of 414 (22.48%) questions did not contain any verb, so they did not fit into any level of Bloom's taxonomy. The majority of questions (1773, 96.25%) were drawn from the core competencies, while a small percentage (69, 3.75%) came from the non-core competencies of physiology. Conclusion The majority of questions in the summative physiology question papers were at the "remember" and "understand" levels of Bloom's taxonomy, and roughly a quarter of the questions did not contain any verb. There is a need to incorporate more questions testing higher levels of cognition and for universities to use blueprints. Faculty training is also necessary to bring about course correction.
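The verb-based classification the study applied can be sketched as a simple lookup from a question's leading action verb to a Bloom level, with unmatched questions flagged as having no verb; the verb lists below are abbreviated illustrations.

```python
# Sketch of verb-based Bloom classification: map the leading action verb of
# each question to a Bloom level, flagging questions with no recognizable
# verb. The verb lists are abbreviated illustrations, not the study's coding.
BLOOM_VERBS = {
    "remember": {"define", "list", "name", "state"},
    "understand": {"explain", "describe", "summarize", "classify"},
    "apply": {"calculate", "demonstrate", "solve", "use"},
    "analyze": {"compare", "differentiate", "analyze", "examine"},
    "evaluate": {"justify", "critique", "assess", "evaluate"},
    "create": {"design", "construct", "formulate", "compose"},
}

def bloom_level(question: str) -> str:
    first_word = question.lower().split()[0].strip(":,.")
    for level, verbs in BLOOM_VERBS.items():
        if first_word in verbs:
            return level
    return "no verb / unclassified"

print(bloom_level("Explain the mechanism of muscle contraction."))  # understand
print(bloom_level("Mechanism of muscle contraction."))              # unclassified
```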
- Research Article
- 10.1097/corr.0000000000002704
- May 23, 2023
- Clinical orthopaedics and related research
Advances in neural networks, deep learning, and artificial intelligence (AI) have progressed recently. Previous deep learning AI has been structured around domain-specific areas that are trained on dataset-specific areas of interest that yield high accuracy and precision. A new AI model using large language models (LLM) and nonspecific domain areas, ChatGPT (OpenAI), has gained attention. Although AI has demonstrated proficiency in managing vast amounts of data, implementation of that knowledge remains a challenge. (1) What percentage of Orthopaedic In-Training Examination questions can a generative, pretrained transformer chatbot (ChatGPT) answer correctly? (2) How does that percentage compare with results achieved by orthopaedic residents of different levels, and if scoring lower than the 10th percentile relative to 5th-year residents is likely to correspond to a failing American Board of Orthopaedic Surgery score, is this LLM likely to pass the orthopaedic surgery written boards? (3) Does increasing question taxonomy affect the LLM's ability to select the correct answer choices? This study randomly selected 400 of 3840 publicly available questions based on the Orthopaedic In-Training Examination and compared the mean score with that of residents who took the test over a 5-year period. Questions with figures, diagrams, or charts were excluded, including five questions the LLM could not provide an answer for, resulting in 207 questions administered with raw score recorded. The LLM's answer results were compared with the Orthopaedic In-Training Examination ranking of orthopaedic surgery residents. Based on the findings of an earlier study, a pass-fail cutoff was set at the 10th percentile. Questions answered were then categorized based on the Buckwalter taxonomy of recall, which deals with increasingly complex levels of interpretation and application of knowledge; comparison was made of the LLM's performance across taxonomic levels and was analyzed using a chi-square test. ChatGPT selected the correct answer 47% (97 of 207) of the time, and 53% (110 of 207) of the time it answered incorrectly. Based on prior Orthopaedic In-Training Examination testing, the LLM scored in the 40th percentile for postgraduate year (PGY) 1s, the eighth percentile for PGY2s, and the first percentile for PGY3s, PGY4s, and PGY5s; based on the latter finding (and using a predefined cutoff of the 10th percentile of PGY5s as the threshold for a passing score), it seems unlikely that the LLM would pass the written board examination. The LLM's performance decreased as question taxonomy level increased (it answered 54% [54 of 101] of Tax 1 questions correctly, 51% [18 of 35] of Tax 2 questions correctly, and 34% [24 of 71] of Tax 3 questions correctly; p = 0.034). Although this general-domain LLM has a low likelihood of passing the orthopaedic surgery board examination, its testing performance and knowledge are comparable to those of a first-year orthopaedic surgery resident. The LLM's ability to provide accurate answers declines with increasing question taxonomy and complexity, indicating a deficiency in implementing knowledge. Current AI appears to perform better at knowledge- and interpretation-based inquiries, and, based on this study and other areas of opportunity, it may become an additional tool for orthopaedic learning and education.
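The taxonomy effect can be checked directly from the counts in the abstract; a 3x2 chi-square on those counts gives a p value close to the reported p = 0.034.

```python
# Worked check of the reported taxonomy effect using the counts in the
# abstract: 54/101 (Tax 1), 18/35 (Tax 2), 24/71 (Tax 3) answered correctly.
from scipy.stats import chi2_contingency

table = [[54, 101 - 54],   # Tax 1: correct, incorrect
         [18, 35 - 18],    # Tax 2
         [24, 71 - 24]]    # Tax 3
chi2, p, dof, _ = chi2_contingency(table)
print(f"chi2 = {chi2:.2f}, dof = {dof}, p = {p:.3f}")  # p close to 0.034
```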
- Research Article
- 10.1007/s10459-025-10462-3
- Aug 6, 2025
- Advances in health sciences education : theory and practice
Many medical schools primarily use multiple-choice questions (MCQs) in pre-clinical assessments due to their efficiency and consistency. However, while MCQs are easy to grade, they often fall short in evaluating higher-order reasoning and understanding student thought processes. Despite these limitations, MCQs remain popular because alternative assessments require more time and resources to grade. This study explored whether OpenAI's GPT-4o Large Language Model (LLM) could be used to effectively grade narrative short answer questions (SAQs) in case-based learning (CBL) exams when compared to faculty graders. The primary outcome was equivalence of LLM grading, assessed using a bootstrapping procedure to calculate 95% confidence intervals (CIs) for mean score differences. Equivalence was defined as the entire 95% CI falling within a ± 5% margin. Secondary outcomes included grading precision, subgroup analysis by Bloom's taxonomy, and correlation between question complexity and LLM performance. Analysis of 1,450 responses showed LLM scores were equivalent to faculty scores overall (mean difference: -0.55%, 95% CI: -1.53%, + 0.45%). Equivalence was also demonstrated for Remembering, Applying, and Analyzing questions; however, discrepancies were observed for Understanding and Evaluating questions. AI grading demonstrated high precision (ICC = 0.993, 95% CI: 0.992-0.994). Greater differences between LLM and faculty scores were found for more difficult questions (R2 = 0.6199, p < 0.0001). LLM grading could serve as a tool for preliminary scoring of student assessments, enhancing SAQ grading efficiency and improving undergraduate medical education examination quality. Secondary outcome findings emphasize the need to use these tools in combination with, not as a replacement for, faculty involvement in the grading process.
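A sketch of the equivalence procedure described above, on made-up scores: bootstrap a 95% CI for the mean LLM-minus-faculty difference and check whether the entire interval falls within the ±5% margin.

```python
# Sketch of bootstrap equivalence testing: 95% CI for the mean score
# difference, checked against a ±5% margin. Scores are made-up placeholders.
import numpy as np

rng = np.random.default_rng(0)
faculty = rng.normal(80, 10, size=200)         # hypothetical faculty % scores
llm = faculty + rng.normal(-0.5, 4, size=200)  # hypothetical LLM % scores
diff = llm - faculty

boot_means = np.array([
    rng.choice(diff, size=diff.size, replace=True).mean()
    for _ in range(10_000)
])
lo, hi = np.percentile(boot_means, [2.5, 97.5])
equivalent = -5.0 < lo and hi < 5.0  # entire CI inside the ±5% margin
print(f"mean diff = {diff.mean():.2f}%, 95% CI = ({lo:.2f}, {hi:.2f}), "
      f"equivalent: {equivalent}")
```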
- Research Article
- 10.2196/58158
- Jul 22, 2024
- Journal of medical Internet research
The efficacy of large language models (LLMs) in domain-specific medicine, particularly for managing complex diseases such as osteoarthritis (OA), remains largely unexplored. This study focused on evaluating and enhancing the clinical capabilities and explainability of LLMs in specific domains, using OA management as a case study. A domain-specific benchmark framework was developed to evaluate LLMs across a spectrum from domain-specific knowledge to clinical applications in real-world clinical scenarios. DocOA, a specialized LLM designed for OA management integrating retrieval-augmented generation and instructional prompts, was developed. It can identify the clinical evidence upon which its answers are based through retrieval-augmented generation, thereby demonstrating the explainability of those answers. The study compared the performance of GPT-3.5, GPT-4, and a specialized assistant, DocOA, using objective and human evaluations. Results showed that general LLMs such as GPT-3.5 and GPT-4 were less effective in the specialized domain of OA management, particularly in providing personalized treatment recommendations. However, DocOA showed significant improvements. This study introduces a novel benchmark framework that assesses the domain-specific abilities of LLMs in multiple aspects, highlights the limitations of generalized LLMs in clinical contexts, and demonstrates the potential of tailored approaches for developing domain-specific medical LLMs.
- Research Article
- 10.47172/2965-730x.sdgsreview.v5.n02.pe03303
- Jan 13, 2025
- Journal of Lifestyle and SDGs Review
Objective: The objective of this study is to assess the cognitive levels of Bloom's taxonomy that are emphasized in teaching speaking skills and to evaluate how students from diverse backgrounds in Pakistani universities perceive the advanced cognitive levels of Bloom's taxonomy necessary for developing the speaking skills required in the workplace. Theoretical Framework: The research's fundamental concepts and theories are based on Richards and Rodgers' language teaching model (2001). Method: This exploratory, case-based, qualitative study collected data from public and private universities in Islamabad. Data sources included the selected HEC Functional English curriculum, interviews with teachers and students, and classroom observations. Results and Discussion: The results revealed that, to improve students' speaking abilities, the HEC's speaking curriculum should emphasize all levels of the cognitive domain outlined in Bloom's taxonomy; currently, the curriculum focuses primarily on higher levels of cognitive processing. The curriculum must be revised to include effective comprehension exercises, tailored to students from various backgrounds, across all cognitive levels to help students enhance their speaking skills. Research Implication: Bloom's taxonomy is a valuable roadmap for teachers to develop students' critical skills. The findings will assist universities and curriculum designers in developing curricula aligned with Bloom's cognitive domain for the professional development of graduates. Originality/Value: This study fills a gap in the literature by examining the cognitive domain levels that build the speaking skills required in the workplace, an area that has received little attention in the Pakistani context.
- Research Article
- 10.3390/educsci15081029
- Aug 11, 2025
- Education Sciences
Educational assessment relies on well-constructed test items to measure student learning accurately, yet traditional item development is time-consuming and demands specialized psychometric expertise. Automatic item generation (AIG) offers template-based scalability, and recent large language model (LLM) advances promise to democratize item creation. However, fully automated approaches risk introducing factual errors, bias, and uneven difficulty. To address these challenges, we propose and evaluate a hybrid human-in-the-loop (HITL) framework for AIG that combines psychometric rigor with the linguistic flexibility of LLMs. In a Spring 2025 case study at Franklin University Switzerland, the instructor collaborated with ChatGPT (o4-mini-high) to generate parallel exam variants for two undergraduate business courses: Quantitative Reasoning and Data Mining. The instructor began by defining “radical” and “incidental” parameters to guide the model. Through iterative cycles of prompt, review, and refinement, the instructor validated content accuracy, calibrated difficulty, and mitigated bias. All interactions (including prompt templates, AI outputs, and human edits) were systematically documented, creating a transparent audit trail. Our findings demonstrate that a HITL approach to AIG can produce diverse, psychometrically equivalent exam forms with reduced development time, while preserving item validity and fairness, and potentially reducing cheating. This offers a replicable pathway for harnessing LLMs in educational measurement without sacrificing quality, equity, or accountability.
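The item-model idea behind such frameworks can be sketched as a template whose "radical" parameters change the numbers (and the key) across exam variants while "incidental" parameters change only surface features; the template below is a hypothetical Quantitative Reasoning item, not one from the study.

```python
# Sketch of template-based AIG with radical and incidental parameters.
# The item template is a hypothetical example, not one from the study.
import random

TEMPLATE = ("{name} invests ${principal} at {rate}% simple annual interest. "
            "How much interest is earned after {years} years?")

def generate_variant(seed: int) -> tuple[str, float]:
    rng = random.Random(seed)
    principal = rng.choice([1000, 2000, 5000])   # radical: changes the answer
    rate = rng.choice([3, 4, 5])                 # radical
    years = rng.choice([2, 3])                   # radical
    name = rng.choice(["Ana", "Bilal", "Chen"])  # incidental: surface only
    answer = principal * rate / 100 * years      # simple interest = P*r*t
    return TEMPLATE.format(name=name, principal=principal,
                           rate=rate, years=years), answer

for seed in range(2):
    stem, key = generate_variant(seed)
    print(stem, "->", key)
```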
- Research Article
- 10.1002/jcal.70160
- Dec 2, 2025
- Journal of Computer Assisted Learning
Background Sustainability education emphasises critical thinking and interdisciplinary understanding, making the assessment of students' learning outcomes complex. While Large Language Models (LLMs) have shown promise in educational assessment, their reliability in domains requiring contextual reasoning, such as sustainability, remains unclear. Objectives This study aims to evaluate the agreement between human raters and several LLMs (GPT-4o, Gemini 2.0 Flash, DeepSeek V3, LLaMA 3.3) in assessing short-answer responses from a university-level Sustainability course. It also investigates how this agreement varies across cognitive skill levels. Methods A total of 232 short-answer responses were evaluated using a rubric aligned with Bloom's Revised Taxonomy. Consensus scores from human raters were compared to LLM-generated scores using multiple statistical measures, including Quadratic Weighted Kappa (QWK), Intraclass Correlation Coefficient (ICC), Pearson correlation, and distributional overlap. Results Moderate agreement was found between LLMs and human raters in total scores (QWK: 0.585–0.640; r: 0.660–0.668; distributional overlap: 0.681–0.803). Inter-rater reliability among humans was good to excellent (ICC: 0.667–0.800). Criterion-level agreement declined as cognitive complexity increased, with notably low agreement when evaluating higher-order skills. Conclusions Overall, LLM-human agreement was moderate on total scores but declined at higher cognitive levels, indicating that LLMs are suitable for basic comprehension checks while human oversight remains necessary for complex reasoning.
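The agreement statistics named above are standard; the sketch below computes quadratic weighted kappa and Pearson r on made-up ratings to show how such figures are produced.

```python
# Sketch of two agreement statistics named in the abstract, computed on
# made-up ratings: quadratic weighted kappa (sklearn) and Pearson r (scipy).
import numpy as np
from scipy.stats import pearsonr
from sklearn.metrics import cohen_kappa_score

human = np.array([3, 2, 4, 1, 3, 4, 2, 3])  # hypothetical consensus scores
llm   = np.array([3, 2, 3, 2, 3, 4, 1, 3])  # hypothetical LLM scores

qwk = cohen_kappa_score(human, llm, weights="quadratic")
r, _ = pearsonr(human, llm)
print(f"QWK = {qwk:.3f}, Pearson r = {r:.3f}")
```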