Advancing Item Writing Methods with Large Language Models and Subject Matter Experts
ABSTRACT This paper examines the use of generative large language models (LLM) to produce selected-response (SR) test items. In study 1, we compared 201 prompt-engineered LLM-generated items to 52 items developed by subject matter experts (SME). We found that 60% of LLM-generated items met psychometric criteria, compared to 74% of SME items. LLM-generated items had an average difficulty (0.73) and discrimination (0.16) similar to SME items (0.78 and 0.18). However, LLM-generated items exhibited a higher frequency of psychometric flags (e.g., high difficulty) and higher initial rejection rates. In study 2, a fine-tuned LLM generated 154 SR items across two batches. Fine-tuning improved item retention rates by 16% when compared to prompt engineering and by 10% between batches. Batch performance showed stable difficulty (0.69) and improved discrimination (0.24 to .27). While LLM cannot yet replace SME-driven item development, it may offer efficiency gains when integrated into the item development model.
- Research Article
26
- 10.2196/59641
- Aug 29, 2024
- JMIR infodemiology
Manually analyzing public health-related content from social media provides valuable insights into the beliefs, attitudes, and behaviors of individuals, shedding light on trends and patterns that can inform public understanding, policy decisions, targeted interventions, and communication strategies. Unfortunately, the time and effort needed from well-trained human subject matter experts makes extensive manual social media listening unfeasible. Generative large language models (LLMs) can potentially summarize and interpret large amounts of text, but it is unclear to what extent LLMs can glean subtle health-related meanings in large sets of social media posts and reasonably report health-related themes. We aimed to assess the feasibility of using LLMs for topic model selection or inductive thematic analysis of large contents of social media posts by attempting to answer the following question: Can LLMs conduct topic model selection and inductive thematic analysis as effectively as humans did in a prior manual study, or at least reasonably, as judged by subject matter experts? We asked the same research question and used the same set of social media content for both the LLM selection of relevant topics and the LLM analysis of themes as was conducted manually in a published study about vaccine rhetoric. We used the results from that study as background for this LLM experiment by comparing the results from the prior manual human analyses with the analyses from 3 LLMs: GPT4-32K, Claude-instant-100K, and Claude-2-100K. We also assessed if multiple LLMs had equivalent ability and assessed the consistency of repeated analysis from each LLM. The LLMs generally gave high rankings to the topics chosen previously by humans as most relevant. We reject a null hypothesis (P<.001, overall comparison) and conclude that these LLMs are more likely to include the human-rated top 5 content areas in their top rankings than would occur by chance. Regarding theme identification, LLMs identified several themes similar to those identified by humans, with very low hallucination rates. Variability occurred between LLMs and between test runs of an individual LLM. Despite not consistently matching the human-generated themes, subject matter experts found themes generated by the LLMs were still reasonable and relevant. LLMs can effectively and efficiently process large social media-based health-related data sets. LLMs can extract themes from such data that human subject matter experts deem reasonable. However, we were unable to show that the LLMs we tested can replicate the depth of analysis from human subject matter experts by consistently extracting the same themes from the same data. There is vast potential, once better validated, for automated LLM-based real-time social listening for common and rare health conditions, informing public health understanding of the public's interests and concerns and determining the public's ideas to address them.
- Research Article
1
- 10.1080/0142159x.2025.2497891
- May 2, 2025
- Medical Teacher
Introduction The validation of multiple-choice question (MCQ)-based assessments typically requires administration to a test population, which is resource-intensive and practically demanding. Large language models (LLMs) are a promising tool to aid in many aspects of assessment development, including the challenge of determining the psychometric properties of test items. This study investigated whether LLMs could predict the difficulty and point biserial indices of MCQs, potentially alleviating the need for preliminary analysis in a test population. Methods Sixty MCQs developed by subject matter experts in anesthesiology were presented one hundred times each to five different LLMs (ChatGPT-4o, o1-preview, Claude 3.5 Sonnet, Grok-2, and Llama 3.2) and to clinical fellows. Response patterns were analyzed, and difficulty indices (proportion of correct responses) and point biserial indices (item-test score correlation) were calculated. Spearman correlation coefficients were used to compare difficulty and point biserial indices between the LLMs and fellows. Results Marked differences in response patterns were observed among LLMs: ChatGPT-4o, o1-preview, and Grok-2 showed variable responses across trials, while Claude 3.5 Sonnet and Llama 3.2 gave consistent responses. The LLMs outperformed fellows with mean scores of 58% to 85% compared to 57% for the fellows. Three LLMs showed a weak correlation with fellow difficulty indices (r = 0.28–0.29), while the two highest scoring models showed no correlation. No LLM predicted the point biserial indices. Discussion These findings suggest LLMs have limited utility in predicting MCQ performance metrics. Notably, higher-scoring models showed less correlation with human performance, suggesting that as models become more powerful, their ability to predict human performance may decrease. Understanding the consistency of an LLM’s response pattern is critical for both research methodology and practical applications in test development. Future work should focus on leveraging the language-processing capabilities of LLMs for overall assessment optimization (e.g., inter-item correlation) rather than predicting item characteristics.
- Conference Article
- 10.1145/3711875.3729128
- Jun 23, 2025
While large language models (LLMs) are endowed with broad knowledge, their task-specific performance is often suboptimal. Fine-tuning LLMs with task-specific data from diverse nodes is necessary, but this data is typically safeguarded and not shared publicly due to privacy concerns. A common solution involves downstream nodes downloading the LLM locally and fine-tuning it with their proprietary data. However, owners often regard pre-trained LLMs as valuable assets and are reluctant to share them. Additionally, the significant computational resources required by LLMs make local fine-tuning impractical for many nodes. To mitigate these problems, this paper proposes CrossLM, a data-free collaborative fine-tuning framework for large and small language models. CrossLM enables resource-constrained nodes to train smaller language models (SLMs) using their private task-specific data. These SLMs are subsequently leveraged to promote the task-specific natural language generation and understanding capabilities of the LLMs. Simultaneously, the SLMs of nodes also benefit from enhancement by the fine-tuned LLMs. In this way, CrossLM avoids sharing private data and proprietary LLMs, and also reduces the resource requirements of nodes. Through extensive experiments across a range of benchmark tasks and popular language models, we demonstrate that CrossLM significantly boosts the task-specific performance of both LLMs and SLMs while preserving the generalization capabilities of LLMs.
- Research Article
3
- 10.1109/access.2024.3419079
- Jan 1, 2024
- IEEE Access
Large language models’ exceptional all-purpose abilities have made human-computer conversations normal, but for particular industries and verticals, they fall short of enhancing the expertise of knowledge and the timeliness of information. In order to give current information, and provide improved search capabilities, large language models need to increasingly incorporate specialist resources and databases. In this research, a model for intelligent assisted decision-making was proposed that the model incorporates knowledge from domain-specific databases and real-time data and uses large language models to offer expert tax guidance. The research proposed to overcome the limits of general-purpose language models and deliver specialized advise for tax-related inquiries by complementing large language models with domain-specific information.The results we achieve demonstrate that by offering tax advice tailored to a given situation, and the model we proposed goes beyond the validity of general large language language models. Our contribution is that not only exploring the combination of tax area and large language model, but also proposing a new effective model for government tax department to use in real life. This study highlights the potential of big language models for use in real-world professional domains and advances the field of domain-specific human-computer interaction.
- Research Article
- 10.1093/ajcp/aqaf153
- Jan 5, 2026
- American journal of clinical pathology
To evaluate the ability of 4 artificial intelligence large language models (LLMs) to create items that align with the item writing standards of the American Board of Pathology (ABPath) for continuing certification. An informatics item writing application was developed and used with prompts based on the ABPath item writing standards. Uniform prompts were used for the LLMs evaluated, with the content of the items generated tailored to the expertise of the reviewing subject matter experts (SMEs). The SMEs were blinded to the identity of the LLM that generated each item. The 14 SMEs graded 4 written items and 4 practical items, with 1 item from each set of 4 generated from each of the LLMs. The 19 questions used for grading concentrated on item anatomy (ie, item structure), accuracy, relevance, and level of item difficulty. The overall scores for the 4 LLMs for the written items were as follows: Claude, 229 of 266 (86.1%); ChatGPT, 212 of 266 (79.7%); Llama, 175 of 266 (65.8%); and Titan, 162 of 266 (60.9%). The overall scores for the 4 LLMs for the practical items were as follows: Claude, 247 of 266 (92.9%); ChatGPT, 216 of 266 (81.2%); Llama, 175 of 266 (65.8%); and Titan, 151 of 266 (56.8%). Statistically significant differences existed between the LLMs. We observed significant differences in the ability of the 4 LLMs evaluated to draft items consistent with the ABPath guidelines based on SME scoring. It is important to assess the various LLMs available to determine which model best meets the needs of the user for the proposed task and not to assume equivalence.
- Research Article
1
- 10.1007/s40747-025-02143-w
- Dec 15, 2025
- Complex & Intelligent Systems
Within software development life-cycle, requirements guide the entire development process from inception to completion by ensuring alignment between stakeholder expectations and the final product. Requirements extraction from miscellaneous information is a challenging and complex task. Manual extraction of requirements is not only prone to human error but also contributes to increased project costs and delayed project timelines. To automate the requirement extraction process, researchers have investigated the potential of deep learning architectures, large language models (LLM) and generative language models such as ChatGPT and Gemini. However, the performance of requirements extraction could be further enhanced through the development of predictive pipelines by utilizing the combined potential of language models and deep learning architectures. To develop a powerful AI application for requirements extraction by utilizing the combined potential of LLMs and DL architectures, this study presents ReqNet framework. The framework encompasses 7 most widely used LLMs variants (small, large, Xlarge, XXlarge) and 2 DL architectures (LSTM, GRU). The framework facilitates the development of three distinct types predictive pipelines, namely standalone LLMs, LLMs + external classifiers and an ensemble of multiple LLMs representation + external classifiers. Extensive experimentation of 48 predictive pipelines across 2 public core datasets and 1 independent test set, demonstrates that predictive pipelines made up from LLMs and DL architectures generally exhibited superior performance compared to pipelines solely reliant on LLMs. In addition, a ensemble of three distinct LLMs (ALBERT, BERT and XLNet) and LSTM classifier achieved a 3% improvement in F1-score over state-of-the-art predictors on the PURE dataset, a 10% improvement on the Dronology dataset and a 3% improvement on the RFI independent test set.
- Research Article
11
- 10.1287/ijds.2023.0007
- Apr 1, 2023
- INFORMS Journal on Data Science
How Can <i>IJDS</i> Authors, Reviewers, and Editors Use (and Misuse) Generative AI?
- Research Article
183
- 10.2196/51580
- Dec 28, 2023
- Journal of Medical Internet Research
The increasing application of generative artificial intelligence large language models (LLMs) in various fields, including dentistry, raises questions about their accuracy. This study aims to comparatively evaluate the answers provided by 4 LLMs, namely Bard (Google LLC), ChatGPT-3.5 and ChatGPT-4 (OpenAI), and Bing Chat (Microsoft Corp), to clinically relevant questions from the field of dentistry. The LLMs were queried with 20 open-type, clinical dentistry-related questions from different disciplines, developed by the respective faculty of the School of Dentistry, European University Cyprus. The LLMs' answers were graded 0 (minimum) to 10 (maximum) points against strong, traditionally collected scientific evidence, such as guidelines and consensus statements, using a rubric, as if they were examination questions posed to students, by 2 experienced faculty members. The scores were statistically compared to identify the best-performing model using the Friedman and Wilcoxon tests. Moreover, the evaluators were asked to provide a qualitative evaluation of the comprehensiveness, scientific accuracy, clarity, and relevance of the LLMs' answers. Overall, no statistically significant difference was detected between the scores given by the 2 evaluators; therefore, an average score was computed for every LLM. Although ChatGPT-4 statistically outperformed ChatGPT-3.5 (P=.008), Bing Chat (P=.049), and Bard (P=.045), all models occasionally exhibited inaccuracies, generality, outdated content, and a lack of source references. The evaluators noted instances where the LLMs delivered irrelevant information, vague answers, or information that was not fully accurate. This study demonstrates that although LLMs hold promising potential as an aid in the implementation of evidence-based dentistry, their current limitations can lead to potentially harmful health care decisions if not used judiciously. Therefore, these tools should not replace the dentist's critical thinking and in-depth understanding of the subject matter. Further research, clinical validation, and model improvements are necessary for these tools to be fully integrated into dental practice. Dental practitioners must be aware of the limitations of LLMs, as their imprudent use could potentially impact patient care. Regulatory measures should be established to oversee the use of these evolving technologies.
- Research Article
3
- 10.2196/72984
- Jul 31, 2025
- Journal of Medical Internet Research
BackgroundRecognizing patient symptoms is fundamental to medicine, research, and public health. However, symptoms are often underreported in coded formats even though they are routinely documented in physician notes. Large language models (LLMs), noted for their generalizability, could help bridge this gap by mimicking the role of human expert chart reviewers for symptom identification.ObjectiveThe primary objective of this multisite study was to measure the accurate identification of infectious respiratory disease symptoms using LLMs instructed to follow chart review guidelines. The secondary objective was to evaluate LLM generalizability in multisite settings without the need for site-specific training, fine-tuning, or customization.MethodsFour LLMs were evaluated: GPT-4, GPT-3.5, Llama2 70B, and Mixtral 8×7B. LLM prompts were instructed to take on the role of chart reviewers and follow symptom annotation guidelines when assessing physician notes. Ground truth labels for each note were annotated by subject matter experts. Optimal LLM prompting strategies were selected using a development corpus of 103 notes from the emergency department at Boston Children’s Hospital. The performance of each LLM was measured using a test corpus with 202 notes from Boston Children’s Hospital. The performance of an International Classification of Diseases, Tenth Revision (ICD-10)–based method was also measured as a baseline. Generalizability of the most performant LLM was then measured in a validation corpus of 308 notes from 21 emergency departments in the Indiana Health Information Exchange.ResultsSymptom identification accuracy was superior for every LLM tested for each infectious disease symptom compared to an ICD-10–based method (F1-score=45.1%). GPT-4 was the highest scoring (F1-score=91.4%; P<.001) and was significantly better than the ICD-10–based method, followed by GPT-3.5 (F1-score=90.0%; P<.001), Llama2 (F1-score=81.7%; P<.001), and Mixtral (F1-score=83.5%; P<.001). For the validation corpus, performance of the ICD-10–based method decreased (F1-score=26.9%), while GPT-4 increased (F1-score=94.0%), demonstrating better generalizability using GPT-4 (P<.001).ConclusionsLLMs significantly outperformed an ICD-10–based method for respiratory symptom identification in emergency department electronic health records. GPT-4 demonstrated the highest accuracy and generalizability, suggesting that LLMs may augment or replace traditional approaches. LLMs can be instructed to mimic human chart reviewers with high accuracy. Future work should assess broader symptom types and health care settings.
- Research Article
2
- 10.1609/aaai.v39i1.32018
- Apr 11, 2025
- Proceedings of the AAAI Conference on Artificial Intelligence
With rapid advances, generative large language models (LLMs) dominate various Natural Language Processing (NLP) tasks from understanding to reasoning. Yet, language models' inherent vulnerabilities may be exacerbated due to increased accessibility and unrestricted model training on massive data. A malicious adversary may publish poisoned data online and conduct backdoor attacks on the victim LLMs pre-trained on the poisoned data. Backdoored LLMs behave innocuously for normal queries and generate harmful responses when the backdoor trigger is activated. Despite significant efforts paid to LLMs' safety issues, LLMs are still struggling against backdoor attacks. As Anthropic recently revealed, existing safety training strategies, including supervised fine-tuning (SFT) and Reinforcement Learning from Human Feedback (RLHF), fail to revoke the backdoors once the LLM is backdoored during the pre-training stage. In this paper, we present Simulate and Eliminate (SANDE) to erase the undesired backdoored mappings for generative LLMs. We initially propose Overwrite Supervised Fine-tuning (OSFT) for effective backdoor removal when the trigger is known. Then, to handle scenarios where trigger patterns are unknown, we integrate OSFT into our two-stage framework, SANDE. Unlike other works that assume access to cleanly trained models, our safety-enhanced LLMs are able to revoke backdoors without any reference. Consequently, our safety-enhanced LLMs no longer produce targeted responses when the backdoor triggers are activated. We conduct comprehensive experiments to show that our proposed SANDE is effective against backdoor attacks while bringing minimal harm to LLMs' powerful capability.
- Research Article
3
- 10.1145/3728971
- Jun 22, 2025
- Proceedings of the ACM on Software Engineering
Generative large language models (LLMs) have revolutionized natural language processing with their transformative and emergent capabilities. However, recent evidence indicates that LLMs can produce harmful content that violates social norms, raising significant concerns regarding the safety and ethical ramifications of deploying these advanced models. Thus, it is both critical and imperative to perform a rigorous and comprehensive safety evaluation of LLMs before deployment. Despite this need, owing to the extensiveness of LLM generation space, it still lacks a unified and standardized risk taxonomy to systematically reflect the LLM content safety, as well as automated safety assessment techniques to explore the potential risks efficiently. To bridge the striking gap, we propose S-Eval, a novel LLM-based automated Safety Evaluation framework with a newly defined comprehensive risk taxonomy. S-Eval incorporates two key components, i.e., an expert testing LLM M t and a novel safety critique LLM M c . The expert testing LLM M t is responsible for automatically generating test cases in accordance with the proposed risk management (including 8 risk dimensions and a total of 102 subdivided risks). The safety critique LLM M c can provide quantitative and explainable safety evaluations for better risk awareness of LLMs. In contrast to prior works, S-Eval differs in significant ways: (i) efficient – we construct a multi-dimensional and open-ended benchmark comprising 220,000 test cases across 102 risks utilizing M t and conduct safety evaluations for 21 influential LLMs via M c on our benchmark. The entire process is fully automated and requires no human involvement. (ii) effective – extensive validations show S-Eval facilitates a more thorough assessment and better perception of potential LLM risks, and M c not only accurately quantifies the risks of LLMs but also provides explainable and in-depth insights into their safety, surpassing comparable models such as LLaMA-Guard-2. (iii) adaptive – S-Eval can be flexibly configured and adapted to the rapid evolution of LLMs and accompanying new safety threats, test generation methods and safety critique methods thanks to the LLM-based architecture. We further study the impact of hyper-parameters and language environments on model safety, which may lead to promising directions for future research. S-Eval has been deployed in our industrial partner for the automated safety evaluation of multiple LLMs serving millions of users, demonstrating its effectiveness in real-world scenarios.
- Research Article
- 10.1111/ijsa.70021
- Aug 1, 2025
- International Journal of Selection and Assessment
ABSTRACTThe use of generative AI, specifically large language models (LLMs), in test development presents an innovative approach to efficiently creating technical, knowledge‐based assessment items. This study evaluates the efficacy of AI‐generated items compared to human‐authored counterparts within the context of employee selection testing, focusing on data science knowledge areas. Through a paired comparison approach, subject matter experts (SMEs) were asked to evaluate items produced by both LLMs and human item writers. Findings revealed a significant preference for LLM‐generated items, particularly in specific knowledge domains such as Statistical Foundations and Scientific Data Analysis. However, despite the promise of generative AI in accelerating item development, human review remains critical. Issues such as multiple correct answers or ineffective distractors in AI‐generated items necessitate thorough SME review and revision to ensure quality and validity. The study highlights the potential of integrating AI with human expertise to enhance the efficiency of item generation while maintaining psychometric standards in high‐stakes environments. The implications for psychometric practice and the necessity of domain‐specific validation are discussed, offering a framework for future research and application of AI in test development.
- Research Article
17
- 10.1002/jrsm.1749
- Aug 23, 2024
- Research synthesis methods
Existing systems for automating the assessment of risk-of-bias (RoB) in medical studies are supervised approaches that require substantial training data to work well. However, recent revisions to RoB guidelines have resulted in a scarcity of available training data. In this study, we investigate the effectiveness of generative large language models (LLMs) for assessing RoB. Their application requires little or no training data and, if successful, could serve as a valuable tool to assist human experts during the construction of systematic reviews. Following Cochrane's latest guidelines (RoB2) designed for human reviewers, we prepare instructions that are fed as input to LLMs, which then infer the risk associated with a trial publication. We distinguish between two modelling tasks: directly predicting RoB2 from text; and employing decomposition, in which a RoB2 decision is made after the LLM responds to a series of signalling questions. We curate new testing data sets and evaluate the performance of four general- and medical-domain LLMs. The results fall short of expectations, with LLMs seldom surpassing trivial baselines. On the direct RoB2 prediction test set (n = 5993), LLMs perform akin to the baselines (F1: 0.1-0.2). In the decomposition task setup (n = 28,150), similar F1 scores are observed. Our additional comparative evaluation on RoB1 data also reveals results substantially below those of a supervised system. This testifies to the difficulty of solving this task based on (complex) instructions alone. Using LLMs as an assisting technology for assessing RoB2 thus currently seems beyond their reach.
- Research Article
67
- 10.1111/bjet.13494
- Jun 4, 2024
- British Journal of Educational Technology
This study investigates the validity and reliability of generative large language models (LLMs), specifically ChatGPT and Google's Bard, in grading student essays in higher education based on an analytical grading rubric. A total of 15 experienced English as a foreign language (EFL) instructors and two LLMs were asked to evaluate three student essays of varying quality. The grading scale comprised five domains: grammar, content, organization, style & expression and mechanics. The results revealed that fine‐tuned ChatGPT model demonstrated a very high level of reliability with an intraclass correlation (ICC) score of 0.972, Default ChatGPT model exhibited an ICC score of 0.947 and Bard showed a substantial level of reliability with an ICC score of 0.919. Additionally, a significant overlap was observed in certain domains when comparing the grades assigned by LLMs and human raters. In conclusion, the findings suggest that while LLMs demonstrated a notable consistency and potential for grading competency, further fine‐tuning and adjustment are needed for a more nuanced understanding of non‐objective essay criteria. The study not only offers insights into the potential use of LLMs in grading student essays but also highlights the need for continued development and research. Practitioner notes What is already known about this topic Large language models (LLMs), such as OpenAI's ChatGPT and Google's Bard, are known for their ability to generate text that mimics human‐like conversation and writing. LLMs can perform various tasks, including essay grading. Intraclass correlation (ICC) is a statistical measure used to assess the reliability of ratings given by different raters (in this case, EFL instructors and LLMs). What this paper adds The study makes a unique contribution by directly comparing the grading performance of expert EFL instructors with two LLMs—ChatGPT and Bard—using an analytical grading scale. It provides robust empirical evidence showing high reliability of LLMs in grading essays, supported by high ICC scores. It specifically highlights that the overall efficacy of LLMs extends to certain domains of essay grading. Implications for practice and/or policy The findings open up potential new avenues for utilizing LLMs in academic settings, particularly for grading student essays, thereby possibly alleviating workload of educators. The paper's insistence on the need for further fine‐tuning of LLMs underlines the continual interplay between technological advancement and its practical applications. The results lay down a footprint for future research in advancing the use of AI in essay grading.
- Research Article
6
- 10.1017/pan.2025.10017
- Sep 19, 2025
- Political Analysis
Codebooks—documents that operationalize concepts and outline annotation procedures—are used almost universally by social scientists when coding political texts. To code these texts automatically, researchers are increasingly turning to generative large language models (LLMs). However, there is limited empirical evidence on whether “off-the-shelf” LLMs faithfully follow real-world codebook operationalizations and measure complex political constructs with sufficient accuracy. To address this, we gather and curate three real-world political science codebooks—covering protest events, political violence, and manifestos—along with their unstructured texts and human-coded labels. We also propose a five-stage framework for codebook-LLM measurement: Preparing a codebook for both humans and LLMs, testing LLMs’ basic capabilities on a codebook, evaluating zero-shot measurement accuracy (i.e., off-the-shelf performance), analyzing errors, and further (parameter-efficient) supervised training of LLMs. We provide an empirical demonstration of this framework using our three codebook datasets and several pre-trained 7–12 billion open-weight LLMs. We find current open-weight LLMs have limitations in following codebooks zero-shot, but that supervised instruction-tuning can substantially improve performance. Rather than suggesting the “best” LLM, our contribution lies in our codebook datasets, evaluation framework, and guidance for applied researchers who wish to implement their own codebook-LLM measurement projects.