The Efficacy of Using a Large-Language Model as an Item Writing Assistant
ABSTRACT The authors investigated using a large language model (LLM) to write test questions for a real estate licensing exam. In Study 1, items were generated by GPT-4 and rated by subject matter experts (SMEs). The items were on-topic, relevant, and generally appropriate, but attempts to manipulate item difficulty were ineffective, and matching the intended cognitive level became harder as the cognitive level increased. Study 2 compared human-written and LLM-generated items using SME and content developer ratings. The two sets were similar in blueprint alignment, relevance, factual errors, and key quality; LLM items had better stem quality and cognitive-level matching, while human-written distractors had an edge in quality. Study 3 investigated content overlap and breadth of coverage: similar prompts frequently generated overlapping content, and the range of content represented in large sets of generated items did not cover the breadth of the generating content areas. Results suggest LLMs are as good as SMEs at generating first-draft items.
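To make the drafting workflow concrete, below is a minimal Python sketch of LLM-assisted item generation using the OpenAI SDK. The prompt wording, blueprint fields, and model choice are illustrative assumptions, not the study's actual prompts; as the studies note, such drafts still need SME review.

```python
# Minimal sketch of first-draft item generation; the prompt and fields
# are assumptions, not the study's actual prompts.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def draft_item(content_area: str, cognitive_level: str) -> str:
    """Ask the model for one first-draft multiple-choice item."""
    prompt = (
        "Write one multiple-choice question for a real estate licensing exam.\n"
        f"Content area: {content_area}\n"
        f"Cognitive level: {cognitive_level}\n"
        "Provide a stem, four options (A-D), and mark the correct answer."
    )
    resp = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content

print(draft_item("agency relationships", "application"))
```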
- Research Article
- 10.2196/59641
- Aug 29, 2024
- JMIR infodemiology
Manually analyzing public health-related content from social media provides valuable insights into the beliefs, attitudes, and behaviors of individuals, shedding light on trends and patterns that can inform public understanding, policy decisions, targeted interventions, and communication strategies. Unfortunately, the time and effort needed from well-trained human subject matter experts makes extensive manual social media listening unfeasible. Generative large language models (LLMs) can potentially summarize and interpret large amounts of text, but it is unclear to what extent LLMs can glean subtle health-related meanings in large sets of social media posts and reasonably report health-related themes. We aimed to assess the feasibility of using LLMs for topic model selection or inductive thematic analysis of large collections of social media posts by attempting to answer the following question: can LLMs conduct topic model selection and inductive thematic analysis as effectively as humans did in a prior manual study, or at least reasonably, as judged by subject matter experts? We asked the same research question and used the same set of social media content for both the LLM selection of relevant topics and the LLM analysis of themes as was used in a published manual study about vaccine rhetoric. We used the results from that study as background for this LLM experiment, comparing the prior manual human analyses with analyses from 3 LLMs: GPT4-32K, Claude-instant-100K, and Claude-2-100K. We also assessed whether multiple LLMs had equivalent ability and assessed the consistency of repeated analysis from each LLM. The LLMs generally gave high rankings to the topics chosen previously by humans as most relevant. We rejected the null hypothesis (P<.001, overall comparison) and concluded that these LLMs are more likely to include the human-rated top 5 content areas in their top rankings than would occur by chance. Regarding theme identification, the LLMs identified several themes similar to those identified by humans, with very low hallucination rates. Variability occurred between LLMs and between test runs of an individual LLM. Although the LLM-generated themes did not consistently match the human-generated ones, subject matter experts still found them reasonable and relevant. LLMs can effectively and efficiently process large social media-based health-related data sets and can extract themes from such data that human subject matter experts deem reasonable. However, we were unable to show that the LLMs we tested can replicate the depth of analysis of human subject matter experts by consistently extracting the same themes from the same data. Once better validated, there is vast potential for automated LLM-based real-time social listening for common and rare health conditions, informing public health understanding of the public's interests and concerns and surfacing the public's ideas for addressing them.
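The chance-level comparison here can be made concrete with a simple null model: if an LLM's top-k ranking were random, its overlap with the human-rated top 5 would follow a hypergeometric distribution. The counts below are assumed for illustration, not the paper's data.

```python
# Null model: overlap between a random top-k ranking and the human top-5
# follows a hypergeometric distribution. All counts are assumptions.
from scipy.stats import hypergeom

N = 20  # total candidate topics (assumed)
K = 5   # human-rated top topics
k = 5   # size of the LLM's top ranking (assumed)

# P(at least m of the human top-5 land in a random top-k)
for m in range(1, K + 1):
    p = hypergeom.sf(m - 1, N, K, k)
    print(f"P(>= {m} overlaps by chance) = {p:.4f}")
```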
- Research Article
- 10.2196/72984
- Jul 31, 2025
- Journal of Medical Internet Research
Background: Recognizing patient symptoms is fundamental to medicine, research, and public health. However, symptoms are often underreported in coded formats even though they are routinely documented in physician notes. Large language models (LLMs), noted for their generalizability, could help bridge this gap by mimicking the role of human expert chart reviewers for symptom identification. Objective: The primary objective of this multisite study was to measure the accurate identification of infectious respiratory disease symptoms using LLMs instructed to follow chart review guidelines. The secondary objective was to evaluate LLM generalizability in multisite settings without the need for site-specific training, fine-tuning, or customization. Methods: Four LLMs were evaluated: GPT-4, GPT-3.5, Llama2 70B, and Mixtral 8×7B. LLM prompts were instructed to take on the role of chart reviewers and follow symptom annotation guidelines when assessing physician notes. Ground truth labels for each note were annotated by subject matter experts. Optimal LLM prompting strategies were selected using a development corpus of 103 notes from the emergency department at Boston Children’s Hospital. The performance of each LLM was measured using a test corpus with 202 notes from Boston Children’s Hospital. The performance of an International Classification of Diseases, Tenth Revision (ICD-10)–based method was also measured as a baseline. Generalizability of the most performant LLM was then measured in a validation corpus of 308 notes from 21 emergency departments in the Indiana Health Information Exchange. Results: Symptom identification accuracy was superior for every LLM tested for each infectious disease symptom compared to an ICD-10–based method (F1-score=45.1%). GPT-4 was the highest scoring (F1-score=91.4%; P<.001) and was significantly better than the ICD-10–based method, followed by GPT-3.5 (F1-score=90.0%; P<.001), Llama2 (F1-score=81.7%; P<.001), and Mixtral (F1-score=83.5%; P<.001). For the validation corpus, performance of the ICD-10–based method decreased (F1-score=26.9%), while GPT-4 increased (F1-score=94.0%), demonstrating better generalizability using GPT-4 (P<.001). Conclusions: LLMs significantly outperformed an ICD-10–based method for respiratory symptom identification in emergency department electronic health records. GPT-4 demonstrated the highest accuracy and generalizability, suggesting that LLMs may augment or replace traditional approaches. LLMs can be instructed to mimic human chart reviewers with high accuracy. Future work should assess broader symptom types and health care settings.
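The headline F1-scores reduce to standard per-symptom precision and recall against the SME labels; a minimal sketch with toy labels (not the study's data):

```python
# Per-symptom evaluation of LLM labels against SME ground truth.
# 1 = symptom present in the note, 0 = absent; both vectors are toy data.
from sklearn.metrics import f1_score, precision_score, recall_score

truth = [1, 0, 1, 1, 0, 1, 0, 0, 1, 1]  # SME annotations (assumed)
llm   = [1, 0, 1, 0, 0, 1, 1, 0, 1, 1]  # LLM predictions (assumed)

print(f"precision = {precision_score(truth, llm):.3f}")
print(f"recall    = {recall_score(truth, llm):.3f}")
print(f"F1        = {f1_score(truth, llm):.3f}")
```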
- Research Article
- 10.1080/0142159x.2025.2497891
- May 2, 2025
- Medical Teacher
Introduction The validation of multiple-choice question (MCQ)-based assessments typically requires administration to a test population, which is resource-intensive and practically demanding. Large language models (LLMs) are a promising tool to aid in many aspects of assessment development, including the challenge of determining the psychometric properties of test items. This study investigated whether LLMs could predict the difficulty and point biserial indices of MCQs, potentially alleviating the need for preliminary analysis in a test population. Methods Sixty MCQs developed by subject matter experts in anesthesiology were presented one hundred times each to five different LLMs (ChatGPT-4o, o1-preview, Claude 3.5 Sonnet, Grok-2, and Llama 3.2) and to clinical fellows. Response patterns were analyzed, and difficulty indices (proportion of correct responses) and point biserial indices (item-test score correlation) were calculated. Spearman correlation coefficients were used to compare difficulty and point biserial indices between the LLMs and fellows. Results Marked differences in response patterns were observed among LLMs: ChatGPT-4o, o1-preview, and Grok-2 showed variable responses across trials, while Claude 3.5 Sonnet and Llama 3.2 gave consistent responses. The LLMs outperformed fellows with mean scores of 58% to 85% compared to 57% for the fellows. Three LLMs showed a weak correlation with fellow difficulty indices (r = 0.28–0.29), while the two highest scoring models showed no correlation. No LLM predicted the point biserial indices. Discussion These findings suggest LLMs have limited utility in predicting MCQ performance metrics. Notably, higher-scoring models showed less correlation with human performance, suggesting that as models become more powerful, their ability to predict human performance may decrease. Understanding the consistency of an LLM’s response pattern is critical for both research methodology and practical applications in test development. Future work should focus on leveraging the language-processing capabilities of LLMs for overall assessment optimization (e.g., inter-item correlation) rather than predicting item characteristics.
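For readers unfamiliar with the two item statistics, the sketch below computes a difficulty index (proportion correct) and a point biserial index (item-total correlation) from a toy response matrix, then correlates two sets of difficulty estimates with Spearman's rho, as the study did. The data are simulated assumptions, not the study's responses.

```python
# Item statistics from a toy response matrix (rows = examinees, columns = items).
import numpy as np
from scipy.stats import pointbiserialr, spearmanr

rng = np.random.default_rng(0)
responses = rng.integers(0, 2, size=(50, 10))  # 1 = correct answer (simulated)
totals = responses.sum(axis=1)                 # each examinee's total score

difficulty = responses.mean(axis=0)  # proportion correct per item
point_biserial = np.array(
    [pointbiserialr(responses[:, j], totals)[0] for j in range(responses.shape[1])]
)

# Comparing, e.g., LLM-derived vs. fellow-derived difficulty indices:
other = np.clip(difficulty + rng.normal(0, 0.05, size=10), 0, 1)  # assumed second source
rho, p = spearmanr(difficulty, other)
print(f"Spearman rho = {rho:.2f} (p = {p:.3f})")
```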
- Research Article
- 10.1016/j.ajem.2024.12.032
- Mar 1, 2025
- The American journal of emergency medicine
Use of a large language model (LLM) for ambulance dispatch and triage.
- Research Article
- 10.1158/1557-3265.aimachine-b012
- Jul 10, 2025
- Clinical Cancer Research
Background: Large language models (LLMs) excel on standardized oncology exams; however, their broader clinical utility remains unclear. LLMs are easy to use through “prompting.” For example, a doctor or patient can provide a clinical note and ask about the probability of an adverse event (AE). Current AE prediction relies on machine learning using tabular data, which requires substantial engineering to adapt to specific tasks and settings, making such models costly and less generalizable. We compared prompting LLMs against tabular ML models to predict AEs during systemic cancer therapy. Materials and Methods: Patients with aerodigestive cancers at Princess Margaret Cancer Centre who received their first systemic therapy from 2008 to 2015 formed the development set, and from 2016 to 2018 formed the test set. We evaluated different prompting strategies with open-source LLMs using the de-identified consult and most recent pre-treatment note from each patient to predict the risk of clinical, symptom, and laboratory AEs. An ensemble of ML models was trained on tabular electronic health record data for comparison. We measured performance with the area under the receiver-operating characteristic curve (AUC). Using an established schema, an oncologist reviewed the text-based justifications from 20 random LLM predictions. Results: The cohort included 6,381 patients. Notes had a median token length of 1,737 (range 137-7,795). The LLM Qwen 2.5 14B achieved the best AUC across 14 of 19 AEs in the development set. The larger 14B model outperformed the 7B model on all targets (p = 4e-5). Among prompting strategies, no benefit was observed with the oncologist versus AI model persona (p = 0.21), chain-of-thought reasoning (p = 0.23), or concatenating tabular data to notes (p = 0.42). In the test cohort, LLMs and tabular ML showed equivalent performance for some AEs, such as death within 30 days (LLM AUC: 0.73 [95% CI 0.66, 0.80], versus [v.] ML: 0.74 [0.67, 0.81], p = 0.89) and hyperbilirubinemia (0.79 [0.72, 0.86] v. 0.78 [0.70, 0.85], p = 0.77). For other AEs, performance was numerically similar, such as death in one year (0.72 [0.70, 0.74] v. 0.76 [0.73, 0.78], p = 0.02) and anemia (0.78 [0.75, 0.80] v. 0.82 [0.8, 0.84], p = 0.01). LLMs performed worse for symptom-related AEs, such as pain (0.48 [0.44, 0.53] v. 0.69 [0.65, 0.74], p = 1e-11) and tiredness (0.49 [0.45, 0.52] v. 0.69 [0.65, 0.72], p = 2e-14). The oncologist deemed LLM justifications satisfactory across all dimensions for at least 90% of predictions, except that 20% had factual consistency errors. Conclusion: Prompted LLMs performed similarly to engineered tabular ML models for predicting several AEs, despite using only raw text from notes. Better performance with larger models suggests the gap between LLMs and ML models may continue to narrow. This work lays the foundation for using LLMs as general-purpose clinical decision-support tools for cancer care. Citation Format: Wayne Isaac T. Uy, Galileo Arturo Gonzalez Conchas, Jiang Chen He, Muammar Kabir, Baijiang Yuan, Geoffrey Liu, Sharon Narine, Melanie Powis, Benjamin Grant, Mattea Welch, Tran Truong, Robert Grant. Prompting Large Language Models to Predict Adverse Events during Cancer Treatment [abstract]. In: Proceedings of the AACR Special Conference in Cancer Research: Artificial Intelligence and Machine Learning; 2025 Jul 10-12; Montreal, QC, Canada. Philadelphia (PA): AACR; Clin Cancer Res 2025;31(13_Suppl):Abstract nr B012.
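As a sketch of the headline metric, the snippet below computes an AUC with a bootstrap 95% CI for a simulated risk score. The data are assumptions, not the cohort's, and the paired model comparisons in the abstract would additionally require a method such as DeLong's test.

```python
# AUC with a percentile-bootstrap 95% CI over simulated predictions.
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(42)
y = rng.integers(0, 2, size=500)                  # adverse event occurred (simulated)
score = y * 0.3 + rng.normal(0.4, 0.2, size=500)  # model risk estimate (simulated)

auc = roc_auc_score(y, score)
boots = [
    roc_auc_score(y[idx], score[idx])
    for idx in (rng.integers(0, len(y), len(y)) for _ in range(1000))
]
lo, hi = np.percentile(boots, [2.5, 97.5])
print(f"AUC = {auc:.2f} (95% CI {lo:.2f}, {hi:.2f})")
```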
- Research Article
- 10.1155/int/2376097
- Jan 1, 2025
- International Journal of Intelligent Systems
The emergence of large language models (LLMs) has substantially changed the artificial intelligence field, enabling their wide use across different domains. As various LLM alternatives have been developed, the current study proposes a novel decision-support framework for evaluating and benchmarking LLMs based on multicriteria decision-making (MCDM) techniques. In the proposed framework, an improved version of the best-worst method (BWM) is proposed to effectively reduce the computational complexity of assigning critical weights to the evaluation criteria for LLMs. The improved BWM is then integrated with the combined compromise solution (CoCoSo) method for ranking LLM alternatives. Findings show that the improved BWM computes the criteria weights with low computational complexity compared to the original BWM. According to the enhanced BWM, the ‘factual errors’ criterion received the highest weight (0.2681), while the ‘logical inconsistencies’ criterion received the lowest (0.0827); the remaining criteria fell between these values. Subsequently, CoCoSo ranked the LLM alternatives in two different runs based on the extracted weights. Sensitivity analysis was employed to evaluate the effect of the assessment criteria on the LLM evaluation.
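For orientation, here is a minimal sketch of the CoCoSo ranking step under assumed inputs: only the two weights quoted in the abstract come from the paper, while the score matrix, the third weight, and the benefit-type normalization are illustrative assumptions.

```python
# Sketch of the CoCoSo (combined compromise solution) ranking procedure.
import numpy as np

X = np.array([      # raw criterion scores per LLM alternative (toy data)
    [0.8, 0.6, 0.7],
    [0.5, 0.9, 0.6],
    [0.7, 0.7, 0.9],
])
w = np.array([0.2681, 0.0827, 0.6492])  # first two from the abstract; third assumed

# 1. Min-max normalization (all criteria treated as benefit-type here)
r = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))

# 2. Weighted-sum (S) and weighted-power (P) comparability measures
S = (w * r).sum(axis=1)
P = (r ** w).sum(axis=1)

# 3. Three appraisal scores, lambda = 0.5 as in the original CoCoSo paper
lam = 0.5
ka = (P + S) / (P + S).sum()
kb = S / S.min() + P / P.min()
kc = (lam * S + (1 - lam) * P) / (lam * S.max() + (1 - lam) * P.max())

# 4. Final index; alternatives ranked by descending k
k = (ka * kb * kc) ** (1 / 3) + (ka + kb + kc) / 3
print("ranking (best first):", np.argsort(-k))
```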
- Research Article
- 10.1038/s41598-024-60405-y
- May 11, 2024
- Scientific Reports
Large language models (LLMs), like ChatGPT, Google’s Bard, and Anthropic’s Claude, showcase remarkable natural language processing capabilities. Evaluating their proficiency in specialized domains such as neurophysiology is crucial to understanding their utility in research, education, and clinical applications. This study aimed to assess and compare the effectiveness of LLMs in answering neurophysiology questions in both English and Persian (Farsi) across a range of topics and cognitive levels. Twenty questions covering four topics (general, sensory system, motor system, and integrative) and two cognitive levels (lower-order and higher-order) were posed to the LLMs. Physiologists scored the essay-style answers on a scale of 0–5 points. Statistical analysis compared the scores across models, languages, topics, and cognitive levels. Qualitative analysis identified reasoning gaps. In general, the models demonstrated good performance (mean score = 3.87/5), with no significant difference between languages or cognitive levels. Performance was strongest on motor system topics (mean = 4.41) and weakest on integrative topics (mean = 3.35). Detailed qualitative analysis uncovered deficiencies in reasoning, prioritization, and knowledge integration. This study offers valuable insights into LLMs’ capabilities and limitations in the field of neurophysiology. The models demonstrate proficiency in general questions but face challenges in advanced reasoning and knowledge integration. Targeted training could address gaps in knowledge and causal reasoning. As LLMs evolve, rigorous domain-specific assessments will be crucial for evaluating advancements in their performance.
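As an illustration of the factor comparisons reported here (e.g., scores across the four topics), a nonparametric test such as Kruskal-Wallis could be applied to per-answer scores. The numbers below are toy data, and the paper does not specify this exact test.

```python
# Do physiologist scores (0-5) differ across topics? Toy data, not the study's.
from scipy.stats import kruskal

general     = [4.0, 3.5, 4.5, 4.0]
sensory     = [3.5, 4.0, 4.0, 3.0]
motor       = [4.5, 4.5, 4.0, 4.5]
integrative = [3.0, 3.5, 3.5, 3.0]

stat, p = kruskal(general, sensory, motor, integrative)
print(f"H = {stat:.2f}, p = {p:.3f}")
```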
- Research Article
- 10.21248/jlcl.38.2025.280
- Jul 8, 2025
- Journal for Language Technology and Computational Linguistics
Text-generative large language models (LLMs) offer promising possibilities for terminology work, including term extraction, definition creation and assessment of concept relations. This study examines the performance of ChatGPT, Perplexity and Microsoft CoPilot for conducting terminology work in the field of the Austrian and British higher education systems using strategic prompting frameworks. Despite efforts to refine prompts by specifying language variety and system context, the LLM outputs failed to reliably differentiate between the Austrian and German systems and fabricated terms. Factors such as the distribution of German-language training data, potential pivot translation via English and the lack of transparency in LLM training further complicated evaluation. Additionally, output variability across identical prompts highlights the unpredictability of LLM-generated terminology. The study underscores the importance of human expertise in evaluating LLM outputs, as inconsistencies may undermine the reliability of terminology derived from such models. Without domain-specific knowledge (encompassing both subject-matter expertise and familiarity with terminology principles) as well as LLM literacy, users are unable to critically assess the quality of LLM outputs in terminological contexts. Rather than indiscriminately applying LLMs to all aspects of terminology work, it is crucial to assess their suitability for specific tasks.
- Research Article
- 10.1272/jnms.jnms.2024_91-205
- Apr 25, 2024
- Journal of Nippon Medical School
Emergency physicians need a broad range of knowledge and skills to address critical medical, traumatic, and environmental conditions. Artificial intelligence (AI), including large language models (LLMs), has potential applications in healthcare settings; however, the performance of LLMs in emergency medicine remains unclear. To evaluate the reliability of information provided by ChatGPT, an LLM was given the questions set by the Japanese Association of Acute Medicine in its board certification examinations over a period of 5 years (2018-2022) and asked to answer each question twice. Statistical analysis assessed agreement between the two responses. The LLM produced answers for 465 of the 475 text-based questions, with an overall correct response rate of 62.3%. For questions without images, the rate of correct answers was 65.9%. For questions with images that were not described to the LLM, the rate of correct answers was only 52.0%. The annual rates of correct answers to questions without images ranged from 56.3% to 78.8%. Accuracy was better for scenario-based questions (69.1%) than for stand-alone questions (62.1%). Agreement between the two responses was substantial (kappa = 0.70). Factual error accounted for 82% of the incorrectly answered questions. The LLM performed satisfactorily on an emergency medicine board certification examination in Japanese when questions did not include images. However, factual errors in the responses highlight the need for physician oversight when using LLMs.
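The test-retest agreement reported here (kappa = 0.70) is a standard Cohen's kappa between the two response runs; a minimal sketch with made-up answer letters:

```python
# Agreement between two runs of the same LLM over the exam items.
from sklearn.metrics import cohen_kappa_score

run1 = list("ABCDABCADBACDBACDABC")  # first-pass answers (made up)
run2 = list("ABCDABCADBACDCACDABD")  # second-pass answers (made up)

print(f"kappa = {cohen_kappa_score(run1, run2):.2f}")
```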
- Research Article
- 10.1093/eurjcn/zvaf122.073
- Jul 24, 2025
- European Journal of Cardiovascular Nursing
Background A co-designed online support programme was developed with and for informal carers of persons with heart failure (HF) and co-created with practitioners with expertise in HF and informal care. While the programme can provide knowledge and preparedness, it may be seen as time-consuming and extensive, reducing its usefulness. Therefore, an online chatbot prototype is being developed to facilitate direct conversations addressing carers' specific needs. The prototype uses Retrieval-Augmented Generation (RAG), integrating a Large Language Model (LLM), specifically GPT-4o, provided via Copilot and Microsoft Azure, with context-specific information (i.e. the support programme). Aim This feasibility study aims to evaluate and compare responses generated by the chatbot prototype against responses from an LLM without context-specific information. Methods A 'human-in-the-loop' approach was used to evaluate and compare responses from 1) the chatbot prototype, 2) an LLM prompted to support carers of persons with HF, and 3) an 'unprompted' LLM. Evaluation criteria included the relevance of the RAG's choice of context and the correctness of responses. Incorrect answers indicated irrelevance or factual errors. An overall assessment considered alignment with the support programme's content and tonality. Five questions were chosen for evaluation, focusing on managing emotions, practical concerns related to HF, understanding HF and distinguishing its symptoms from anxiety, intimacy, and end-of-life concerns. Two researchers with expertise in heart failure and informal care conducted the evaluation. Results The RAG's context choice for each question was relevant, though three questions could have utilised more content from the support programme. Since the chatbot prototype and the prompted and unprompted LLM versions all use GPT-4o, their answers share similarities. However, the answers from the chatbot prototype were considered more appropriate for four out of five questions, aligning better with the support programme's co-designed content. This indicates alignment with carers' wishes and needs, as well as professional knowledge and experience. Some answers overinterpreted carers' situations, referring to the carer's ‘tough situation’ or suggesting they were overwhelmed, even though the question did not imply such feelings. Additionally, the LLMs often provided longer, list-based answers, which may be less reader-friendly. Conclusions The results suggest that an LLM with context-specific information provides better answers than one without it. The results motivate further prototype development to ensure answers utilise all relevant parts of the support programme, and they motivate exploring and further testing with informal carers, practitioners, designers, and researchers to evaluate the relevance and tonality of LLM responses, ensuring they align with carers' preferences when receiving AI-based chatbot support.
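The retrieval-augmented pattern described here can be sketched in a few lines: retrieve the most relevant support-programme passages, then prepend them to the LLM prompt. The retriever choice (TF-IDF), passage texts, and prompt wording are assumptions for illustration, not the project's implementation, which uses GPT-4o via Copilot and Microsoft Azure.

```python
# Minimal RAG-style prompt assembly: TF-IDF retrieval + context-grounded prompt.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

passages = [  # chunks of the co-designed support programme (illustrative)
    "Recognising the symptoms of heart failure and when to seek help...",
    "Managing worry and emotional strain as an informal carer...",
    "Planning conversations about end-of-life care...",
]

def retrieve(question: str, k: int = 2) -> list[str]:
    """Return the k passages most similar to the question."""
    vec = TfidfVectorizer().fit(passages + [question])
    sims = cosine_similarity(vec.transform([question]), vec.transform(passages))[0]
    return [passages[i] for i in sims.argsort()[::-1][:k]]

def build_prompt(question: str) -> str:
    """Prepend retrieved context so the LLM answers from programme content."""
    context = "\n".join(retrieve(question))
    return (
        "Answer the carer's question using only the context below.\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )

print(build_prompt("How do I tell heart failure symptoms from anxiety?"))
```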
- Research Article
- 10.1016/j.mcpdig.2024.09.006
- Oct 19, 2024
- Mayo Clinic Proceedings: Digital Health
Evaluating Large Language Model–Supported Instructions for Medication Use: First Steps Toward a Comprehensive Model
- Research Article
- 10.1371/journal.pone.0312078
- Dec 12, 2024
- PloS one
Policy epidemiology utilizes human subject-matter experts (SMEs) to systematically surface, analyze, and categorize legally-enforceable policies. The Analysis and Mapping of Policies for Emerging Infectious Diseases project systematically collects and assesses health-related policies from all United Nations Member States. The recent proliferation of generative artificial intelligence (GAI) tools powered by large language models has led to suggestions that such technologies be incorporated into our project and similar research efforts to decrease the human resources required. To test the accuracy and precision of GAI in identifying and interpreting health policies, we designed a study to systematically assess the responses produced by a GAI tool versus those produced by a SME. We used two validated policy datasets covering, for each United Nations Member State, emergency and childhood vaccination policy and quarantine and isolation policy. We found that the GAI tool was concordant with the SME 78.09% of the time for the vaccination dataset and 67.01% of the time for the quarantine and isolation dataset. The GAI tool also significantly hastened the data collection process. However, our analysis of non-concordant results revealed systematic inaccuracies and imprecision across different World Health Organization regions. Regarding vaccination, over 50% of countries in the African, Southeast Asian, and Eastern Mediterranean regions were inaccurately represented in GAI responses. This trend was similar for quarantine and isolation, with the African and Eastern Mediterranean regions least concordant. Furthermore, GAI responses provided laws or information missed by the SME only 2.14% of the time for the vaccination dataset and 2.48% of the time for the quarantine and isolation dataset. Notably, the GAI was least concordant with the SME when tasked with policy interpretation. These results suggest that GAI tools require further development to accurately identify policies across diverse global regions and interpret context-specific information. However, we found that GAI is a useful tool for quality assurance and quality control processes in health policy identification.
- Research Article
- 10.1152/advan.00106.2024
- Jun 14, 2025
- Advances in physiology education
Multiple choice questions (MCQs) are frequently used in medical education for assessment. Automated generation of MCQs in board-exam format could potentially save significant effort for faculty and generate a wider set of practice materials for student use. The goal of this study was to explore the feasibility of using ChatGPT by OpenAI to generate United States Medical Licensing Exam (USMLE)/Comprehensive Osteopathic Medical Licensing Examination (COMLEX-USA)-style practice quiz items as study aids. Researchers gave second-year medical students studying renal physiology access to a set of practice quizzes with ChatGPT-generated questions. The exam items generated were evaluated by independent experts for quality and adherence to the National Board of Medical Examiners (NBME)/National Board of Osteopathic Medical Examiners (NBOME) guidelines. Forty-nine percent of questions contained item writing flaws, and 22% contained factual or conceptual errors. However, 59/65 (91%) were categorized as a reasonable starting point for revision. These results demonstrate the feasibility of large language model (LLM)-generated practice questions in medical education but only when supervised by a subject matter expert with training in exam item writing. NEW & NOTEWORTHY: Practice board exam questions generated by large language models can be made suitable for preclinical medical students by subject-matter experts.
- Research Article
- 10.3390/computers14060210
- May 28, 2025
- Computers
Clinical documentation, particularly the hospital discharge report (HDR), is essential for ensuring continuity of care, yet its preparation is time-consuming and places a considerable clinical and administrative burden on healthcare professionals. Recent advancements in Generative Artificial Intelligence (GenAI) and the use of prompt engineering in large language models (LLMs) offer opportunities to automate parts of this process, improving efficiency and documentation quality while reducing administrative workload. This study aims to design a digital system based on LLMs capable of automatically generating HDRs using information from clinical course notes and emergency care reports. The system was developed through iterative cycles, integrating various instruction flows and evaluating five different LLMs combined with prompt engineering strategies and agent-based architectures. Throughout development, more than 60 discharge reports were generated and assessed, leading to continuous system refinement. In the production phase, 40 pneumology discharge reports were produced, receiving positive feedback from physicians, with an average score of 2.9 out of 4, indicating the system’s usefulness, with only minor edits needed in most cases. The ongoing expansion of the system to additional services and its integration within a hospital electronic system highlight the potential of LLMs, when combined with effective prompt engineering and agent-based architectures, to generate high-quality medical content and provide meaningful support to healthcare professionals.

Hospital discharge reports (HDRs) are pivotal for continuity of care but consume substantial clinician time. Generative AI systems based on large language models (LLMs) could streamline this process, provided they deliver accurate, multilingual, and workflow-compatible outputs. We pursued a three-stage, design-science approach. Proof of concept: five state-of-the-art LLMs were benchmarked with multi-agent prompting to produce sample HDRs and define the optimal agent structure. Prototype: 60 HDRs spanning six specialties were generated and compared with clinician originals using ROUGE, with average scores comparable to those of specialized news-summarization models in Spanish and in Catalan (where scores were lower). A qualitative audit of 27 HDR pairs showed recurrent divergences in medication dose (56%) and social context (52%). Pilot deployment: the AI-HDR service was embedded in the hospital’s electronic health record. In the pilot, 47 HDRs were autogenerated in real-world settings and reviewed by attending physicians. Missing information and factual errors were flagged in 53% and 47% of drafts, respectively, although the written assessments downplayed the importance of these errors. An LLM-driven, agent-orchestrated pipeline can safely draft real-world HDRs, cutting administrative overhead while achieving clinician-acceptable quality, though not without errors that require human supervision. Future work should refine specialty-specific prompts to curb omissions, add temporal consistency checks to prevent outdated data propagation, and validate time savings and clinical impact in multi-center trials.
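The prototype-stage comparison relies on ROUGE overlap between generated and clinician-written reports; here is a minimal sketch using the rouge-score package with placeholder texts (note the stemmer is English-oriented, whereas the study evaluated Spanish and Catalan):

```python
# ROUGE overlap between a generated draft and the clinician original.
from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(["rouge1", "rougeL"], use_stemmer=True)
reference = "Patient admitted with community-acquired pneumonia, treated with antibiotics."
generated = "Admitted for community-acquired pneumonia and treated with antibiotics."

for name, s in scorer.score(reference, generated).items():
    print(f"{name}: P={s.precision:.2f} R={s.recall:.2f} F1={s.fmeasure:.2f}")
```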
- Research Article
- 10.2196/81807
- Nov 13, 2025
- JMIR Medical Education
Background: Mock examinations are widely used in health professional education to assess learning and prepare candidates for national licensure. However, instructor-written multiple-choice items can vary in difficulty, coverage, and clarity. Recently, large language models (LLMs) have achieved high accuracy on medical examinations, highlighting their potential for assisting item-bank development; however, their educational quality remains insufficiently characterized. Objective: This study aimed to (1) identify the most accurate LLM for the Japanese National Examination for Radiological Technologists and (2) use the top model to generate blueprint-aligned multiple-choice questions and evaluate their educational quality. Methods: Four LLMs—OpenAI o3, o4-mini, o4-mini-high (OpenAI), and Gemini 2.5 Flash (Google)—were evaluated on all 200 items of the 77th Japanese National Examination for Radiological Technologists in 2025. Accuracy was analyzed for all items and for the 173 nonimage items. The best-performing model (o3) then generated 192 original items across 14 subjects by matching the official blueprint (image-based items were excluded). Subject-matter experts (≥5 y as coordinators and routine mock examination authors) independently rated each generated item on five criteria using a 5-point scale (1=unacceptable, 5=adoptable): item difficulty, factual accuracy, accuracy of content coverage, appropriateness of wording, and instructional usefulness. Cochran Q with Bonferroni-adjusted McNemar tests compared model accuracies, and one-sided Wilcoxon signed-rank tests assessed whether the median ratings exceeded 4. Results: OpenAI o3 achieved the highest accuracy overall (90.0%; 95% CI 85.1%‐93.4%) and on nonimage items (92.5%; 95% CI 87.6%‐95.6%), significantly outperforming o4-mini on the full set (P=.02). Across models, accuracy differences on the nonimage subset were not significant (Cochran Q, P=.10). Using o3, the 192 generated items received high expert ratings for item difficulty (mean, 4.29; 95% CI 4.11‐4.46), factual accuracy (4.18; 95% CI 3.98‐4.38), and content coverage (4.73; 95% CI 4.60‐4.86). Ratings were comparatively lower for appropriateness of wording (3.92; 95% CI 3.73‐4.11) and instructional usefulness (3.60; 95% CI 3.41‐3.80). For these two criteria, the tests did not support a median rating >4 (one-sided Wilcoxon, P=.45 and P≥.99, respectively). Representative low-rated examples (ratings 1‐2) and the rationale for those scores—such as ambiguous phrasing or generic explanations without linkage to stem cues—are provided in the supplementary materials. Conclusions: OpenAI o3 can generate radiological licensure items that align with national standards in terms of difficulty, factual correctness, and blueprint coverage. However, wording clarity and the pedagogical specificity of explanations were weaker and did not meet an adoptable threshold without further editorial refinement. These findings support a practical workflow in which LLMs draft syllabus-aligned items at scale, while faculty perform targeted edits to ensure clarity and formative feedback. Future studies should evaluate image-inclusive generation, use Application Programming Interface (API)-pinned model snapshots to increase reproducibility, and develop guidance to improve explanation quality for learner remediation.
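The adoption-threshold analysis reduces to a one-sided Wilcoxon signed-rank test of whether the median expert rating exceeds 4; the rating vector below is illustrative, not the study's 192 item ratings.

```python
# One-sided test of 'median rating > 4' on a toy rating vector.
import numpy as np
from scipy.stats import wilcoxon

ratings = np.array([5, 4, 4, 3, 5, 4, 5, 3, 4, 5, 5, 4, 3, 5, 4])  # toy ratings
stat, p = wilcoxon(ratings - 4, alternative="greater")  # H1: median > 4
print(f"one-sided p = {p:.3f}")
```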