Exploring Automated Distractor Generation for Math Multiple-choice Questions via Large Language Models
Multiple-choice questions (MCQs) are ubiquitous at almost all levels of education since they are easy to administer and grade, and are a reliable format for assessment and practice. One of the most important aspects of MCQs is the distractors, i.e., incorrect options that are designed to target common errors or misconceptions among real students. To date, the task of crafting high-quality distractors largely remains a labor- and time-intensive process for teachers and learning content designers, which has limited scalability. In this work, we study the task of automated distractor generation in the domain of math MCQs and explore a wide variety of large language model (LLM)-based approaches, from in-context learning to fine-tuning. We conduct extensive experiments using a real-world math MCQ dataset and find that although LLMs can generate some mathematically valid distractors, they are less adept at anticipating common errors or misconceptions among real students.
* As of now, OpenAI does not allow fine-tuning of GPT-4.
- Research Article
- 10.1016/j.chest.2025.11.034
- Dec 1, 2025
- Chest
Evaluating the Accuracy of Large Language Models in Answering Asthma Multiple Choice and Objective Structured Clinical Examination Questions.
- Conference Article
- 10.1145/3711875.3729128
- Jun 23, 2025
While large language models (LLMs) are endowed with broad knowledge, their task-specific performance is often suboptimal. Fine-tuning LLMs with task-specific data from diverse nodes is necessary, but this data is typically safeguarded and not shared publicly due to privacy concerns. A common solution involves downstream nodes downloading the LLM locally and fine-tuning it with their proprietary data. However, owners often regard pre-trained LLMs as valuable assets and are reluctant to share them. Additionally, the significant computational resources required by LLMs make local fine-tuning impractical for many nodes. To mitigate these problems, this paper proposes CrossLM, a data-free collaborative fine-tuning framework for large and small language models. CrossLM enables resource-constrained nodes to train smaller language models (SLMs) using their private task-specific data. These SLMs are subsequently leveraged to promote the task-specific natural language generation and understanding capabilities of the LLMs. Simultaneously, the SLMs of nodes also benefit from enhancement by the fine-tuned LLMs. In this way, CrossLM avoids sharing private data and proprietary LLMs, and also reduces the resource requirements of nodes. Through extensive experiments across a range of benchmark tasks and popular language models, we demonstrate that CrossLM significantly boosts the task-specific performance of both LLMs and SLMs while preserving the generalization capabilities of LLMs.
- Research Article
- 10.32473/flairs.38.1.138995
- May 14, 2025
- The International FLAIRS Conference Proceedings
Code completion problems are an effective type of formative assessment; especially, when used to practice newly learned concepts or topics. While there is a growing body of research in computing education on the use of large language models (LLMs) to support learning content development, the use of LLMs for producing high-quality code completion problems has not yet been explored. In this paper, we analyze the capability of LLMs to generate effective distractors (i.e., plausible but incorrect options) and explanations for completion problems. We utilize common student misconceptions to improve the quality of the generated distractors. Our study suggests that LLMs are capable of generating reasonable distractors and explanations. At the same time, we identify a lack of a sufficiently granular taxonomy of common student misconceptions that would be needed for aligning the generated distractors with the common misconceptions and errors -- a gap that should be addressed in future work.
- Research Article
16
- 10.7759/cureus.81871
- Apr 8, 2025
- Cureus
Background: Previous research has highlighted the potential of large language models (LLMs) in answering multiple-choice questions (MCQs) in medical physiology. However, their accuracy and reliability in specialized fields, such as blood physiology, remain underexplored. This study evaluates the performance of six free-to-use LLMs (ChatGPT, Claude, DeepSeek, Gemini, Grok, and Le Chat) in solving item-analyzed MCQs on blood physiology. The findings aim to assess their suitability as educational aids. Methods: This cross-sectional study at the All India Institute of Medical Sciences, Raebareli, India, involved administering a 40-item MCQ test on blood physiology to 75 first-year medical students. Item analysis utilized the Difficulty Index (DIF I), Discrimination Index (DI), and Distractor Effectiveness (DE). Internal consistency was assessed with the Kuder-Richardson 20 (KR-20) coefficient. These 40 item-analyzed MCQs were presented to six selected LLMs (ChatGPT, Claude, DeepSeek, Gemini, Grok, Le Chat) available as standalone Android applications on March 19, 2025. Three independent users accessed each LLM simultaneously, uploading the compiled MCQs in a Portable Document Format (PDF) file. Accuracy was determined as the percentage of correct responses averaged across all three users. Reliability was measured as the percentage of MCQs that the LLM answered correctly for all three users. Descriptive statistics were presented as mean ± standard deviation and percentages. Pearson's correlation coefficient or Spearman's rho was used to evaluate the associations between variables, with p < 0.05 considered significant. Results: Item analysis confirmed the validity and reliability of the assessment tool, with a DIF I of 63.2 ± 20.4, a DI of 0.38 ± 0.20, a DE of 66.7 ± 33.3, and a KR-20 of 0.804. Among LLMs, Claude 3.7 demonstrated the highest reliable accuracy (95%), followed by DeepSeek (93%), Grok 3 beta (93%), ChatGPT (90%), Gemini 2.0 (88%), and Mistral Le Chat (70%). No significant correlations were found between LLM performance and MCQ difficulty, discrimination power, or distractor effectiveness. Conclusions: The MCQ assessment tool exhibited an appropriate difficulty level, strong discriminatory power, and adequately constructed distractors. LLMs, particularly Claude, DeepSeek, and Grok, demonstrated high accuracy and reliability in solving blood physiology MCQs, supporting their role as supplementary educational tools. LLMs handled questions of varying difficulty, discrimination power, and distractor effectiveness with similar competence. However, given occasional errors, they should be used alongside traditional teaching methods and expert supervision.
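The accuracy and reliability definitions in the abstract above can be illustrated with a minimal sketch. The function name, the boolean input layout, and the sample data are assumptions made for illustration and are not taken from the study.

```python
from typing import List, Tuple


def accuracy_and_reliability(runs: List[List[bool]]) -> Tuple[float, float]:
    """Compute accuracy and reliability as percentages.

    runs[u][q] is True when the session of user u answered question q correctly.
    Accuracy: percentage correct, averaged across users.
    Reliability: percentage of questions answered correctly for every user.
    """
    n_questions = len(runs[0])
    if n_questions == 0 or any(len(r) != n_questions for r in runs):
        raise ValueError("every run must cover the same non-empty question set")
    per_user = [100.0 * sum(r) / n_questions for r in runs]
    accuracy = sum(per_user) / len(per_user)
    always_correct = sum(all(r[q] for r in runs) for q in range(n_questions))
    reliability = 100.0 * always_correct / n_questions
    return accuracy, reliability


# Hypothetical data: three users, four questions.
example_runs = [
    [True, True, False, True],
    [True, False, False, True],
    [True, True, True, True],
]
accuracy, reliability = accuracy_and_reliability(example_runs)
```

Reliability can never exceed the lowest single-user accuracy, because a question only counts when every session got it right.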
- Research Article
2
- 10.1080/10872981.2025.2592430
- Nov 29, 2025
- Medical Education Online
Large language models (LLMs) are increasingly used in healthcare and medical education, but their performance on institution-authored multiple-choice questions (MCQs), particularly with negative marking, remains unclear. We aimed to compare the examination performance of five contemporary LLMs with that of enrolled medical students on final multiple-choice (MCQ-style) course exams across four clinical courses. We conducted a comparative cross-sectional study at Miguel Hernández University (Spain) in 2025. Final exams in Infectious Diseases, Neurology, Respiratory Medicine, and Cardiovascular Medicine were administered under routine conditions in Spanish. Five LLMs (OpenAI o1, GPT-4o, DeepSeek R1, Microsoft Copilot, and Google Gemini 1.5 Flash) completed all MCQs in two independent runs. Scores were averaged and test–retest agreement was estimated with Gwet’s AC1. Student scores (n = 442) were summarized as mean ± SD or median (IQR). Pairwise differences between models were explored with McNemar’s test; student–LLM contrasts were descriptive. Across courses, LLMs consistently exceeded the student median and, in several instances, the highest student score. Mean LLM course scores ranged from 7.46 to 9.88, versus student means of 4.28 to 7.32. OpenAI o1 achieved the highest mean in three courses; Copilot led in Cardiovascular Medicine (text-only subset due to image limitations). All LLMs answered every MCQ, and short-term test–retest agreement was high (AC1 0.79–1.00). Aggregated across courses, LLMs averaged 8.75 compared with 5.76 for students. On department-set Spanish MCQ exams with negative marking, LLMs outperformed enrolled medical students, answered every item, and showed high short-term reproducibility. These findings support cautious, faculty-supervised use of LLMs as adjuncts to MCQ assessment (e.g., automated pretesting, feedback). Confirmation across institutions, languages, and image-rich formats, and evaluation of educational impact beyond accuracy, are needed.
- Research Article
11
- 10.1287/ijds.2023.0007
- Apr 1, 2023
- INFORMS Journal on Data Science
How Can IJDS Authors, Reviewers, and Editors Use (and Misuse) Generative AI?
- Research Article
1
- 10.1080/0142159x.2025.2497891
- May 2, 2025
- Medical Teacher
Introduction: The validation of multiple-choice question (MCQ)-based assessments typically requires administration to a test population, which is resource-intensive and practically demanding. Large language models (LLMs) are a promising tool to aid in many aspects of assessment development, including the challenge of determining the psychometric properties of test items. This study investigated whether LLMs could predict the difficulty and point biserial indices of MCQs, potentially alleviating the need for preliminary analysis in a test population. Methods: Sixty MCQs developed by subject matter experts in anesthesiology were presented one hundred times each to five different LLMs (ChatGPT-4o, o1-preview, Claude 3.5 Sonnet, Grok-2, and Llama 3.2) and to clinical fellows. Response patterns were analyzed, and difficulty indices (proportion of correct responses) and point biserial indices (item-test score correlation) were calculated. Spearman correlation coefficients were used to compare difficulty and point biserial indices between the LLMs and fellows. Results: Marked differences in response patterns were observed among LLMs: ChatGPT-4o, o1-preview, and Grok-2 showed variable responses across trials, while Claude 3.5 Sonnet and Llama 3.2 gave consistent responses. The LLMs outperformed fellows with mean scores of 58% to 85% compared to 57% for the fellows. Three LLMs showed a weak correlation with fellow difficulty indices (r = 0.28–0.29), while the two highest scoring models showed no correlation. No LLM predicted the point biserial indices. Discussion: These findings suggest LLMs have limited utility in predicting MCQ performance metrics. Notably, higher-scoring models showed less correlation with human performance, suggesting that as models become more powerful, their ability to predict human performance may decrease. Understanding the consistency of an LLM’s response pattern is critical for both research methodology and practical applications in test development. 
Future work should focus on leveraging the language-processing capabilities of LLMs for overall assessment optimization (e.g., inter-item correlation) rather than predicting item characteristics.
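The two item statistics named in the abstract above, the difficulty index (proportion of correct responses) and the point biserial index (item-test score correlation), can be sketched as follows. This is a minimal illustration: the function names and sample data are hypothetical, and the correlation is computed against the uncorrected total score (the item itself is included in the total), which may differ from what the authors did.

```python
import math
from typing import Sequence


def difficulty_index(item_scores: Sequence[int]) -> float:
    """Proportion of test takers who answered the item correctly (scores are 0 or 1)."""
    return sum(item_scores) / len(item_scores)


def point_biserial(item_scores: Sequence[int], total_scores: Sequence[float]) -> float:
    """Pearson correlation between a binary item score and the total test score."""
    n = len(item_scores)
    if n != len(total_scores) or n < 2:
        raise ValueError("need two equally long sequences with at least two entries")
    mean_item = sum(item_scores) / n
    mean_total = sum(total_scores) / n
    cov = sum((i - mean_item) * (t - mean_total)
              for i, t in zip(item_scores, total_scores))
    var_item = sum((i - mean_item) ** 2 for i in item_scores)
    var_total = sum((t - mean_total) ** 2 for t in total_scores)
    if var_item == 0 or var_total == 0:
        return float("nan")  # undefined when everyone scores the same
    return cov / math.sqrt(var_item * var_total)


# Hypothetical data: four test takers, one item, and their total scores.
item = [1, 1, 0, 0]
totals = [4, 3, 2, 1]
p = difficulty_index(item)
r_pb = point_biserial(item, totals)
```

An item answered identically by every respondent has zero variance, so its point biserial is undefined; this is one reason a model that gives the same answer on every trial cannot yield this index on its own.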
- Research Article
- 10.3390/dj14020072
- Jan 27, 2026
- Dentistry journal
Objective: This study aimed to compare the accuracy of two large language models (LLMs), ChatGPT (version 3.5) and Google Gemini (formerly Bard), in answering dental caries-related multiple-choice questions (MCQs) using a simulated student examination framework across seven examination lengths. Materials and Methods: A total of 125 validated dental caries MCQs were extracted from Dental Decks and Oxford University Press question banks. Seven examination groups were constructed with varying question counts (25, 35, 45, 55, 65, 75, and 85 questions). For each group, 100 simulations were generated per LLM (ChatGPT and Gemini), resulting in 1400 simulated examinations. Each simulated student received a unique randomized subset of questions. MCQs were answered by each LLM using a standardized prompt to minimize ambiguity. Outcomes included mean score, passing rate (≥60%), and performance differences between LLMs. Statistical analyses included independent t-tests, one-way ANOVA within each LLM, and two-way ANOVA examining interactions between LLM type and question count. Results: Across all seven examination formats, Gemini significantly outperformed ChatGPT (p < 0.001). Gemini achieved higher passing rates and higher mean scores in every examination length. One-way ANOVA revealed significant score variation with increasing exam length for both LLMs (p < 0.05). Two-way ANOVA demonstrated significant main effects of LLM type and question count, with no significant interaction. Randomization had no measurable effect on Gemini performance but influenced ChatGPT scores. Conclusions: Gemini demonstrated superior accuracy and higher passing rates compared to ChatGPT in all simulated examination formats. While both LLMs struggled with complex caries-related content, Gemini provided more reliable performance across question quantities. 
Educators should exercise caution in relying on LLMs for automated assessment or self-study, and future research should evaluate human-AI hybrid models and LLM performance across broader dental domains.
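The simulated-examination design described above (random question subsets, a mean score, and a passing rate at 60% or higher) can be sketched roughly as below. This assumes that each question already has a fixed correct/incorrect outcome per model, which the abstract does not confirm; the function name, seed handling, and data are hypothetical.

```python
import random
from typing import List, Sequence, Tuple

EXAM_LENGTHS = [25, 35, 45, 55, 65, 75, 85]  # question counts from the abstract
SIMULATIONS_PER_GROUP = 100                   # per LLM, from the abstract


def simulate_exams(correct: Sequence[bool], exam_length: int,
                   n_sim: int = SIMULATIONS_PER_GROUP,
                   pass_mark: float = 0.60, seed: int = 0) -> Tuple[float, float]:
    """Return (mean score, passing rate) over n_sim randomized exams.

    correct[q] is True when the model answered question q correctly
    (assumed fixed per question for this sketch).
    """
    rng = random.Random(seed)
    scores: List[float] = []
    for _ in range(n_sim):
        # Each simulated student receives a random subset without repeats.
        subset = rng.sample(range(len(correct)), exam_length)
        scores.append(sum(correct[q] for q in subset) / exam_length)
    mean_score = sum(scores) / n_sim
    pass_rate = sum(s >= pass_mark for s in scores) / n_sim
    return mean_score, pass_rate


# Seven exam lengths x 100 simulations x 2 LLMs gives the 1400 exams reported.
total_exams = len(EXAM_LENGTHS) * SIMULATIONS_PER_GROUP * 2
```

Under this assumption, longer exams draw a larger share of the 125-question pool, so simulated scores cluster more tightly around the model's overall accuracy as exam length grows.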
- Abstract
3
- 10.1182/blood-2023-185854
- Nov 2, 2023
- Blood
Evaluating the Performance of Large Language Models in Hematopoietic Stem Cell Transplantation Decision Making
- Research Article
- 10.3390/diagnostics15151848
- Jul 22, 2025
- Diagnostics
Background/Objectives: Otolaryngologists are increasingly using head and neck ultrasound (HNUS). Determining whether a practitioner of HNUS has achieved adequate theoretical knowledge remains a challenge. This study assesses the performance of two large language models (LLMs) in generating multiple-choice questions (MCQs) for head and neck ultrasound, compared with MCQs generated by physicians. Methods: Physicians and LLMs (ChatGPT, GPT4o, and Google Gemini, Gemini Advanced) created a total of 90 MCQs that covered the topics of lymph nodes, thyroid, and salivary glands. Experts in HNUS additionally evaluated all physician-drafted MCQs using a Delphi-like process. The MCQs were assessed by an international panel of experts in HNUS, who were blinded to the source of the questions. Using a Likert scale, the evaluation was based on an overall assessment including six assessment criteria: clarity, relevance, suitability, quality of distractors, adequate rationale of the answer, and an assessment of the level of difficulty. Results: Four experts in the clinical field of HNUS assessed the 90 MCQs. No significant differences were observed between the two LLMs. Physician-drafted questions (n = 30) had significant differences with Google Gemini in terms of relevance, suitability, and adequate rationale of the answer, but only significant differences in terms of suitability compared with ChatGPT. Compared to MCQ items (n = 16) validated by medical experts, LLM-constructed MCQ items scored significantly lower across all criteria. The difficulty level of the MCQs was the same. Conclusions: Our study demonstrates that both LLMs could be used to generate MCQ items with a quality comparable to drafts from physicians. However, the quality of LLM-generated MCQ items was still significantly lower than MCQs validated by ultrasound experts. 
LLMs are therefore cost-effective to generate a quick draft for MCQ items that afterward should be validated by experts before being used for assessment purposes. In this way, the value of LLM is not the elimination of humans, but rather vastly superior time management.
- Research Article
6
- 10.1002/bcp.70137
- Jun 10, 2025
- British journal of clinical pharmacology
In medical education, the ability of large language models (LLMs) to match human performance raises questions about their potential as educational tools. This study evaluates LLMs' performance on Clinical Pharmacology and Therapeutics (CPT) exams, comparing their results to medical students and exploring their ability to identify poorly formulated multiple-choice questions (MCQs). ChatGPT-4 Omni, Gemini Advanced, Le Chat and DeepSeek R1 were tested on local CPT exams (third year of bachelor's degree, first/second year of master's degree) and the European Prescribing Exam (EuroPE+). The exams included MCQs and open-ended questions assessing knowledge and prescribing skills. LLM results were analysed using the same scoring system as students. A confusion matrix was used to evaluate the ability of ChatGPT and Gemini to identify ambiguous/erroneous MCQs. LLMs achieved comparable or superior results to medical students across all levels. For local exams, LLMs outperformed M1 students and matched L3 and M2 students. In EuroPE+, LLMs significantly outperformed students in both the knowledge and prescribing skills sections. All LLM errors in EuroPE+ were genuine (100%), whereas local exam errors were frequently due to ambiguities or correction flaws (24.3%). When both ChatGPT and Gemini provided the same incorrect answer to an MCQ, the specificity for detecting ambiguous questions was 92.9%, with a negative predictive value of 85.5%. LLMs demonstrate capabilities comparable to or exceeding medical students in CPT exams. Their ability to flag potentially flawed MCQs highlights their value not only as educational tools but also as quality control instruments in exam preparation.
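The specificity and negative predictive value quoted above follow from a standard 2x2 confusion matrix. The sketch below shows the two formulas; the class name and the counts are made up for illustration and are not the study's data.

```python
from dataclasses import dataclass


@dataclass
class ConfusionMatrix:
    """Counts for a binary flag: 'positive' means the MCQ was flagged as ambiguous."""
    tp: int  # flagged, and truly ambiguous
    fp: int  # flagged, but actually sound
    fn: int  # not flagged, but truly ambiguous
    tn: int  # not flagged, and sound

    def specificity(self) -> float:
        """Share of sound questions that were correctly left unflagged."""
        return self.tn / (self.tn + self.fp)

    def negative_predictive_value(self) -> float:
        """Share of unflagged questions that are in fact sound."""
        return self.tn / (self.tn + self.fn)


# Hypothetical counts, chosen only to exercise the formulas.
cm = ConfusionMatrix(tp=5, fp=5, fn=10, tn=80)
spec = cm.specificity()               # 80 / 85
npv = cm.negative_predictive_value()  # 80 / 90
```

High specificity with a lower negative predictive value, as in the abstract, means that few sound questions are wrongly flagged while some flawed questions still go undetected.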
- Research Article
- 10.1007/s11695-025-08418-y
- Dec 11, 2025
- Obesity surgery
The rapid integration of Large Language Models (LLMs) into healthcare necessitates a rigorous evaluation of their performance in specialized medical fields. In metabolic bariatric surgery (MBS), LLMs have the potential to revolutionize education and clinical support, yet their accuracy and reliability are not well-established. This study provides a critical assessment of the capabilities of current LLMs in the context of MBS. This cross-sectional validation study assessed the performance of six LLMs (ChatGPT-3.5, ChatGPT-4o, Gemini, Copilot, GROK, and DeepSeek) in answering 100 evidence-based binary and multiple-choice questions related to MBS. Questions were constructed from international guidelines and categorized into six thematic domains. Expert consensus answers served as the reference standard, with inter-rater reliability measured using Fleiss’ κ. Model outputs were scored for accuracy. Comparisons across LLMs were first assessed using an overall test for differences between multiple related groups. Pairwise comparisons were then conducted between LLMs to identify specific differences in performance. Across the dataset, the mean number of correct LLM responses per question was 3.9 (SD = 1.8). ChatGPT-4o achieved the highest accuracy (66.0%), while DeepSeek recorded the lowest (60.0%). Accuracy varied across domains, highest for indications/contraindications (78.7%) and complications/management (68.0%), and lowest for preoperative preparation (52.0%) and postoperative care (58.4%). Binary questions yielded higher accuracy (69.1%) than multiple-choice questions (62.0%). Inter-expert reliability was substantial (κ = 0.742, 95% CI: 0.71–0.77). Agreement between LLMs and experts ranged from fair (DeepSeek κ = 0.349) to moderate (ChatGPT-4o κ = 0.446). No significant accuracy differences were detected across models (Friedman test, p = 0.662). LLMs represent a promising, yet imperfect, adjunct in MBS education. 
Their utility is currently limited by inconsistencies in accuracy, particularly in areas requiring nuanced clinical judgment. While these models can supplement traditional learning resources, they are not yet a substitute for expert clinical guidance. This study underscores the need for continued refinement and validation of LLMs to ensure their safe and effective integration into clinical practice. LLMs show moderate accuracy in bariatric surgery education, strongest in guideline-based domains. Newer models (ChatGPT-4o, Gemini, Copilot) performed slightly better, but gains were modest. Accuracy was higher for binary than multiple-choice questions.
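Inter-rater reliability in the abstract above is reported as Fleiss' κ. A minimal sketch of the standard formula follows, assuming every question is rated by the same number of raters; the function name and toy rating tables are hypothetical.

```python
from typing import List


def fleiss_kappa(counts: List[List[int]]) -> float:
    """Fleiss' kappa for a table where counts[i][j] is the number of raters
    who assigned item i to category j. Every row must sum to the same total."""
    n_items = len(counts)
    n_raters = sum(counts[0])
    if n_raters < 2 or any(sum(row) != n_raters for row in counts):
        raise ValueError("each item needs the same number (>= 2) of ratings")
    # Observed agreement per item, then averaged over items.
    p_i = [(sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
           for row in counts]
    p_bar = sum(p_i) / n_items
    # Chance agreement from the overall category proportions.
    n_categories = len(counts[0])
    p_j = [sum(row[j] for row in counts) / (n_items * n_raters)
           for j in range(n_categories)]
    p_e = sum(p * p for p in p_j)
    if p_e == 1.0:
        return float("nan")  # all ratings fall in one category
    return (p_bar - p_e) / (1.0 - p_e)


# Toy tables: two items, three raters, two answer categories.
perfect = fleiss_kappa([[3, 0], [0, 3]])
split = fleiss_kappa([[2, 1], [1, 2]])
```

Values near 1 indicate agreement well above chance, and values at or below 0 indicate agreement no better than chance.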
- Research Article
9
- 10.2196/69910
- May 20, 2025
- Journal of medical Internet research
Large language models (LLMs), such as OpenAI's GPT-3.5, GPT-4, and GPT-4o, have garnered early and significant enthusiasm for their potential applications within mental health, ranging from documentation support to chat-bot therapy. Understanding the accuracy and reliability of the psychiatric "knowledge" stored within the parameters of these models and developing measures of confidence in their responses (ie, the likelihood that an LLM response is accurate) are crucial for the safe and effective integration of these tools into mental health settings. This study aimed to assess the accuracy, reliability, and predictors of accuracy of GPT-3.5 (175 billion parameters), GPT-4 (approximately 1.8 trillion parameters), and GPT-4o (an optimized version of GPT-4 with unknown parameters) with standardized psychiatry multiple-choice questions (MCQs). A cross-sectional study was conducted where 3 commonly available, commercial LLMs (GPT-3.5, GPT-4, and GPT-4o) were tested for their ability to provide answers to single-answer MCQs (N=150) extracted from the Psychiatry Test Preparation and Review Manual. Each model generated answers to every MCQ 10 times. We evaluated the accuracy and reliability of the answers and sought predictors of answer accuracy. Our primary outcome was the proportion of questions answered correctly by each LLM (accuracy). Secondary measures were (1) response consistency to MCQs across 10 trials (reliability), (2) the correlation between MCQ answer accuracy and response consistency, and (3) the correlation between MCQ answer accuracy and model self-reported confidence. On the first attempt, GPT-3.5 answered 58.0% (87/150) of MCQs correctly, while GPT-4 and GPT-4o answered 84.0% (126/150) and 87.3% (131/150) correctly, respectively. GPT-4 and GPT-4o showed no difference in performance (P=.51), but they significantly outperformed GPT-3.5 (P<.001). GPT-3.5 exhibited less response consistency on average compared to the other models (P<.001). 
MCQ response consistency was positively correlated with MCQ accuracy across all models (r=0.340, 0.682, and 0.590 for GPT-3.5, GPT-4, and GPT-4o, respectively; all P<.001), whereas model self-reported confidence showed no correlation with accuracy in the models, except for GPT-3.5, where self-reported confidence was weakly inversely correlated with accuracy (P<.001). To our knowledge, this is the first comprehensive evaluation of the general psychiatric knowledge encoded in commercially available LLMs and the first study to assess their reliability and identify predictors of response accuracy within medical domains. The findings suggest that GPT-4 and GPT-4o encode accurate and reliable general psychiatric knowledge and that methods, such as repeated prompting, may provide a measure of LLM response confidence. This work supports the potential of LLMs in mental health settings and motivates further research to assess their performance in more open-ended clinical contexts.
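The repeated-prompting idea above (10 answers per MCQ, then relating consistency to accuracy) can be sketched as below. The abstract does not define response consistency precisely; this sketch assumes it is the share of trials that give the most common answer, and the function names and answers are hypothetical.

```python
from collections import Counter
from typing import Sequence


def response_consistency(answers: Sequence[str]) -> float:
    """Share of trials that gave the most frequent answer (assumed definition)."""
    if not answers:
        raise ValueError("at least one answer is required")
    most_common_count = Counter(answers).most_common(1)[0][1]
    return most_common_count / len(answers)


def accuracy_over_trials(answers: Sequence[str], key: str) -> float:
    """Share of trials whose answer matches the answer key."""
    if not answers:
        raise ValueError("at least one answer is required")
    return sum(a == key for a in answers) / len(answers)


# Hypothetical answers to one MCQ across 10 trials; the keyed answer is "B".
trials = ["A"] * 6 + ["B"] * 4
consistency = response_consistency(trials)
accuracy = accuracy_over_trials(trials, "B")
```

Computing both values per question and correlating them across the question set would give a consistency-accuracy relationship of the kind the abstract reports.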
- Research Article
- 10.1016/j.prosdent.2026.03.043
- Apr 15, 2026
- The Journal of prosthetic dentistry
Evaluating the performance of large language models versus prosthodontic residents on the 2024 and 2025 National Prosthodontic Resident Examination.
- Research Article
- 10.1038/s41598-026-48326-4
- Apr 17, 2026
- Scientific reports
Large language models (LLMs) are increasingly incorporated into medical education and clinical learning environments. While prior studies have focused on model accuracy on licensing-style examinations, less attention has been paid to the stability and reproducibility of LLM clinical reasoning under varying input structures, an issue central to safe educational and clinical deployment. This study examined how question delivery structure influences performance stability, inter-model variability, and reproducibility of clinical reasoning across multiple contemporary LLMs using pediatric residency-level multiple-choice questions (MCQs). A standardized, evidence-based prompt was used to generate advanced-level pediatric USMLE Step 2/3-style MCQs emphasizing diagnostic reasoning, management decisions, and ethical judgment. One hundred draft MCQs generated using a standardized LLM prompt were randomly selected and independently reviewed by three pediatric physicians for medical accuracy, clinical realism, subspecialty relevance, and adherence to USMLE formatting. Twenty-three questions were excluded by unanimous consensus, yielding a validated set of 77 MCQs. Six publicly available LLMs (ChatGPT, DeepSeek AI, Gemini, Microsoft Copilot, Perplexity AI, and OpenEvidence; October-December 2025 versions) were evaluated under two delivery conditions: (1) simultaneous presentation of all questions and (2) sequential delivery in batches of ten. Accuracy and inter-model variability were compared using paired t-tests and one-way ANOVA. When all questions were presented simultaneously, model accuracy varied widely (38%-90%), with significant inter-model differences, indicating poor reproducibility. In contrast, batch delivery resulted in marked convergence of performance across models (83%-88%), with no statistically significant inter-model differences. Sequential delivery in batches of ten substantially reduced performance dispersion and instability across all evaluated systems. 
LLM clinical reasoning performance is highly sensitive to input structure. Reducing contextual load through structured batch delivery improves reproducibility and minimizes inter-model variability, independent of model architecture. These findings suggest that prompt structure, rather than model selection alone, is a critical determinant of reliable LLM behavior and should be explicitly considered in the design of AI-supported medical education and assessment systems, particularly when LLMs are used as formative learning tools or clinical reasoning aids. Clinical Trial Number. The protocol was reviewed by the Office of the IRB at Good Samaritan University Hospital and determined to be exempt.
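The batch delivery condition described above amounts to splitting the question set into fixed-size chunks before prompting. A minimal sketch follows; the helper name is hypothetical, and letting the final batch hold the remainder is an assumption, since the abstract does not say how the leftover questions were handled.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def make_batches(items: Sequence[T], batch_size: int = 10) -> List[List[T]]:
    """Split items into consecutive batches; the last batch may be shorter."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    return [list(items[i:i + batch_size])
            for i in range(0, len(items), batch_size)]


# 77 validated MCQs (placeholder IDs) delivered in batches of ten.
question_ids = list(range(1, 78))
batches = make_batches(question_ids, batch_size=10)
```

With 77 questions this yields seven full batches and one batch of seven, so each prompt carries at most ten questions instead of the whole set.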