MORepair: Teaching LLMs to Repair Code via Multi-Objective Fine-Tuning

Abstract

Within the realm of software engineering, specialized tasks on code, such as program repair, present unique challenges, necessitating fine-tuning of large language models (LLMs) to unlock state-of-the-art performance. Fine-tuning approaches proposed in the literature for LLMs on program repair tasks generally overlook the need to reason about the logic behind code changes, beyond syntactic patterns in the data. High-performing fine-tuning experiments also usually come at very high computational costs. With MORepair, we propose a novel perspective on the learning focus of LLM fine-tuning for program repair: we not only adapt the LLM parameters to the syntactic nuances of the task of code transformation (objective ➊), but we also specifically fine-tune the LLM with respect to the logical reason behind the code change in the training data (objective ➋). Such multi-objective fine-tuning will instruct LLMs to generate high-quality patches. We apply MORepair to fine-tune four open-source LLMs with different sizes and architectures. Experimental results on function-level and repository-level repair benchmarks show that the implemented fine-tuning effectively boosts LLM repair performance by 11.4% to 56.0%. We further show that our fine-tuning strategy yields superior performance compared to the state-of-the-art approaches, including standard fine-tuning, Fine-tune-CoT, and RepairLLaMA.
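The two objectives described in the abstract can be pictured as two loss terms that are optimized together. The following is a minimal sketch only, not the paper's implementation: it assumes each objective is a token-level cross-entropy and that the two are combined as a weighted sum. The function names, the weights `w_patch` and `w_rationale`, and the probability values are all hypothetical.

```python
import math

def cross_entropy(token_probs):
    """Mean negative log-likelihood the model assigns to the reference tokens."""
    return -sum(math.log(p) for p in token_probs) / len(token_probs)

def multi_objective_loss(patch_probs, rationale_probs, w_patch=1.0, w_rationale=1.0):
    """Weighted sum of two losses (the weighting scheme is an assumption).

    patch_probs:     model probabilities for the tokens of the fixed code
                     (objective 1, the code transformation).
    rationale_probs: model probabilities for the tokens of the explanation
                     of the change (objective 2, the logical reason).
    """
    loss_patch = cross_entropy(patch_probs)
    loss_rationale = cross_entropy(rationale_probs)
    return w_patch * loss_patch + w_rationale * loss_rationale

# Hypothetical per-token probabilities for one training example.
patch_probs = [0.9, 0.8, 0.95]
rationale_probs = [0.6, 0.7]
loss = multi_objective_loss(patch_probs, rationale_probs)
print(f"combined loss: {loss:.4f}")
```

Setting `w_rationale=0` reduces the sketch to ordinary single-objective fine-tuning on patches, which makes the contribution of the second term easy to isolate.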

Similar Papers
  • Research Article
  • Cite Count: 11
  • 10.1287/ijds.2023.0007
How Can IJDS Authors, Reviewers, and Editors Use (and Misuse) Generative AI?
  • Apr 1, 2023
  • INFORMS Journal on Data Science
  • Galit Shmueli + 7 more


  • Research Article
  • Cite Count: 1
  • 10.1609/aaai.v39i1.32046
Counterexample Guided Program Repair Using Zero-Shot Learning and MaxSAT-based Fault Localization
  • Apr 11, 2025
  • Proceedings of the AAAI Conference on Artificial Intelligence
  • Pedro Orvalho + 2 more

Automated Program Repair (APR) for introductory programming assignments (IPAs) is motivated by the large number of student enrollments in programming courses each year. Since providing feedback on programming assignments requires substantial time and effort from faculty, personalized automated feedback often involves suggesting repairs to students' programs. Symbolic semantic repair approaches, which rely on Formal Methods (FM) to check a program's execution against a test suite or reference solution, are effective but limited. These tools excel at identifying buggy parts but can only fix programs if the correct implementation and the faulty one share the same control flow graph. Conversely, Large Language Models (LLMs) are used for program repair but often make extensive rewrites instead of minimal adjustments. This tends to lead to more invasive fixes, making it harder for students to learn from their mistakes. In summary, LLMs excel at completing strings, while FM-based fault localization excels at identifying buggy parts of a program. In this paper, we propose a novel approach that combines the strengths of both FM-based fault localization and LLMs, via zero-shot learning, to enhance APR for IPAs. Our method uses MaxSAT-based fault localization to identify buggy parts of a program, then presents the LLM with a program sketch devoid of these buggy statements. This hybrid approach follows a Counterexample Guided Inductive Synthesis (CEGIS) loop to iteratively refine the program. We ask the LLM to synthesize the missing parts, which are then checked against a test suite. If the suggested program is incorrect, a counterexample from the test suite is fed back to the LLM for revised synthesis. Our experiments on 1,431 incorrect student programs show that our counterexample guided approach, using MaxSAT-based bug-free program sketches, significantly improves the repair capabilities of all six evaluated LLMs. This method allows LLMs to repair more programs and produce smaller fixes, outperforming other configurations and state-of-the-art symbolic program repair tools.
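The counterexample-guided loop described in this abstract can be sketched roughly as follows. This is an illustrative outline under stated assumptions, not the authors' tool: the MaxSAT-based fault localization step is not modeled, and `stub_synthesizer` is a hypothetical stand-in for the LLM call that simply returns a different candidate each time it receives more feedback.

```python
from typing import Callable, List, Optional, Tuple

TestCase = Tuple[int, int]  # (input, expected output)
Program = Callable[[int], int]

def find_counterexample(program: Program, tests: List[TestCase]) -> Optional[TestCase]:
    """Return the first failing test case, or None if every test passes."""
    for x, expected in tests:
        if program(x) != expected:
            return (x, expected)
    return None

def cegis_repair(synthesize: Callable[[List[TestCase]], Program],
                 tests: List[TestCase],
                 max_rounds: int = 5) -> Optional[Program]:
    """Ask the synthesizer for candidates until one passes the test suite."""
    feedback: List[TestCase] = []
    for _ in range(max_rounds):
        candidate = synthesize(feedback)           # e.g. an LLM filling a program sketch
        counterexample = find_counterexample(candidate, tests)
        if counterexample is None:
            return candidate                       # all tests pass: accept the repair
        feedback.append(counterexample)            # feed the failing test back
    return None                                    # budget exhausted without a repair

def stub_synthesizer(feedback: List[TestCase]) -> Program:
    """Hypothetical LLM stand-in: proposes a new candidate per piece of feedback."""
    candidates: List[Program] = [lambda x: x + 1, lambda x: x * 2, lambda x: x * x]
    return candidates[min(len(feedback), len(candidates) - 1)]

tests = [(2, 4), (3, 9)]  # hypothetical test suite for a "square" assignment
repaired = cegis_repair(stub_synthesizer, tests)
print("repaired" if repaired is not None else "no repair found")
```

In this toy run the first two candidates each fail one test, and the third passes after two counterexamples have been fed back.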

  • Research Article
  • 10.1007/s00330-026-12445-3
Comparison of proprietary and fine-tuned large language models for multi-label classification of billing codes from radiology reports.
  • Mar 14, 2026
  • European Radiology
  • Kamyar Arzideh + 12 more

While large language models (LLMs) have shown promise in medical text analysis, their application in automated medical billing code extraction remains underexplored, particularly for the German medical fee schedule system (GOÄ). Therefore, an LLM was fine-tuned to perform multi-label classification of GOÄ codes from radiology reports automatically, and its performance was compared with state-of-the-art commercial and open-source LLMs. Following ethics committee approval, we analyzed 499,601 radiology reports from 124,497 patients, containing 1,799,971 manually identified GOÄ codes as ground truth. The MediPhi-Instruct 4B model was fine-tuned using five-fold cross-validation. Performance was evaluated on the hold-out test set and compared against GPT-5, GPT-4.1, GPT-oss, Kimi-K2, Deepseek-R1, Deepseek-V3, Gemini 2.5, Llama-70B, and Qwen-3 LLMs on a subset of 500 anonymized and 350 cleaned reports using zero-shot and few-shot prompting techniques. The fine-tuned model achieved an accuracy of 77.15% ± 0.47% and a micro-average F1-score of 87.79% ± 0.31% on the hold-out test set. On a subset of 500 real-world samples, our models outperformed the best-performing LLM, Gemini 2.5 Flash, with an F1-score of 70.32% ± 1.54% compared to 58.22% ± 1.50% (p < 0.001). For the cleaned dataset of 350 samples, GPT-5 achieved the best F1-score of 89.51 ± 1.52% and outperformed the fine-tuned models (p < 0.001). Fine-tuned LLMs can effectively automate GOÄ code classification from radiology reports, with the potential of outperforming commercial LLMs. This approach shows promise for improving billing efficiency and accuracy in healthcare settings, though manual verification is still recommended. Question: LLMs with high parameter counts possess medical knowledge, but how effective are they at predicting billing codes from radiology reports compared to smaller, fine-tuned models? Findings: A fine-tuned ensemble model achieved competitive results and can outperform larger, proprietary LLMs. Clinical relevance: Smaller, fine-tuned models offer an efficient alternative to proprietary LLMs in generating billing codes and can be integrated to assist clinical coding. This technology has the potential to transform clinical billing procedures, but its use should be overseen by qualified professional personnel.

  • Research Article
  • Cite Count: 14
  • 10.1016/j.csi.2024.103951
The Use of Large Language Models for Program Repair
  • Nov 24, 2024
  • Computer Standards & Interfaces
  • Fida Zubair + 2 more

Large Language Models (LLMs) have emerged as a promising approach for automated program repair, offering code comprehension and generation capabilities that can address software bugs. Several program repair models based on LLMs have been developed recently. However, findings and insights from these efforts are scattered across various studies, lacking a systematic overview of LLMs' utilization in program repair. Therefore, this Systematic Literature Review (SLR) was conducted to investigate the current landscape of LLM utilization in program repair. This study defined seven research questions and thoroughly selected 41 relevant studies from scientific databases to explore these questions. The results shed light on the diverse capabilities of LLMs for program repair. The findings revealed that Encoder-Decoder architectures emerged as the prevalent LLM design for program repair tasks and that mostly open-access datasets were used. Several evaluation metrics were applied, primarily consisting of accuracy, exact match, and BLEU scores. Additionally, the review investigated several LLM fine-tuning methods, including fine-tuning on specialized datasets, curriculum learning, iterative approaches, and knowledge-intensified techniques. These findings pave the way for further research on utilizing the full potential of LLMs to revolutionize automated program repair.

  • Research Article
  • Cite Count: 106
  • 10.1038/s41746-024-01024-9
CancerGPT for few shot drug pair synergy prediction using large pretrained language models
  • Feb 19, 2024
  • NPJ Digital Medicine
  • Tianhao Li + 6 more

Large language models (LLMs) have been shown to have significant potential in few-shot learning across various fields, even with minimal training data. However, their ability to generalize to unseen tasks in more complex fields, such as biology and medicine, has yet to be fully evaluated. LLMs can offer a promising alternative approach for biological inference, particularly in cases where structured data and sample size are limited, by extracting prior knowledge from text corpora. Here we report our proposed few-shot learning approach, which uses LLMs to predict the synergy of drug pairs in rare tissues that lack structured data and features. Our experiments, which involved seven rare tissues from different cancer types, demonstrate that the LLM-based prediction model achieves significant accuracy with very few or zero samples. Our proposed model, CancerGPT (with ~124M parameters), is comparable to the larger fine-tuned GPT-3 model (with ~175B parameters). Our research contributes to tackling drug pair synergy prediction in rare tissues with limited data, and also to advancing the use of LLMs for biological and medical inference tasks.

  • Research Article
  • Cite Count: 3
  • 10.1145/3771923
M2CVD: Enhancing Vulnerability Understanding through Multi-Model Collaboration for Code Vulnerability Detection
  • Oct 16, 2025
  • ACM Transactions on Software Engineering and Methodology
  • Ziliang Wang + 6 more

Large Language Models (LLMs) have strong capabilities in code comprehension, but fine-tuning costs and semantic alignment issues limit their project-specific optimization; conversely, fine-tuned models such as CodeBERT are easy to fine-tune, but they often struggle to learn vulnerability semantics from complex code languages. To address these challenges, this paper introduces the Multi-Model Collaborative Vulnerability Detection approach (M2CVD) that leverages the strong capability of analyzing vulnerability semantics from LLMs to improve the detection accuracy of fine-tuned models. M2CVD employs a novel collaborative process: first enhancing the quality of vulnerability descriptions produced by LLMs through the understanding of project code by fine-tuned models, and then using these improved vulnerability descriptions to boost the detection accuracy of fine-tuned models. M2CVD includes three main phases: 1) Initial Vulnerability Detection: The initial vulnerability detection is conducted by fine-tuning a detection model (e.g., CodeBERT) and interacting with an LLM (e.g., ChatGPT) respectively. The vulnerability description is generated by the LLM when the LLM detects the code as vulnerable. 2) Vulnerability Description Refinement: By informing the LLM of the vulnerability assessment results of the detection model, we refine the vulnerability description by interacting with the LLM. Such refinement can enhance the LLM’s vulnerability understanding in specific projects, effectively bridging the previously mentioned alignment gap. 3) Integrated Vulnerability Detection: M2CVD integrates code fragments and the inferred refined vulnerability descriptions to form synthetic data. Then, the synthetic data is used to fine-tune a validation model, optimize the defect feature learning efficiency of the model, and improve the detection accuracy. We demonstrated M2CVD’s effectiveness on two real-world datasets, where M2CVD significantly outperformed the baseline. In addition, we demonstrate that the M2CVD collaborative method can extend to other LLMs and fine-tuned models to improve their accuracy in vulnerability detection tasks.

  • Research Article
  • Cite Count: 1
  • 10.2196/76773
Toward Cross-Hospital Deployment of Natural Language Processing Systems: Model Development and Validation of Fine-Tuned Large Language Models for Disease Name Recognition in Japanese
  • Jul 8, 2025
  • JMIR Medical Informatics
  • Seiji Shimizu + 4 more

Background: Disease name recognition is a fundamental task in clinical natural language processing, enabling the extraction of critical patient information from electronic health records. While recent advances in large language models (LLMs) have shown promise, most evaluations have focused on English, and little is known about their robustness in low-resource languages such as Japanese. In particular, whether these models can perform reliably on previously unseen in-hospital data, which differs from training data in writing styles and clinical contexts, has not been thoroughly investigated. Objective: This study evaluated the robustness of fine-tuned LLMs for disease name recognition in Japanese clinical notes, with a particular focus on their performance on in-hospital data that was not included during training. Methods: We used two corpora for this study: (1) a publicly available set of Japanese case reports, denoted as CR, and (2) a newly constructed corpus of progress notes, denoted as PN, written by ten physicians to capture stylistic variations of in-hospital clinical notes. To reflect real-world deployment scenarios, we first fine-tuned models on CR. Specifically, we compared an LLM and a baseline masked language model (MLM). These models were then evaluated under two conditions: (1) on CR, representing the in-domain (ID) setting with the same document type, similar to training, and (2) on PN, representing the out-of-domain (OOD) setting with a different document type. Robustness was assessed by calculating the performance gap (i.e., the performance drop from in-domain to out-of-domain settings). Results: The LLM demonstrated greater robustness, with a smaller performance gap in F1-scores (ID–OOD = −8.6) compared to the MLM baseline performance (ID–OOD = −13.9). This indicated more stable performance across ID and OOD settings, highlighting the effectiveness of fine-tuned LLMs for reliable use in diverse clinical settings. Conclusions: Fine-tuned LLMs demonstrate superior robustness for disease name recognition in Japanese clinical notes, with a smaller performance gap. These findings highlight the potential of LLMs as reliable tools for clinical natural language processing in low-resource language settings and support their deployment in real-world health care applications, where diversity in documentation is inevitable.

  • Research Article
  • Cite Count: 27
  • 10.1145/3709358
Exploring the Capabilities of LLMs for Code-Change-Related Tasks
  • Jul 1, 2025
  • ACM Transactions on Software Engineering and Methodology
  • Lishui Fan + 5 more

Developers deal with code-change-related tasks daily, e.g., reviewing code. Pre-trained code and code-change-oriented models have been adapted to help developers with such tasks. Recently, large language models (LLMs) have shown their effectiveness in code-related tasks. However, existing LLMs for code focus on general code syntax and semantics rather than the differences between two code versions. Thus, it is an open question how LLMs perform on code-change-related tasks. To answer this question, we conduct an empirical study using LLMs with >1B parameters on three code-change-related tasks, i.e., code review generation, commit message generation, and just-in-time comment update, with in-context learning (ICL) and parameter-efficient fine-tuning (PEFT, including LoRA and prefix-tuning). We observe that the performance of LLMs is poor without examples and generally improves with examples, but more examples do not always lead to better performance. LLMs tuned with LoRA have comparable performance to the state-of-the-art small pre-trained models. Larger models are not always better, but Llama 2 and Code Llama families are always the best. The best LLMs outperform small pre-trained models on the code changes that only modify comments and perform comparably on other code changes. We suggest future work should focus more on guiding LLMs to learn the knowledge specific to the changes related to code rather than comments for code-change-related tasks.

  • Research Article
  • Cite Count: 202
  • 10.1186/s41073-023-00133-5
Fighting reviewer fatigue or amplifying bias? Considerations and recommendations for use of ChatGPT and other large language models in scholarly peer review
  • May 18, 2023
  • Research Integrity and Peer Review
  • Mohammad Hosseini + 1 more

Background: The emergence of systems based on large language models (LLMs) such as OpenAI’s ChatGPT has created a range of discussions in scholarly circles. Since LLMs generate grammatically correct and mostly relevant (yet sometimes outright wrong, irrelevant or biased) outputs in response to provided prompts, using them in various writing tasks including writing peer review reports could result in improved productivity. Given the significance of peer reviews in the existing scholarly publication landscape, exploring challenges and opportunities of using LLMs in peer review seems urgent. After the generation of the first scholarly outputs with LLMs, we anticipate that peer review reports too would be generated with the help of these systems. However, there are currently no guidelines on how these systems should be used in review tasks. Methods: To investigate the potential impact of using LLMs on the peer review process, we used five core themes within discussions about peer review suggested by Tennant and Ross-Hellauer. These include 1) reviewers’ role, 2) editors’ role, 3) functions and quality of peer reviews, 4) reproducibility, and 5) the social and epistemic functions of peer reviews. We provide a small-scale exploration of ChatGPT’s performance regarding identified issues. Results: LLMs have the potential to substantially alter the role of both peer reviewers and editors. Through supporting both actors in efficiently writing constructive reports or decision letters, LLMs can facilitate higher quality review and address issues of review shortage. However, the fundamental opacity of LLMs’ training data, inner workings, data handling, and development processes raises concerns about potential biases, confidentiality and the reproducibility of review reports. Additionally, as editorial work has a prominent function in defining and shaping epistemic communities, as well as negotiating normative frameworks within such communities, partly outsourcing this work to LLMs might have unforeseen consequences for social and epistemic relations within academia. Regarding performance, we identified major enhancements in a short period and expect LLMs to continue developing. Conclusions: We believe that LLMs are likely to have a profound impact on academia and scholarly communication. While potentially beneficial to the scholarly communication system, many uncertainties remain and their use is not without risks. In particular, concerns about the amplification of existing biases and inequalities in access to appropriate infrastructure warrant further attention. For the moment, we recommend that if LLMs are used to write scholarly reviews and decision letters, reviewers and editors should disclose their use and accept full responsibility for data security and confidentiality, and their reports’ accuracy, tone, reasoning and originality.

  • Research Article
  • 10.1016/j.jbi.2026.105034
A study of large language models for patient information extraction: Model architecture, fine-tuning strategy, and multi-task instruction tuning.
  • Mar 1, 2026
  • Journal of Biomedical Informatics
  • Cheng Peng + 5 more


  • Research Article
  • Cite Count: 5
  • 10.1145/3733599
When Fine-Tuning LLMs Meets Data Privacy: An Empirical Study of Federated Learning in LLM-Based Program Repair
  • Feb 13, 2026
  • ACM Transactions on Software Engineering and Methodology
  • Wenqiang Luo + 7 more

Software systems have been evolving rapidly and inevitably introducing bugs at an increasing rate, leading to significant maintenance costs. While large language models (LLMs) have demonstrated remarkable potential in enhancing software development and maintenance practices, particularly in automated program repair (APR), they rely heavily on high-quality code repositories. Most code repositories are proprietary assets that capture the diversity and nuances of real-world industry software practices, which public datasets cannot fully represent. However, obtaining such data from various industries is hindered by data privacy concerns, as companies are reluctant to share their proprietary codebases. There has also been no in-depth investigation of collaborative software development by learning from private and decentralized data while preserving data privacy for program repair. To address the gap, we investigate federated learning as a privacy-preserving method for fine-tuning LLMs on proprietary and decentralized data to boost collaborative software development and maintenance. We use the private industrial dataset TutorCode for fine-tuning and the EvalRepair-Java benchmark for evaluation, and assess whether federated fine-tuning enhances program repair. We then further explore how code heterogeneity (i.e., variations in coding style, complexity, and embedding) and different federated learning algorithms affect bug fixing to provide practical implications for real-world software development collaboration. Our evaluation reveals that federated fine-tuning can significantly enhance program repair, achieving increases of up to 16.67% for Top@10 and 18.44% for Pass@10, even comparable to the bug-fixing capabilities of centralized learning. Moreover, the negligible impact of code heterogeneity implies that industries can effectively collaborate despite diverse data distributions. 
Different federated algorithms also demonstrate unique strengths across LLMs, suggesting that tailoring the optimization process to specific LLM characteristics can further improve program repair.

  • Research Article
  • 10.1200/jco.2024.42.16_suppl.e13630
Performance of three commercially available large language models and one locally fine-tuned model at preparing formal letters to appeal medical insurance denials of radiotherapy services.
  • Jun 1, 2024
  • Journal of Clinical Oncology
  • Kendall Kiser + 4 more

e13630 Background: As many as 60% of prior authorization requests are denied, yet coverage approval occurs for more than 60% of appeals for some therapies. Appeal processes encumber providers and increase burnout, but large language models (LLMs) may aid providers by drafting appeal letters. We evaluated LLM performance at this task for radiotherapy denials. Methods: Three commercially accessible LLMs were evaluated: generative pre-trained transformer 3.5 (GPT3.5), GPT4, and GPT4+web with internet search capacity (OpenAI, Inc., San Francisco, CA). A fourth LLM, GPT3.5-FT, was developed by fine-tuning GPT3.5 in a HIPAA-compliant local environment. The fine-tuning training data comprised 53 insurance denial appeal letters prepared by radiation oncologists and paired prompts describing the clinical history and appeal intent. Training data were enriched in appeal letters for proton radiotherapy, stereotactic body radiotherapy, and image-guided radiotherapy for myriad clinical scenarios. Twenty prompts, each requesting a letter for a simulated patient history, were programmatically presented to the LLMs. Three radiation oncologists, who were blinded to the LLM source, scored letter outputs across four domains: language syntax and semantics, clinical detail inclusion, clinical reasoning validity, and overall readiness for insurer submission. Additionally, one radiation oncologist scored the authenticity and relevance of literature sources cited in output letters, which were requested by several test prompts. Interobserver agreement between radiation oncologist scores was determined by Cohen’s kappa coefficient. Scores were compared between LLMs with non-parametric statistical tests. Results: Agreement between radiation oncologists’ scores was moderate-to-excellent across all domains (median κ = 0.68, minimum κ = 0.41). GPT3.5, GPT4, and GPT4+web drafted letters that, by mode average, were semantically and syntactically clear, included all provided clinical history without confabulation, clinically reasoned with few necessary revisions, and overall were submissible to an insurer with minor revisions. GPT4 and GPT4+web clinically reasoned better than GPT3.5 (p values < 0.001). In contrast, GPT3.5-FT performance was inferior to other LLMs across all domains (p values < 0.001). LLMs were poor at identifying, citing, and summarizing relevant literature unless provided in the prompt. Conclusions: LLMs can draft insurance appeal letters for radiotherapy services that require few revisions yet are poor at referencing relevant literature. Contrary to our hypothesis, fine-tuning with data from our department compromised LLM performance.
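For readers unfamiliar with the agreement statistic mentioned in this abstract, Cohen's kappa for two raters is κ = (p_o − p_e) / (1 − p_e), where p_o is the observed agreement and p_e is the agreement expected by chance. The sketch below computes it for one pair of raters; the scores are hypothetical, and how the study combined its three raters (for instance, pairwise) is not stated in the abstract and is not modeled here.

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters scoring the same items with categorical labels."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must score the same non-empty set of items")
    n = len(rater_a)
    # Observed agreement: fraction of items on which the raters gave the same label.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: product of each rater's marginal label frequencies.
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1 - expected)

# Hypothetical binary scores from two raters on four letters.
scores_a = [1, 1, 0, 0]
scores_b = [1, 0, 0, 0]
kappa = cohens_kappa(scores_a, scores_b)
print(f"kappa = {kappa:.2f}")
```

Here the raters agree on three of four items (p_o = 0.75) while chance agreement is p_e = 0.5, so κ = 0.5.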

  • Research Article
  • Cite Count: 5
  • 10.1145/3639279
DTT: An Example-Driven Tabular Transformer for Joinability by Leveraging Large Language Models
  • Mar 12, 2024
  • Proceedings of the ACM on Management of Data
  • Arash Dargahi Nobari + 1 more

Many organizations rely on data from government and third-party sources, and those sources rarely follow the same data formatting. This introduces challenges in integrating data from multiple sources or aligning external sources with internal databases. Commercial database systems do not offer adequate support for integrating data from heterogeneous sources, and manual integration is both time-consuming and inefficient. State-of-the-art data integration approaches that rely on similarity functions and textual transformations often fail to handle challenging cases where multiple mappings are required, or the mappings go beyond simple textual transformations. In this paper, we study the potential of deep neural models for transforming tables for joinability. In particular, we cast the problem as a prediction task and develop a framework that leverages large deep-learning language models to transform tabular data from a source formatting to a desired target representation. Our framework can efficiently learn the patterns for mapping a source formatting into an expected target using just a few examples, which can then be used for tasks such as table joining, filling in missing values, and error detection. Compared to state-of-the-art mapping and joining approaches, our framework delivers noticeably more accurate and scalable performance on both real-world and synthetic datasets. Our experimental evaluation also shows that the performance of the proposed framework using our fine-tuned model is on par with or better than large language models such as GPT-3, despite the significant difference in size, and that using large language models within our framework improves their performance.

  • Research Article
  • Cite Count: 2
  • 10.1002/spe.70027
Exploring Influence Factors on LLM Suitability for No‐Code Development of End User Applications
  • Oct 16, 2025
  • Software: Practice and Experience
  • Minghe Wang + 4 more

Context/Problem Statement: No‐Code Development Platforms (NCDPs) empower non‐technical end users to build applications tailored to their specific demands without writing code. While NCDPs lower technical barriers, users still require some technical knowledge, for example, to structure process steps or define event‐action rules. Large Language Models (LLMs) offer a promising solution to further reduce technical requirements by supporting natural language interaction and dynamic code generation. By integrating LLMs, NCDPs can be more accessible to non‐technical users, enabling application development truly without requiring any technical expertise. Despite growing interest in LLM‐powered NCDPs, a systematic investigation into the factors influencing LLM suitability and performance remains absent. Understanding these factors is critical to effectively leveraging LLMs' capabilities and maximizing their impact. Objective: In this paper, we aim to investigate key factors influencing the effectiveness of LLMs in supporting end‐user application development within NCDPs. Methods: We conducted comprehensive experiments evaluating the effect of four key factors, i.e., model selection, prompt language, training data background, and an error‐informed few‐shot setup, on the quality of generated applications. Specifically, we selected a range of LLMs based on architecture, scale, design focus, and training data, and evaluated them across four real‐world smart home automation scenarios implemented on a representative open‐source LLM‐powered NCDP. Results: Model selection emerged as the most critical factor influencing performance. General‐purpose LLMs with strong natural language understanding generally outperformed others. Prompt language effects varied by model and task complexity: original prompts worked best for advanced multilingual LLMs, whereas translation steps improved performance for lighter or less capable models. LLMs performed best when their linguistic background aligned with the prompt language. In addition, incorporating an error‐informed few‐shot approach enhanced LLM performance, particularly for coding‐oriented and medium‐performing models, though its benefits were secondary to model choice and required additional engineering effort. Conclusion: Our findings provide practical insights into how LLMs can be effectively integrated into NCDPs, informing both platform design and the selection of suitable LLMs for end‐user application development.

  • Discussion
  • Cite Count: 1
  • 10.1002/pros.24748
Responses to queries concerning "Performance of large language models on benign prostatic hyperplasia frequently asked questions".
  • May 16, 2024
  • The Prostate
  • Yuning Zhang + 1 more

We thank Hinpetch Daungsupawong and Viroj Wiwanitkit for their interest in our work [1]. Our study categorized the responses generated by the three different large language models (LLMs) into four grades based on correctness and comprehensiveness. With this definition, we use the accuracy rate as the main indicator for LLMs' performance. However, as they mentioned, relying solely on the accuracy rate to assess LLMs' performance is limited and incomplete. Other indicators, such as specificity and the depth of responses generated by LLMs, are also important for evaluating their performance. Therefore, in subsequent related studies, we will consider incorporating specificity and depth into the criteria for grading answers or using them as additional indicators for assessing LLMs' performance. We agree with their opinion that any biases or restrictions in the training data used to develop the LLMs could have an impact on the replies' accuracy and dependability; however, because the data sets that the LLMs were trained on are inaccessible, an assessment in this regard is difficult to do at this time. In addition, their comments on the future direction of LLM research, such as evaluating the efficacy of LLMs in answering more complex questions, how to improve the reproducibility of LLMs, and developing standards to make LLM-generated content more ethical and transparent, are all very valuable and worth thinking about. We feel that all researchers interested in the application of LLMs in medicine, not only ourselves, should fully consider their valuable opinions in future research.
