The Use of Large Language Models for Program Repair

Abstract

Large Language Models (LLMs) have emerged as a promising approach for automated program repair, offering code comprehension and generation capabilities that can address software bugs. Several program repair models based on LLMs have been developed recently. However, findings and insights from these efforts are scattered across various studies, lacking a systematic overview of LLMs' utilization in program repair. Therefore, this Systematic Literature Review (SLR) was conducted to investigate the current landscape of LLM utilization in program repair. This study defined seven research questions and thoroughly selected 41 relevant studies from scientific databases to explore these questions. The results shed light on the diverse capabilities of LLMs for program repair. The findings revealed that Encoder-Decoder architectures emerged as the prevalent LLM design for program repair tasks and that mostly open-access datasets were used. Several evaluation metrics were applied, primarily consisting of accuracy, exact match, and BLEU scores. Additionally, the review investigated several LLM fine-tuning methods, including fine-tuning on specialized datasets, curriculum learning, iterative approaches, and knowledge-intensified techniques. These findings pave the way for further research on utilizing the full potential of LLMs to revolutionize automated program repair.
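For readers unfamiliar with the overlap metrics named above, here is a minimal sketch of exact match and a sentence-level BLEU over whitespace tokens. The add-one smoothing and simplified brevity penalty are conveniences for short patches, not the exact definitions used by any particular study in the review.

```python
import math
from collections import Counter

def exact_match(candidate: str, reference: str) -> bool:
    """Whitespace-normalized exact match between a candidate patch and the reference fix."""
    return " ".join(candidate.split()) == " ".join(reference.split())

def bleu(candidate: str, reference: str, max_n: int = 4) -> float:
    """Minimal sentence-level BLEU over whitespace tokens with uniform
    n-gram weights, add-one smoothing, and a brevity penalty."""
    cand, ref = candidate.split(), reference.split()
    if not cand:
        return 0.0
    log_precisions = []
    for n in range(1, max_n + 1):
        cand_ngrams = Counter(tuple(cand[i:i + n]) for i in range(len(cand) - n + 1))
        ref_ngrams = Counter(tuple(ref[i:i + n]) for i in range(len(ref) - n + 1))
        overlap = sum((cand_ngrams & ref_ngrams).values())
        total = max(sum(cand_ngrams.values()), 1)
        # add-one smoothing so one empty n-gram order does not zero the score
        log_precisions.append(math.log((overlap + 1) / (total + 1)))
    brevity = min(0.0, 1.0 - len(ref) / len(cand))
    return math.exp(brevity + sum(log_precisions) / max_n)
```

An identical candidate and reference score 1.0 on both metrics; a one-token change leaves BLEU strictly between 0 and 1.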

Similar Papers
  • Research Article
  • 10.1609/aaai.v39i1.32046
Counterexample Guided Program Repair Using Zero-Shot Learning and MaxSAT-based Fault Localization
  • Apr 11, 2025
  • Proceedings of the AAAI Conference on Artificial Intelligence
  • Pedro Orvalho + 2 more

Automated Program Repair (APR) for introductory programming assignments (IPAs) is motivated by the large number of student enrollments in programming courses each year. Since providing feedback on programming assignments requires substantial time and effort from faculty, personalized automated feedback often involves suggesting repairs to students' programs. Symbolic semantic repair approaches, which rely on Formal Methods (FM) to check a program's execution against a test suite or reference solution, are effective but limited. These tools excel at identifying buggy parts but can only fix programs if the correct implementation and the faulty one share the same control flow graph. Conversely, Large Language Models (LLMs) are used for program repair but often make extensive rewrites instead of minimal adjustments. This tends to lead to more invasive fixes, making it harder for students to learn from their mistakes. In summary, LLMs excel at completing strings, while FM-based fault localization excels at identifying buggy parts of a program. In this paper, we propose a novel approach that combines the strengths of both FM-based fault localization and LLMs, via zero-shot learning, to enhance APR for IPAs. Our method uses MaxSAT-based fault localization to identify buggy parts of a program, then presents the LLM with a program sketch devoid of these buggy statements. This hybrid approach follows a Counterexample Guided Inductive Synthesis (CEGIS) loop to iteratively refine the program. We ask the LLM to synthesize the missing parts, which are then checked against a test suite. If the suggested program is incorrect, a counterexample from the test suite is fed back to the LLM for revised synthesis. Our experiments on 1,431 incorrect student programs show that our counterexample guided approach, using MaxSAT-based bug-free program sketches, significantly improves the repair capabilities of all six evaluated LLMs.
This method allows LLMs to repair more programs and produce smaller fixes, outperforming other configurations and state-of-the-art symbolic program repair tools.
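The CEGIS loop described above can be sketched in a few lines. Here `llm_complete` is a hypothetical stand-in for the LLM call that fills the holes of a bug-free sketch, and tests are modelled as predicates over candidate programs; the paper's actual MaxSAT sketch construction is not shown.

```python
def cegis_repair(sketch, test_suite, llm_complete, max_iters=10):
    """Counterexample-guided repair: ask the LLM to complete a program sketch,
    run the candidate against the test suite, and feed the first failing test
    back as a counterexample until all tests pass or the budget runs out."""
    counterexamples = []
    for _ in range(max_iters):
        candidate = llm_complete(sketch, counterexamples)  # synthesize missing parts
        failing = [test for test in test_suite if not test(candidate)]
        if not failing:
            return candidate  # plausible repair: all tests pass
        counterexamples.append(failing[0])  # refine the next prompt
    return None  # budget exhausted without a repair
```

The loop terminates either with a candidate that passes every test or with `None` after the iteration budget, mirroring the iterative refinement the abstract describes.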

  • Research Article
  • 10.1145/3733599
When Fine-Tuning LLMs Meets Data Privacy: An Empirical Study of Federated Learning in LLM-Based Program Repair
  • May 1, 2025
  • ACM Transactions on Software Engineering and Methodology
  • Wenqiang Luo + 7 more

Software systems have been evolving rapidly and inevitably introducing bugs at an increasing rate, leading to significant maintenance costs. While large language models (LLMs) have demonstrated remarkable potential in enhancing software development and maintenance practices, particularly in automated program repair (APR), they rely heavily on high-quality code repositories. Most code repositories are proprietary assets that capture the diversity and nuances of real-world industry software practices, which public datasets cannot fully represent. However, obtaining such data from various industries is hindered by data privacy concerns, as companies are reluctant to share their proprietary codebases. There has also been no in-depth investigation of collaborative software development by learning from private and decentralized data while preserving data privacy for program repair. To address the gap, we investigate federated learning as a privacy-preserving method for fine-tuning LLMs on proprietary and decentralized data to boost collaborative software development and maintenance. We use the private industrial dataset TutorCode for fine-tuning and the EvalRepair-Java benchmark for evaluation, and assess whether federated fine-tuning enhances program repair. We then further explore how code heterogeneity (i.e., variations in coding style, complexity, and embedding) and different federated learning algorithms affect bug fixing to provide practical implications for real-world software development collaboration. Our evaluation reveals that federated fine-tuning can significantly enhance program repair, achieving increases of up to 16.67% for Top@10 and 18.44% for Pass@10, even comparable to the bug-fixing capabilities of centralized learning. Moreover, the negligible impact of code heterogeneity implies that industries can effectively collaborate despite diverse data distributions. 
Different federated algorithms also demonstrate unique strengths across LLMs, suggesting that tailoring the optimization process to specific LLM characteristics can further improve program repair.
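As a concrete illustration of the federated setup, here is a toy FedAvg aggregation round with models represented as dicts of parameter lists. The weighting by local dataset size is the standard FedAvg rule, not a detail taken from this study, whose models are full LLM weight tensors.

```python
def fed_avg(client_weights, client_sizes):
    """One FedAvg round: average client model parameters weighted by local
    dataset size. Models are dicts mapping parameter names to float lists."""
    total = sum(client_sizes)
    merged = {}
    for name in client_weights[0]:
        merged[name] = [
            sum(w[name][i] * n / total for w, n in zip(client_weights, client_sizes))
            for i in range(len(client_weights[0][name]))
        ]
    return merged  # aggregated model, never exposing raw client data
```

Only parameter updates cross the network, which is what lets proprietary codebases stay private while still contributing to the shared repair model.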

  • Research Article
  • Cited by 5
  • 10.1145/3719345
ContrastRepair: Enhancing Conversation-Based Automated Program Repair via Contrastive Test Case Pairs
  • Mar 4, 2025
  • ACM Transactions on Software Engineering and Methodology
  • Jiaolong Kong + 5 more

Automated Program Repair (APR) aims to automatically generate patches for rectifying software bugs. Recent strides in Large Language Models (LLMs), such as ChatGPT, have yielded encouraging outcomes in APR, especially within the conversation-driven APR framework. Nevertheless, the efficacy of conversation-driven APR is contingent on the quality of the feedback information. In this paper, we propose ContrastRepair, a novel conversation-based APR approach that augments conversation-driven APR by providing LLMs with contrastive test pairs. A test pair consists of a failing test and a passing test, which offer contrastive feedback to the LLM. Our key insight is to minimize the difference between the generated passing test and the given failing test, which can better isolate the root causes of bugs. By providing such informative feedback, ContrastRepair enables the LLM to produce effective bug fixes. The implementation of ContrastRepair is based on the state-of-the-art LLM, ChatGPT, and it iteratively interacts with ChatGPT until plausible patches are generated. We evaluate ContrastRepair on multiple benchmark datasets, including Defects4J, QuixBugs, and HumanEval-Java. The results demonstrate that ContrastRepair significantly outperforms existing methods, achieving a new state-of-the-art in program repair. For instance, across Defects4J 1.2 and 2.0, ContrastRepair correctly repairs 143 of 337 bug cases, while the best-performing baseline fixes 124 bugs.
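The key step, pairing a failing test with the most similar passing test, can be approximated with ordinary sequence similarity. The prompt text below is an illustrative stand-in, not the paper's actual template.

```python
import difflib

def closest_passing_test(failing_test: str, passing_tests: list) -> str:
    """Pick the passing test whose source is most similar to the failing one,
    so the contrastive pair differs as little as possible and better isolates
    the root cause of the bug."""
    return max(passing_tests,
               key=lambda t: difflib.SequenceMatcher(None, failing_test, t).ratio())

def contrastive_prompt(buggy_fn: str, failing_test: str, passing_test: str) -> str:
    """Assemble contrastive feedback for one conversational repair turn
    (hypothetical wording)."""
    return (f"The function below is buggy:\n{buggy_fn}\n"
            f"This test FAILS:\n{failing_test}\n"
            f"This nearly identical test PASSES:\n{passing_test}\n"
            f"Explain the difference and propose a minimal fix.")
```

Because the two tests differ only slightly, the LLM's attention is steered toward the exact input region where behavior diverges.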

  • Research Article
  • 10.54254/2755-2721/2025.22701
Systematically Understanding of Code Semantic Interpretation for LLMs
  • May 15, 2025
  • Applied and Computational Engineering
  • Jia Zhao

Understanding and interpreting code is a crucial task in intelligent software engineering, aiding developers and users in adjusting code for correctness and robustness. The emergence of large language models (LLMs) provides new perspectives for code interpretation tasks. However, current LLM-based code interpretation remains restricted to limited dimensions, lacks a unified evaluation standard, and is missing a comprehensive and systematic assessment methodology. To address this issue, this paper proposes an LLM code understanding evaluation method based on a multi-granularity voting mechanism, aiming to systematically investigate and analyze LLMs' performance in code interpretation tasks. First, we carefully select code snippets from open-source GitHub projects and preprocess them for LLM analysis. Second, we use identical prompts and inputs to test three popular LLMs, recording their output. During this process, we apply prompt engineering techniques to specific target code snippets and conduct repeated experiments to explore the impact of prompt engineering on LLM-generated code explanations. Next, we design evaluation metrics to quantify the LLM outputs and assess their effectiveness based on the obtained scores. Experimental results demonstrate significant differences in code analysis and generation capabilities among the evaluated general-purpose LLMs from different vendors when given identical prompts and inputs. When multiple dimensions are considered in evaluating the generated content, different LLMs exhibit varying strengths in different aspects. Additionally, applying specific prompt engineering techniques can moderate the discrepancies in code analysis and generation capabilities among different LLMs.
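The abstract does not spell out the voting mechanism. One plausible reading, sketched here purely as an assumption, is a per-granularity majority vote over repeated runs and models, with ties broken by the globally most frequent label.

```python
from collections import Counter

def multi_granularity_vote(ratings: dict) -> dict:
    """Hypothetical aggregation: for each granularity (line, function,
    module, ...), take the majority quality label across repeated runs and
    models, breaking ties by the most frequent label overall."""
    overall = Counter(label for labels in ratings.values() for label in labels)
    result = {}
    for granularity, labels in ratings.items():
        counts = Counter(labels)
        best = max(counts.values())
        tied = [label for label, c in counts.items() if c == best]
        result[granularity] = max(tied, key=lambda label: overall[label])
    return result
```

Voting across runs is one way to dampen the run-to-run variability in LLM code explanations that the study reports.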

  • Research Article
  • Cited by 23
  • 10.1145/3660773
A Deep Dive into Large Language Models for Automated Bug Localization and Repair
  • Jul 12, 2024
  • Proceedings of the ACM on Software Engineering
  • Soneya Binta Hossain + 7 more

Large language models (LLMs) have shown impressive effectiveness in various software engineering tasks, including automated program repair (APR). In this study, we take a deep dive into automated bug localization and repair utilizing LLMs. In contrast to many deep learning-based APR methods that assume known bug locations, rely on line-level localization tools, or address bug prediction and fixing in one step, our approach uniquely employs LLMs to predict bug location at the token level and subsequently utilizes them for bug fixing. This methodological separation of bug localization and fixing using different LLMs enables effective integration of diverse contextual information and improved incorporation of inductive biases. We introduce Toggle: Token-Granulated Bug Localization and Repair, a comprehensive program repair framework that integrates a bug localization model, an adjustment model to address tokenizer inconsistencies, and a bug-fixing model. Toggle takes a buggy function as input and generates a complete corrected function. We investigate various styles of prompting to the bug fixing model to identify the most effective prompts that better utilize the inductive bias and significantly outperform others. Toggle achieves the new state-of-the-art (SOTA) performance on the CodeXGLUE code refinement benchmark, and exhibits better or comparable performance on several other widely-used APR datasets, including Defects4J. In the Defects4J benchmark, our approach consistently ranks above other methods, achieving superior results in the Top-10, Top-30, Top-50, and Top-100 metrics. Besides examining Toggle’s generalizability to unseen data and evaluating the effectiveness of various prompts, we also investigate the impact of additional contextual information such as buggy lines and code comments on bug localization, and explore the importance of the adjustment model. Our extensive experiments offer valuable insights and answers to critical research questions.
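The separation of localization and fixing can be sketched as a two-stage pipeline. Here `locate` and `fix` are hypothetical stand-ins for the two models, and the `<bug>` markers are illustrative rather than Toggle's actual adjustment-model interface.

```python
def toggle_repair(buggy_fn_tokens, locate, fix):
    """Two-stage sketch of token-granulated repair: a localization model
    marks a buggy token span, then a separate fixing model rewrites the
    function with the span highlighted, returning a corrected function."""
    start, end = locate(buggy_fn_tokens)  # predicted token-level bug span
    marked = (buggy_fn_tokens[:start] + ["<bug>"] +
              buggy_fn_tokens[start:end] + ["</bug>"] +
              buggy_fn_tokens[end:])
    return fix(marked)  # fixer sees exactly where the bug is
```

Keeping the two models separate is what lets each stage consume different context (e.g. comments for localization, the marked span for fixing).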

  • Research Article
  • 10.54254/2755-2721/2024.18303
Large Language Models Meet Automated Program Repair: Innovations, Challenges and Solutions
  • Dec 19, 2024
  • Applied and Computational Engineering
  • Yiting Tang

As the field of Automated Program Repair (APR) continues to evolve, traditional Neural Program Repair (NPR) methods, while successful in low-resource computing scenarios, still confront numerous challenges, including the demand for extensive training data, the limited generality of specially designed networks, and a lack of robustness. In recent years, Large Language Models (LLMs) have demonstrated remarkable efficacy in downstream code-related tasks, thanks to their potent comprehension and text generation capabilities, gradually emerging as pivotal tools in automated program repair. Compared to NPR techniques, LLM-based APRs exhibit superior repair performance and enhanced generality, leading to their increasing adoption in APR tasks. Currently, the performance of zero-shot LLM-based APRs has surpassed that of NPR. However, LLM-based APRs still face issues such as excessive fine-tuning costs, data leakage concerns, and a shortage of domain-specific knowledge. This paper aims to review and summarize the latest advancements in LLM-based APRs from the perspectives of innovation, challenges, and solutions, providing researchers with profound insights and future directions.

  • Conference Article
  • 10.54941/ahfe1006042
Leveraging LLMs to emulate the design processes of different cognitive styles
  • Jan 1, 2025
  • Xiyuan Zhang + 5 more

Cognitive styles, which shape designers’ thinking, problem-solving, and decision-making, influence strategies and preferences in design tasks. In team collaboration, diverse cognitive styles enhance problem-solving efficiency, foster creativity, and improve team performance. The ‘Co-evolution of problem–solution’ model serves as a key theoretical framework for understanding differences in designers’ cognitive styles. Based on this model, designers can be categorized into two cognitive styles: problem-driven and solution-driven. Problem-driven designers prioritize structuring the problem before developing solutions, while solution-driven designers generate solutions while design problems are still ill-defined, and then work backward to define the problem. Designers with different expertise and disciplinary backgrounds exhibit distinct cognitive style tendencies. Different cognitive styles also adapt differently to design tasks, excelling in some more than others. As a rapidly advancing technology, large language models (LLMs) have shown considerable potential in the field of design. Their powerful generative capabilities position them as potential collaborators in design teams, emulating different cognitive styles. These emulations aim to bridge cognitive differences among team members, enable designers to leverage their individual strengths, and ultimately produce more feasible and high-quality design solutions. However, previous studies have been limited to leveraging LLMs to directly generate design outcomes based on different cognitive styles, neglecting the emulation of the design process itself. In fact, the evolutionary development between problem and solution spaces better reflects the core differences in cognitive styles. Moreover, communication and collaboration within design teams extend beyond simply exchanging solutions, spanning multiple stages of the design process, from problem analysis and idea generation to evaluation.
To better integrate LLMs into design teams, it is necessary to consider the emulation of the design cognition process. To this end, our study, based on the cognitive style taxonomy proposed by Dorst and Cross (2001), explores how LLMs can be used to emulate the design processes of problem-driven and solution-driven designers. We develop a zero-shot chain-of-thought (CoT)-based prompting strategy that enables LLMs to emulate the step-by-step cognitive flow of both design styles. The prompt design is inspired by Jiang et al. (2014) and Chen et al. (2023), who analyzed cognitive differences in the conceptual design process using the FBS ontology model. Furthermore, to evaluate the effectiveness of LLMs in emulating cognitive styles, this study establishes three-dimensional evaluation metrics: static distribution (the proportion and preference of cognitive issues), dynamic transformation (behavioral transition patterns), and the creativity of the design outcomes. Using human design behaviours identified in previous studies as a benchmark, we compare the cognitive styles emulated by LLMs under different design constraints against human performance to assess their alignment and differences. The results show that LLM-generated design processes align well with human cognitive styles and effectively emulate static cognitive characteristics, while enhancing the novelty and integrity of solutions and demonstrating superior creativity compared to baseline methods. However, LLMs lack the fully complex nonlinear transitions between problem and solution spaces observed in human designers. This process-based emulation has the potential to enhance the application of LLMs in design teams, enabling them to not only serve as tools for generating solutions but also provide support for collaboration during key stages of the design process.
Future research should enhance LLMs' reasoning flexibility through fine-tuning or the GoT approach and explore their impact on human-AI collaboration across diverse design tasks to refine their role in design teams.

  • Research Article
  • Cited by 1
  • 10.3390/aerospace12060498
Using Large Language Models for Aerospace Code Generation: Methods, Benchmarks, and Potential Values
  • May 30, 2025
  • Aerospace
  • Rui He + 4 more

In recent years, Large Language Models (LLMs) have witnessed rapid advancements, revolutionizing various domains. Within the realm of software development, code generation technology powered by LLMs has emerged as a prominent research focus. Despite its potential, the application of this technology in the aerospace sector remains in its nascent, exploratory phase. This paper delves into the intricacies of LLM-based code generation methods and explores their potential applications in aerospace contexts. It introduces RepoSpace, the pioneering warehouse-level benchmark test for code generation of spaceborne equipment. Comprising 825 samples from five actual projects, this benchmark offers a more precise evaluation of LLMs’ capabilities in aerospace scenarios. Through extensive evaluations of seven state-of-the-art LLMs on RepoSpace, the study reveals that domain-specific differences significantly impact the code generation performance of LLMs. Existing LLMs exhibit subpar performance in specialized warehouse-level code generation tasks for aerospace, with their performance markedly lower than that of domain tasks. The research further demonstrates that Retrieval Augmented Generation (RAG) technology can effectively enhance LLMs’ code generation capabilities. Additionally, the use of appropriate prompt templates can guide the models to achieve superior results. Moreover, high-quality documentation strings are found to be crucial in improving LLMs’ performance in warehouse-level code generation tasks. This study provides a vital reference for leveraging LLMs for code generation in the aerospace field, thereby fostering technological innovation and progress in this critical domain.

  • Conference Article
  • Cited by 1
  • 10.1109/apsec57359.2022.00020
Systematic Analysis of Defect-Specific Code Abstraction for Neural Program Repair
  • Dec 1, 2022
  • Kicheol Kim + 2 more

Automated program repair (APR) is in the spotlight in academia and industry as a way to reduce the time and cost of maintenance for developers. Recently, APR research has increasingly built on deep-learning models to understand and learn how to fix software bugs. The Text-to-Text Transfer Transformer (T5), which achieved state-of-the-art results on natural language processing benchmarks, has also shown promising results on program repair in recent studies. Deep-learning-based program repair studies commonly propose code abstraction techniques to avoid vocabulary problems and to learn fine-grained code transformations for generating bug-fixing patches. However, there is not enough systematic analysis of code abstraction according to each bug type in deep-learning-based program repair. Therefore, we leverage TFix, a T5-based program repair approach, to evaluate how code abstraction techniques affect neural program repair. Our experimental results showed that defect-specific code abstraction achieves a higher average BLEU score than the existing code abstraction technique in both T5- and multilingual-T5 (mT5)-based TFix results. Also, mT5-based TFix with defect-specific code abstraction achieves a higher BLEU score in 37 of 52 ESLint error types than TFix.
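To make the idea of code abstraction concrete, here is a toy abstractor in the spirit of the technique. The actual TFix pipeline works on ESLint-annotated JavaScript with a curated idiom list; the regex and idiom set below are illustrative assumptions.

```python
import re

# Keywords and common idioms kept literal; everything else becomes a placeholder.
IDIOMS = {"if", "for", "in", "while", "return", "i", "j", "len", "print"}

def abstract_code(source: str) -> str:
    """Replace project-specific identifiers and numeric literals with indexed
    placeholders (VAR_1, NUM_1, ...) to shrink the model's vocabulary while
    preserving the code's structure."""
    mapping = {}
    def rename(match):
        token = match.group(0)
        if token in IDIOMS:
            return token  # keep idiomatic tokens so common patterns survive
        kind = "NUM" if token[0].isdigit() else "VAR"
        if token not in mapping:
            used = sum(1 for v in mapping.values() if v.startswith(kind))
            mapping[token] = f"{kind}_{used + 1}"
        return mapping[token]
    return re.sub(r"[A-Za-z_][A-Za-z_0-9]*|\d+", rename, source)
```

A defect-specific variant would vary `IDIOMS` (and the placeholder scheme) per bug type, which is the knob the study evaluates.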

  • Research Article
  • Cited by 13
  • 10.3390/app14199103
Automating Systematic Literature Reviews with Retrieval-Augmented Generation: A Comprehensive Overview
  • Oct 9, 2024
  • Applied Sciences
  • Binglan Han + 2 more

This study examines Retrieval-Augmented Generation (RAG) in large language models (LLMs) and their significant application for undertaking systematic literature reviews (SLRs). RAG-based LLMs can potentially automate tasks like data extraction, summarization, and trend identification. However, while LLMs are exceptionally proficient in generating human-like text and interpreting complex linguistic nuances, their dependence on static, pre-trained knowledge can result in inaccuracies and hallucinations. RAG mitigates these limitations by integrating LLMs’ generative capabilities with the precision of real-time information retrieval. We review in detail the three key processes of the RAG framework—retrieval, augmentation, and generation. We then discuss applications of RAG-based LLMs to SLR automation and highlight future research topics, including integration of domain-specific LLMs, multimodal data processing and generation, and utilization of multiple retrieval sources. We propose a framework of RAG-based LLMs for automating SLRs, which covers four stages of the SLR process: literature search, literature screening, data extraction, and information synthesis. Future research aims to optimize the interaction between LLM selection, training strategies, RAG techniques, and prompt engineering to implement the proposed framework, with particular emphasis on the retrieval of information from individual scientific papers and the integration of these data to produce outputs addressing various aspects such as current status, existing gaps, and emerging trends.
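The three RAG stages reviewed here can be miniaturized into a retrieval step and an augmentation step (generation is the LLM call itself). The word-overlap ranking below is a toy stand-in for a real dense or TF-IDF retriever.

```python
def retrieve(query: str, corpus: dict, k: int = 2) -> list:
    """Toy retrieval step: rank documents by word overlap with the query."""
    q = set(query.lower().split())
    scored = sorted(corpus.items(),
                    key=lambda kv: len(q & set(kv[1].lower().split())),
                    reverse=True)
    return [doc_id for doc_id, _ in scored[:k]]

def augment(query: str, corpus: dict, doc_ids: list) -> str:
    """Augmentation step: prepend retrieved passages to the prompt so the
    generator answers from fresh evidence rather than parametric memory alone."""
    context = "\n".join(corpus[d] for d in doc_ids)
    return f"Context:\n{context}\n\nQuestion: {query}\nAnswer from the context only."
```

In an SLR pipeline the corpus would be the screened papers, and each stage (search, screening, extraction, synthesis) would issue its own queries against it.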

  • Preprint Article
  • 10.2196/preprints.68320
Knowledge Enhancement of Small-Scale Models in Medical Question Answering (Preprint)
  • Nov 3, 2024
  • Xinbai Li + 3 more

BACKGROUND Medical question answering (QA) is essential for various medical applications. While small-scale pre-training language models (PLMs) are widely adopted in open-domain QA tasks through fine-tuning with related datasets, applying this approach in the medical domain requires significant and rigorous integration of external knowledge. Knowledge-enhanced small-scale PLMs have been proposed to incorporate knowledge bases (KBs) to improve performance, as KBs contain vast amounts of factual knowledge. Large language models (LLMs) contain a vast amount of knowledge and have attracted significant research interest due to their outstanding natural language processing (NLP) capabilities. KBs and LLMs can provide external knowledge to enhance small-scale models in medical QA. OBJECTIVE KBs consist of structured factual knowledge that must be converted into sentences to align with the input format of PLMs. However, these converted sentences often lack semantic coherence, potentially causing them to deviate from the intrinsic knowledge of KBs. LLMs, on the other hand, can generate natural, semantically rich sentences, but they may also produce irrelevant or inaccurate statements. Retrieval-augmented generation (RAG) paradigm enhances LLMs by retrieving relevant information from an external database before responding. By integrating LLMs and KBs using the RAG paradigm, it is possible to generate statements that combine the factual knowledge of KBs with the semantic richness of LLMs, thereby enhancing the performance of small-scale models. In this paper, we explore a RAG fine-tuning method, RAG-mQA, that combines KBs and LLMs to improve small-scale models in medical QA. METHODS In the RAG fine-tuning scenario, we adopt medical KBs as an external database to augment the text generation of LLMs, producing statements that integrate medical domain knowledge with semantic knowledge. 
Specifically, KBs are used to extract medical concepts from the input text, while LLMs are tasked with generating statements based on these extracted concepts. In addition, we introduce two strategies for constructing knowledge: KB-based and LLM-based construction. In the KB-based scenario, we extract medical concepts from the input text using KBs and convert them into sentences by connecting the concepts sequentially. In the LLM-based scenario, we provide the input text to an LLM, which generates relevant statements to answer the question. For downstream QA tasks, the knowledge produced by these three strategies is inserted into the input text to fine-tune a small-scale PLM. F1 and exact match (EM) scores are employed as evaluation metrics for performance comparison. Fine-tuned PLMs without knowledge insertion serve as baselines. Experiments are conducted on two medical QA datasets: emrQA (English) and MedicalQA (Chinese). RESULTS RAG-mQA achieved the best results on both datasets. On the MedicalQA dataset, compared to the KB-based and LLM-based enhancement methods, RAG-mQA improved the F1 score by 0.59% and 2.36%, and the EM score by 2.96% and 11.18%, respectively. On the emrQA dataset, the EM score of RAG-mQA exceeded those of the KB-based and LLM-based methods by 4.65% and 7.01%, respectively. CONCLUSIONS Experimental results demonstrate that the RAG fine-tuning method can improve model performance in medical QA. RAG-mQA achieves greater improvements compared to other knowledge-enhanced methods. CLINICALTRIAL This study does not involve trial registration.
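The F1 score used here alongside exact match is the standard token-overlap measure from extractive QA; a minimal version:

```python
from collections import Counter

def qa_f1(prediction: str, gold: str) -> float:
    """Token-overlap F1 between a predicted answer and the gold answer,
    the usual companion metric to exact match in extractive QA."""
    p, g = prediction.lower().split(), gold.lower().split()
    overlap = sum((Counter(p) & Counter(g)).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(p), overlap / len(g)
    return 2 * precision * recall / (precision + recall)
```

Unlike exact match, it gives partial credit when the prediction captures most but not all of the gold answer span.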

  • Research Article
  • Cited by 8
  • 10.2196/59641
Large Language Models Can Enable Inductive Thematic Analysis of a Social Media Corpus in a Single Prompt: Human Validation Study.
  • Aug 29, 2024
  • JMIR infodemiology
  • Michael S Deiner + 5 more

Manually analyzing public health-related content from social media provides valuable insights into the beliefs, attitudes, and behaviors of individuals, shedding light on trends and patterns that can inform public understanding, policy decisions, targeted interventions, and communication strategies. Unfortunately, the time and effort needed from well-trained human subject matter experts makes extensive manual social media listening unfeasible. Generative large language models (LLMs) can potentially summarize and interpret large amounts of text, but it is unclear to what extent LLMs can glean subtle health-related meanings in large sets of social media posts and reasonably report health-related themes. We aimed to assess the feasibility of using LLMs for topic model selection or inductive thematic analysis of large contents of social media posts by attempting to answer the following question: Can LLMs conduct topic model selection and inductive thematic analysis as effectively as humans did in a prior manual study, or at least reasonably, as judged by subject matter experts? We asked the same research question and used the same set of social media content for both the LLM selection of relevant topics and the LLM analysis of themes as was conducted manually in a published study about vaccine rhetoric. We used the results from that study as background for this LLM experiment by comparing the results from the prior manual human analyses with the analyses from 3 LLMs: GPT4-32K, Claude-instant-100K, and Claude-2-100K. We also assessed if multiple LLMs had equivalent ability and assessed the consistency of repeated analysis from each LLM. The LLMs generally gave high rankings to the topics chosen previously by humans as most relevant. We reject a null hypothesis (P<.001, overall comparison) and conclude that these LLMs are more likely to include the human-rated top 5 content areas in their top rankings than would occur by chance. 
Regarding theme identification, LLMs identified several themes similar to those identified by humans, with very low hallucination rates. Variability occurred between LLMs and between test runs of an individual LLM. Despite not consistently matching the human-generated themes, subject matter experts found themes generated by the LLMs were still reasonable and relevant. LLMs can effectively and efficiently process large social media-based health-related data sets. LLMs can extract themes from such data that human subject matter experts deem reasonable. However, we were unable to show that the LLMs we tested can replicate the depth of analysis from human subject matter experts by consistently extracting the same themes from the same data. There is vast potential, once better validated, for automated LLM-based real-time social listening for common and rare health conditions, informing public health understanding of the public's interests and concerns and determining the public's ideas to address them.

  • Research Article
  • Cited by 9
  • 10.1007/s10515-025-00492-x
A study on prompt design, advantages and limitations of ChatGPT for deep learning program repair
  • Mar 7, 2025
  • Automated Software Engineering
  • Jialun Cao + 3 more

The emergence of large language models (LLMs) such as ChatGPT has revolutionized many fields. In particular, recent advances in LLMs have triggered various studies examining the use of these models for software development tasks, such as program repair, code understanding, and code generation. Prior studies have shown the capability of ChatGPT in repairing conventional programs. However, debugging deep learning (DL) programs poses unique challenges since the decision logic is not directly encoded in the source code. This requires LLMs to not only parse the source code syntactically but also understand the intention of DL programs. Therefore, ChatGPT’s capability in repairing DL programs remains unknown. To fill this gap, our study aims to answer three research questions: (1) Can ChatGPT debug DL programs effectively? (2) How can ChatGPT’s repair performance be improved by prompting? (3) In which way can dialogue help facilitate the repair? Our study analyzes the typical information that is useful for prompt design and suggests enhanced prompt templates that are more efficient for repairing DL programs. On top of them, we summarize the dual perspectives (i.e., advantages and disadvantages) of ChatGPT’s ability, such as its handling of API misuse and recommendation, and its shortcomings in identifying default parameters. Our findings indicate that ChatGPT has the potential to repair DL programs effectively and that prompt engineering and dialogue can further improve its performance by providing more code intention. We also identified the key intentions that can enhance ChatGPT’s program repairing capability.

  • Research Article
  • Cited by 4
  • 10.1609/aaai.v37i13.26879
Exploring Social Biases of Large Language Models in a College Artificial Intelligence Course
  • Jun 26, 2023
  • Proceedings of the AAAI Conference on Artificial Intelligence
  • Skylar Kolisko + 1 more

Large neural network-based language models play an increasingly important role in contemporary AI. Although these models demonstrate sophisticated text generation capabilities, they have also been shown to reproduce harmful social biases contained in their training data. This paper presents a project that guides students through an exploration of social biases in large language models. As a final project for an intermediate college course in Artificial Intelligence, students developed a bias probe task for a previously-unstudied aspect of sociolinguistic or sociocultural bias they were interested in exploring. Through the process of constructing a dataset and evaluation metric to measure bias, students mastered key technical concepts, including how to run contemporary neural networks for natural language processing tasks; construct datasets and evaluation metrics; and analyze experimental results. Students reported their findings in an in-class presentation and a final report, recounting patterns of predictions that surprised, unsettled, and sparked interest in advocating for technology that reflects a more diverse set of backgrounds and experiences. Through this project, students engage with and even contribute to a growing body of scholarly work on social biases in large language models.
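A bias probe of the kind the students built typically compares a model's scores for paired sentences that differ only in a demographic term. The sketch below illustrates one such evaluation metric; the prompt pairs and the scores (stand-ins for the log-probabilities a real masked language model would assign) are fabricated for illustration.

```python
# Paired probe sentences differing only in a gendered pronoun.
PAIRED_PROMPTS = [
    ("The doctor said he was busy.", "The doctor said she was busy."),
    ("The nurse said he was busy.",  "The nurse said she was busy."),
]

# Made-up stand-ins for model log-probabilities; a real probe would
# query a language model for these scores.
FAKE_SCORES = {
    "The doctor said he was busy.": -1.2,
    "The doctor said she was busy.": -2.0,
    "The nurse said he was busy.":  -2.4,
    "The nurse said she was busy.": -1.1,
}

def bias_score(pairs, scores):
    """Mean score gap across pairs; 0 would indicate no systematic preference."""
    gaps = [scores[a] - scores[b] for a, b in pairs]
    return sum(gaps) / len(gaps)

score = bias_score(PAIRED_PROMPTS, FAKE_SCORES)
print(f"mean bias score: {score:+.2f}")
```

The sign and magnitude of the aggregate gap summarize which variant the model systematically prefers, which is the kind of pattern the students reported in their final presentations.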

  • Research Article
  • 10.1093/ndt/gfae069.792
#2924 Comparison of large language models and traditional natural language processing techniques in predicting arteriovenous fistula failure
  • May 23, 2024
  • Nephrology Dialysis Transplantation
  • Suman Lama + 6 more

Background and Aims Large language models (LLMs) have gained significant attention in the field of natural language processing (NLP), marking a shift from traditional techniques like Term Frequency-Inverse Document Frequency (TF-IDF). We developed a traditional NLP model to predict arteriovenous fistula (AVF) failure within the next 30 days using clinical notes. The goal of this analysis was to investigate whether LLMs would outperform traditional NLP techniques, specifically in the context of predicting AVF failure within the next 30 days using clinical notes. Method We defined AVF failure as the change in status from active to permanently unusable status or temporarily unusable status. We used data from a large kidney care network from January 2021 to December 2021. Two models were created: one using an LLM and one using the traditional TF-IDF technique. We used "distilbert-base-uncased", a distilled version of the BERT base model [1], and compared its performance with traditional TF-IDF-based NLP techniques. The dataset was randomly divided into a 60% training, 20% validation, and 20% test dataset. The test data, comprising unseen patients' data, was used to evaluate the performance of the model. Both models were evaluated using metrics such as area under the receiver operating curve (AUROC), accuracy, sensitivity, and specificity. Results The incidence of the 30-day AVF failure rate was 2.3% in the population. Both the LLM and the traditional model showed similar overall performance, as summarized in Table 1. Notably, the LLM showed marginally better performance on certain evaluation metrics. Both models had the same AUROC of 0.64 on test data. The accuracy and balanced accuracy for the LLM were 72.9% and 59.7%, respectively, compared to 70.9% and 59.6% for the traditional TF-IDF approach. In terms of specificity, the LLM scored 73.2%, slightly higher than the 71.2% observed for traditional NLP methods. However, the LLM had a lower sensitivity of 46.1% compared to 48% for traditional NLP.
However, it is worth noting that training the LLM took considerably longer than TF-IDF. Moreover, it also required greater computational resources, such as graphics processing unit (GPU) instances in cloud-based services, leading to higher cost. Conclusion In our study, we discovered that advanced LLMs perform comparably to traditional TF-IDF modeling techniques in predicting the failure of AVF. Both models demonstrated identical AUROC. While specificity was higher for the LLM than for traditional NLP, sensitivity was higher for traditional NLP than for the LLM. The LLM was fine-tuned with a limited dataset, which could have influenced its performance to be similar to that of traditional NLP methods. This finding suggests that while LLMs may excel in certain scenarios, such as performing in-depth sentiment analysis of patient data for complex tasks, their effectiveness is highly dependent on the specific use case. It is crucial to weigh the benefits against the resources required for LLMs, as they can be significantly more resource-intensive and costly compared to traditional TF-IDF methods. This highlights the importance of a use-case-driven approach in selecting the appropriate NLP technique for healthcare applications.
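The metrics the study reports (accuracy, balanced accuracy, sensitivity, specificity) all derive from a binary confusion matrix. A minimal sketch of that computation follows; the example labels are fabricated for illustration and do not reflect the study's data.

```python
def binary_metrics(y_true, y_pred):
    """Compute confusion-matrix metrics for binary labels (1 = positive)."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    sensitivity = tp / (tp + fn)  # recall on failure cases
    specificity = tn / (tn + fp)  # recall on non-failure cases
    accuracy = (tp + tn) / len(y_true)
    balanced_accuracy = (sensitivity + specificity) / 2
    return {"accuracy": accuracy, "balanced_accuracy": balanced_accuracy,
            "sensitivity": sensitivity, "specificity": specificity}

# Illustrative predictions: 1 = AVF failure within 30 days.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 0, 1, 0, 0, 1, 0, 0]
m = binary_metrics(y_true, y_pred)
print(m)
```

Balanced accuracy matters here because the event rate is only 2.3%: with such imbalance, plain accuracy rewards a model that simply predicts "no failure", while balanced accuracy averages the per-class recalls.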
