Understanding Code Changes Practically with Small-Scale Language Models

  • Abstract
  • Similar Papers
Abstract

Recent studies indicate that traditional techniques for understanding code changes are not as effective as techniques that directly prompt language models (LMs). However, current LM-based techniques rely heavily on expensive, large LMs (LLMs) such as GPT-4 and Llama-13b, which are either commercial or prohibitively costly to deploy at scale, restricting their practical applicability. This paper explores the feasibility of deploying small LMs (SLMs) while maintaining comparable or superior performance to LLMs in code change understanding. To achieve this, we created a small yet high-quality dataset called HQCM, which was meticulously reviewed, revised, and validated by five human experts. We fine-tuned state-of-the-art 7b and 220m SLMs on HQCM and compared them with traditional techniques and LLMs with ≥70b parameters. Our evaluation confirmed HQCM's benefits and demonstrated that SLMs, after fine-tuning on HQCM, can achieve superior performance in three change-understanding tasks: change summarization, change classification, and code refinement. This study supports the use of SLMs in environments with security, computational, and financial constraints, such as industry scenarios and edge devices, which distinguishes our work from prior LLM-centric studies.
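
A minimal sketch of the recipe the abstract describes follows, assuming HQCM-style records of diff/summary pairs and using Salesforce/codet5-base as a stand-in 220m SLM; the record format and hyperparameters are illustrative assumptions, not the authors' exact configuration.

```python
from transformers import (AutoModelForSeq2SeqLM, AutoTokenizer,
                          DataCollatorForSeq2Seq, Seq2SeqTrainer,
                          Seq2SeqTrainingArguments)
from datasets import Dataset

tokenizer = AutoTokenizer.from_pretrained("Salesforce/codet5-base")  # ~220m params
model = AutoModelForSeq2SeqLM.from_pretrained("Salesforce/codet5-base")

# Hypothetical HQCM-style records: a code diff paired with an expert-reviewed summary.
records = [{"diff": "-return a+b\n+return a + b",
            "summary": "Fix spacing around the + operator."}]
dataset = Dataset.from_list(records)

def preprocess(example):
    # The diff is the input sequence; the summary becomes the decoder target.
    model_inputs = tokenizer(example["diff"], truncation=True, max_length=512)
    model_inputs["labels"] = tokenizer(example["summary"], truncation=True,
                                       max_length=64)["input_ids"]
    return model_inputs

dataset = dataset.map(preprocess, remove_columns=["diff", "summary"])

trainer = Seq2SeqTrainer(
    model=model,
    args=Seq2SeqTrainingArguments(output_dir="hqcm-codet5", num_train_epochs=3,
                                  per_device_train_batch_size=8),
    train_dataset=dataset,
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),  # pads per batch
)
trainer.train()
```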

Similar Papers
  • Research Article
  • Cited by 43
  • 10.1016/j.infsof.2021.106566
Augmenting commit classification by using fine-grained source code changes and a pre-trained deep neural language model
  • Mar 10, 2021
  • Information and Software Technology
  • Lobna Ghadhab + 3 more

  • Preprint Article
  • 10.2196/preprints.68320
Knowledge Enhancement of Small-Scale Models in Medical Question Answering (Preprint)
  • Nov 3, 2024
  • Xinbai Li + 3 more

BACKGROUND: Medical question answering (QA) is essential for various medical applications. While small-scale pre-trained language models (PLMs) are widely adopted in open-domain QA tasks through fine-tuning with related datasets, applying this approach in the medical domain requires significant and rigorous integration of external knowledge. Knowledge-enhanced small-scale PLMs have been proposed to incorporate knowledge bases (KBs), which contain vast amounts of factual knowledge, to improve performance. Large language models (LLMs) also contain vast amounts of knowledge and have attracted significant research interest due to their outstanding natural language processing (NLP) capabilities. Both KBs and LLMs can therefore provide external knowledge to enhance small-scale models in medical QA.

OBJECTIVE: KBs consist of structured factual knowledge that must be converted into sentences to align with the input format of PLMs. However, these converted sentences often lack semantic coherence, potentially causing them to deviate from the intrinsic knowledge of the KBs. LLMs, on the other hand, can generate natural, semantically rich sentences, but they may also produce irrelevant or inaccurate statements. The retrieval-augmented generation (RAG) paradigm enhances LLMs by retrieving relevant information from an external database before responding. By integrating LLMs and KBs under the RAG paradigm, it is possible to generate statements that combine the factual knowledge of KBs with the semantic richness of LLMs, thereby enhancing the performance of small-scale models. In this paper, we explore a RAG fine-tuning method, RAG-mQA, that combines KBs and LLMs to improve small-scale models in medical QA.

METHODS: In the RAG fine-tuning scenario, we adopt medical KBs as an external database to augment the text generation of LLMs, producing statements that integrate medical domain knowledge with semantic knowledge. Specifically, KBs are used to extract medical concepts from the input text, while LLMs generate statements based on these extracted concepts. In addition, we introduce two comparison strategies for constructing knowledge: KB-based and LLM-based construction. In the KB-based scenario, we extract medical concepts from the input text using KBs and convert them into sentences by connecting the concepts sequentially. In the LLM-based scenario, we provide the input text to an LLM, which generates relevant statements to answer the question. For downstream QA tasks, the knowledge produced by these three strategies is inserted into the input text to fine-tune a small-scale PLM. F1 and exact match (EM) scores are employed as evaluation metrics, with fine-tuned PLMs without knowledge insertion serving as baselines. Experiments are conducted on two medical QA datasets: emrQA (English) and MedicalQA (Chinese).

RESULTS: RAG-mQA achieved the best results on both datasets. On the MedicalQA dataset, compared to the KB-based and LLM-based enhancement methods, RAG-mQA improved the F1 score by 0.59% and 2.36%, and the EM score by 2.96% and 11.18%, respectively. On the emrQA dataset, the EM score of RAG-mQA exceeded those of the KB-based and LLM-based methods by 4.65% and 7.01%, respectively.

CONCLUSIONS: Experimental results demonstrate that the RAG fine-tuning method can improve model performance in medical QA, with RAG-mQA achieving greater improvements than other knowledge-enhanced methods.

CLINICAL TRIAL: This study does not involve trial registration.
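
A minimal sketch of the knowledge-insertion idea follows, with a hypothetical extract_concepts() helper standing in for a real KB lookup and an illustrative OpenAI model name; the paper's actual pipeline and prompts are not specified in the abstract.

```python
from openai import OpenAI

client = OpenAI()

def extract_concepts(question: str) -> list[str]:
    """Hypothetical KB lookup: return medical concepts found in the question."""
    kb_terms = {"metformin", "type 2 diabetes", "hypertension"}
    return [t for t in kb_terms if t in question.lower()]

def build_knowledge_statement(question: str) -> str:
    """Ask an LLM to verbalize the KB concepts as one coherent factual sentence."""
    concepts = extract_concepts(question)
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative choice, not the paper's model
        messages=[{"role": "user",
                   "content": f"Write one factual medical sentence relating: "
                              f"{', '.join(concepts)}"}],
    )
    return resp.choices[0].message.content

question = "Is metformin first-line therapy for type 2 diabetes?"
# The generated statement is prepended to the question; the augmented input
# is then used to fine-tune a small-scale PLM, as in the RAG fine-tuning scenario.
augmented_input = build_knowledge_statement(question) + " " + question
```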

  • Research Article
  • Cited by 4
  • 10.1145/3735129
MORepair: Teaching LLMs to Repair Code via Multi-Objective Fine-Tuning
  • Jan 21, 2026
  • ACM Transactions on Software Engineering and Methodology
  • Boyang Yang + 7 more

Within the realm of software engineering, specialized tasks on code, such as program repair, present unique challenges, necessitating fine-tuning of large language models (LLMs) to unlock state-of-the-art performance. Fine-tuning approaches proposed in the literature for LLMs on program repair tasks generally overlook the need to reason about the logic behind code changes, beyond syntactic patterns in the data. High-performing fine-tuning experiments also usually come at very high computational cost. With MORepair, we propose a novel perspective on the learning focus of LLM fine-tuning for program repair: we not only adapt the LLM parameters to the syntactic nuances of the task of code transformation (objective ➊), but also specifically fine-tune the LLM with respect to the logical reason behind the code change in the training data (objective ➋). Such multi-objective fine-tuning instructs LLMs to generate high-quality patches. We apply MORepair to fine-tune four open-source LLMs with different sizes and architectures. Experimental results on function-level and repository-level repair benchmarks show that the implemented fine-tuning effectively boosts LLM repair performance by 11.4% to 56.0%. We further show that our fine-tuning strategy yields superior performance compared to state-of-the-art approaches, including standard fine-tuning, Fine-tune-CoT, and RepairLLaMA.
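
A minimal sketch of the two-objective idea follows, assuming each training example carries both a target patch and a natural-language rationale; the equal 0.5/0.5 weighting and the full-sequence loss are simplifying assumptions, not MORepair's reported design.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("codellama/CodeLlama-7b-hf")
model = AutoModelForCausalLM.from_pretrained("codellama/CodeLlama-7b-hf")

def multi_objective_loss(buggy: str, patch: str, rationale: str) -> torch.Tensor:
    # Objective 1: adapt to the code transformation itself.
    ids1 = tok(buggy + "\n" + patch, return_tensors="pt").input_ids
    loss_patch = model(ids1, labels=ids1).loss
    # Objective 2: adapt to the logical reason behind the change.
    ids2 = tok(buggy + "\n" + rationale, return_tensors="pt").input_ids
    loss_rationale = model(ids2, labels=ids2).loss
    # Equal weighting is an assumption; in practice the prompt tokens
    # would usually be masked out of the loss as well.
    return 0.5 * loss_patch + 0.5 * loss_rationale
```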

  • Research Article
  • 10.52710/cfs.978
ML-Powered Root Cause Analysis and Automated Remediation for Java Microservices
  • Mar 13, 2026
  • Computer Fraud and Security
  • Tejendra Patel

Detecting performance regressions in production Java microservices is far from adequate without proper attribution and resolution, especially in large, rapidly changing codebases where wide-ranging human involvement does not scale. The problem is acute in modern continuous delivery environments, where multiple commits are bundled into a release and the root cause must be inferred from heterogeneous streams of performance telemetry, version control, and incident history. This article presents an end-to-end system for code change analysis built on multi-modal feature engineering, gradient-boosted tree classification, SHAP-based explanations, and large language model code generation. An ensemble XGBoost model learns the non-linear mapping from code changes to runtime impact, and SHAP values provide theoretically principled, plain-language relevance explanations that give engineers confidence the models are calibrated. A LoRA fine-tuned GPT-4 model then writes production-ready code changes through an AI-orchestrated pull request workflow, with human approval and staged deployment verification remaining mandatory gates. The automation becomes an accelerant to engineering judgment rather than a substitute for it, and the system is continuously retrained on feedback from engineers to accommodate codebase changes.
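
A minimal sketch of the classification-plus-explanation core follows; the feature names and data are fabricated placeholders for the article's multi-modal commit features.

```python
import numpy as np
import xgboost as xgb
import shap

# Toy per-commit features: lines changed, files touched, hot-path edits, test delta.
rng = np.random.default_rng(0)
X = rng.random((200, 4))
y = (X[:, 2] > 0.7).astype(int)  # pretend hot-path edits drive regressions

model = xgb.XGBClassifier(n_estimators=100, max_depth=4)
model.fit(X, y)

# SHAP attributes each prediction to individual features, which is what lets
# such a system emit plain-language explanations engineers can audit.
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X[:5])
print(shap_values.shape)  # (5, 4): one attribution per feature per commit
```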

  • Research Article
  • Cited by 26
  • 10.1145/3709358
Exploring the Capabilities of LLMs for Code-Change-Related Tasks
  • Jul 1, 2025
  • ACM Transactions on Software Engineering and Methodology
  • Lishui Fan + 5 more

Developers deal with code-change-related tasks daily, e.g., reviewing code. Pre-trained code and code-change-oriented models have been adapted to help developers with such tasks. Recently, large language models (LLMs) have shown their effectiveness in code-related tasks. However, existing LLMs for code focus on general code syntax and semantics rather than the differences between two code versions. Thus, it is an open question how LLMs perform on code-change-related tasks. To answer this question, we conduct an empirical study using LLMs with >1B parameters on three code-change-related tasks, i.e., code review generation, commit message generation, and just-in-time comment update, with in-context learning (ICL) and parameter-efficient fine-tuning (PEFT, including LoRA and prefix-tuning). We observe that the performance of LLMs is poor without examples and generally improves with examples, but more examples do not always lead to better performance. LLMs tuned with LoRA have performance comparable to state-of-the-art small pre-trained models. Larger models are not always better, but the Llama 2 and Code Llama families are consistently the best. The best LLMs outperform small pre-trained models on code changes that only modify comments and perform comparably on other code changes. We suggest future work focus more on guiding LLMs to learn the knowledge specific to changes related to code rather than comments for code-change-related tasks.
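
A minimal sketch of the LoRA setup such a study evaluates, via the peft library; the rank, target modules, and base checkpoint are illustrative defaults, not the paper's exact hyperparameters.

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("codellama/CodeLlama-7b-hf")

lora = LoraConfig(r=16, lora_alpha=32, lora_dropout=0.05,
                  target_modules=["q_proj", "v_proj"],  # attention projections
                  task_type="CAUSAL_LM")
model = get_peft_model(model, lora)

# Only the low-rank adapter matrices are trained; the base weights stay frozen.
model.print_trainable_parameters()  # typically <1% of all weights
```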

  • Conference Article
  • Cited by 668
  • 10.1145/3394486.3406703
DeepSpeed
  • Aug 20, 2020
  • Jeff Rasley + 3 more

Explore new techniques in Microsoft's open-source library DeepSpeed, which advances large model training by improving scale, speed, cost, and usability, unlocking the ability to train 100-billion-parameter models. DeepSpeed is compatible with PyTorch. One piece of the library, ZeRO, is a new parallelized optimizer that greatly reduces the resources needed for model and data parallelism while massively increasing the number of parameters that can be trained. Researchers have used these breakthroughs to create Turing Natural Language Generation (Turing-NLG), which at the time of its release was the largest publicly known language model at 17 billion parameters. The tutorial also covers the latest transformer kernel advancements that led the DeepSpeed team to achieve the world's fastest BERT pretraining record. The Zero Redundancy Optimizer (ZeRO) is a novel memory optimization technology for large-scale distributed deep learning. ZeRO can train deep learning models with over 100 billion parameters on the current generation of GPU clusters at three to five times the throughput of the current best system. It also presents a clear path to training models with trillions of parameters, demonstrating an unprecedented leap in deep learning system technology. DeepSpeed brings state-of-the-art training techniques, such as ZeRO, optimized kernels, distributed training, mixed precision, and checkpointing, through lightweight APIs compatible with PyTorch. With just a few lines of code changes to your PyTorch model, you can leverage DeepSpeed to address underlying performance challenges and boost the speed and scale of your training.
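
A minimal sketch of enabling ZeRO through DeepSpeed's lightweight API follows; the config values are illustrative, and in practice the script would be started with the deepspeed launcher so the distributed environment is set up.

```python
import torch.nn as nn
import deepspeed

model = nn.Linear(1024, 1024)  # stand-in for a large transformer

ds_config = {
    "train_batch_size": 32,
    "fp16": {"enabled": True},
    "zero_optimization": {"stage": 2},  # partition optimizer state + gradients
}

# deepspeed.initialize wraps the model in an engine whose .backward() and
# .step() transparently handle the partitioned optimizer state.
model_engine, optimizer, _, _ = deepspeed.initialize(
    model=model, model_parameters=model.parameters(), config=ds_config)
```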

  • Research Article
  • 10.1145/3728944
PatchScope: LLM-Enhanced Fine-Grained Stable Patch Classification for Linux Kernel
  • Jun 22, 2025
  • Proceedings of the ACM on Software Engineering
  • Rongkai Liu + 8 more

Stable patch classification plays a crucial role in vulnerability management for the Linux kernel, significantly contributing to the stability and security of long-term support (LTS) versions. Although existing tools have effectively assisted in assessing whether patches should be merged into stable versions, they cannot determine which stable patches should be merged into which LTS versions; this process still requires the maintainers of each distribution community to screen patches manually against the requirements of their respective versions. To address this issue, we propose PatchScope, which is designed to predict the specific merge status of patches. PatchScope consists of two components: patch analysis and patch classification. Patch analysis leverages large language models (LLMs) to generate detailed patch descriptions from the commit message and code changes, thereby deepening the model's semantic understanding of patches. Patch classification utilizes a pre-trained language model to extract semantic features of the patches and employs a two-stage classifier to predict their merge status. The model is optimized with a dynamic weighted loss function to handle data imbalance and improve overall performance. Given that the primary focus is maintaining Linux kernel versions 5.10 and 6.6, we conducted comparative experiments on these two versions. Experimental results demonstrate that PatchScope can effectively predict the merge status of patches.
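
The abstract does not give the exact form of the dynamic weighted loss; a common stand-in for handling such label imbalance is inverse-frequency class weighting, sketched below.

```python
import torch
import torch.nn as nn

labels = torch.tensor([0, 0, 0, 0, 1])           # merged vs. not merged, imbalanced
counts = torch.bincount(labels).float()
weights = counts.sum() / (len(counts) * counts)  # inverse-frequency class weights

# Rare classes contribute proportionally more to the loss.
criterion = nn.CrossEntropyLoss(weight=weights)
logits = torch.randn(5, 2)                       # stand-in classifier outputs
loss = criterion(logits, labels)
```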

  • Conference Article
  • Cited by 1
  • 10.1145/3524842.3528518
Detecting privacy-sensitive code changes with language modeling
  • May 23, 2022
  • Gökalp Demirci + 4 more

At Meta, we work to incorporate privacy-by-design into all of our products and keep user information secure. We have created an ML model that detects code changes ("diffs") that have privacy-sensitive implications. At our scale of tens of thousands of engineers creating hundreds of thousands of diffs each month, we use automated tools for detecting such diffs. Inspired by recent studies on detecting defects [2, 3, 5] and security vulnerabilities [4, 6, 7], we use techniques from natural language processing to build a deep learning system for detecting privacy-sensitive code.

  • Research Article
  • Cited by 66
  • 10.5194/gmd-5-87-2012
The 1-way on-line coupled atmospheric chemistry model system MECO(n) – Part 1: Description of the limited-area atmospheric chemistry model COSMO/MESSy
  • Jan 19, 2012
  • Geoscientific Model Development
  • A Kerkweg + 1 more

The numerical weather prediction model of the Consortium for Small Scale Modelling (COSMO), maintained by the German weather service (DWD), is connected with the Modular Earth Submodel System (MESSy). This effort is undertaken in preparation of a new, limited-area atmospheric chemistry model. Limited-area models require lateral boundary conditions for all prognostic variables; the quality of a regional chemistry model is therefore expected to improve if boundary conditions for the chemical constituents are provided by the driving model consistently with the meteorological boundary conditions. The newly developed model is as consistent as possible, with respect to atmospheric chemistry and related processes, with a previously developed global atmospheric chemistry general circulation model: the ECHAM/MESSy Atmospheric Chemistry (EMAC) model. The combined system constitutes a new research tool, bridging the global to the meso-γ scale for atmospheric chemistry research. MESSy provides the infrastructure and includes, among others, the process and diagnostic submodels for atmospheric chemistry simulations. Furthermore, MESSy is highly flexible, allowing model setups with tailor-made complexity depending on the scientific question. Here, the connection of the MESSy infrastructure to the COSMO model is documented, along with the code changes required for the generalisation of regular MESSy submodels. Moreover, previously published prototype submodels for simplified tracer studies are generalised to be plugged in and used in both the global and the limited-area model. They are used to evaluate the TRACER interface implementation in the new COSMO/MESSy model system and the tracer transport characteristics, an important prerequisite for future atmospheric chemistry applications. A supplementary document provides further details on the technical implementation of the MESSy interface into COSMO, with a complete list of modifications to the COSMO code.

  • Research Article
  • Cited by 1
  • 10.5281/zenodo.4266643
Replication Package of Augmenting Commit Classification by using Fine-Grained Source Code Changes and a Pre-trained Deep Neural Language Model
  • May 7, 2020
  • Zenodo (CERN European Organization for Nuclear Research)
  • Lobna Ghadhab

  • Research Article
  • 10.51244/ijrsi.2025.120700027
AI-Driven Developer Ecosystem
  • Jan 1, 2025
  • International Journal of Research and Scientific Innovation
  • Prof Swathi Srikanth + 4 more

The advent of Large Language Models (LLMs), including tools like GitHub Copilot and OpenAI Codex, has brought substantial changes to the field of software engineering. These technologies support developers through features such as automated code generation, smart code suggestions, and productivity enhancements. Despite these advancements, the development workflow is still scattered across multiple standalone tools used for coding, testing, documentation, and team communication. This lack of integration disrupts the development flow and negatively impacts overall team efficiency. To address these challenges, this paper proposes the AI-Driven Developer Ecosystem (AIDE)—a comprehensive development framework that harnesses the capabilities of LLMs while addressing gaps in tool interoperability and contextual awareness. AIDE functions as a unified, intelligent development environment that offers AI-assisted coding, predictive insights for continuous integration and deployment (CI/CD), automated issue classification, adaptive system architecture analysis, and harmonized documentation tools. AIDE sets itself apart from conventional development environments by providing continuous, context-aware support. It does this by analyzing real-time code changes, historical data, and team collaboration behavior. The platform also integrates collaborative tools such as Excalidraw for visual planning and embedded communication features for real-time coordination, promoting a deeply collaborative development experience that extends beyond code writing. By drawing from current academic research and industry practices, this paper illustrates how AIDE effectively addresses critical issues in intelligent software development, resulting in better code quality, minimized downtime, and increased developer satisfaction.

  • Research Article
  • 10.54364/aaiml.2024.44171
Enhancing Commit Message Categorization in Open-Source Repositories Using Structured Taxonomy and Large Language Models
  • Jan 1, 2024
  • Advances in Artificial Intelligence and Machine Learning
  • Muna Al-Razgan + 3 more

Version Control Systems (VCS) manage source code changes by storing modifications in a database. A key feature of VCS is the commit function, which saves the project's current state and summarizes changes through a Commit Message (CM). These messages are vital for collaboration, particularly in open-source artificial intelligence (AI) projects, where contributors work on rapidly evolving codebases. This paper presents an empirical analysis of CMs within open-source AI repositories on GitHub, focusing on their content, the effectiveness of categorization by Large Language Models (LLMs), and the impact of message quality on categorization accuracy. A sample of 384 CMs from 34 repositories was manually categorized to establish a taxonomy. Python was then used for automated keyword extraction, refined with regex patterns. An additional experiment assessed the performance of ChatGPT-4 in categorizing CMs, first without guidance and later using our developed taxonomy. Our findings indicate that the quality of CMs varies greatly, with a clear impact on how efficiently they can be categorized. This study contributes to the field by providing a structured taxonomy of CMs and exploring how tools like ChatGPT-4 can be used to analyze them. The insights from this research are intended to benefit both academic studies and real-world software development, particularly by helping teams better understand and automate the handling of CMs in AI projects.
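
A minimal sketch of regex-based keyword extraction against a taxonomy follows; the labels and patterns are illustrative, not the paper's full 384-message taxonomy.

```python
import re

# Hypothetical taxonomy: each category is matched by a keyword pattern.
TAXONOMY = {
    "bug_fix":  re.compile(r"\b(fix(es|ed)?|bug|patch)\b", re.I),
    "feature":  re.compile(r"\b(add(s|ed)?|implement(s|ed)?|introduce)\b", re.I),
    "refactor": re.compile(r"\b(refactor|clean\s*up|rename)\b", re.I),
}

def categorize(commit_message: str) -> list[str]:
    """Return every taxonomy label whose pattern appears in the message."""
    return [label for label, pat in TAXONOMY.items() if pat.search(commit_message)]

print(categorize("Fix crash when model config is missing"))  # ['bug_fix']
```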

  • Research Article
  • Cited by 1
  • 10.1145/3728961
VerLog: Enhancing Release Note Generation for Android Apps using Large Language Models
  • Jun 22, 2025
  • Proceedings of the ACM on Software Engineering
  • Jiawei Guo + 2 more

Release notes are essential documents that communicate the details of software updates to users and developers, yet their generation remains a time-consuming and error-prone process. In this paper, we present VerLog, a novel technique that enhances the generation of software release notes using Large Language Models (LLMs). VerLog leverages few-shot in-context learning with adaptive prompting to facilitate the graph reasoning capabilities of LLMs, enabling them to accurately interpret and document the semantic information of code changes. Additionally, VerLog incorporates multi-granularity information, including fine-grained code modifications and high-level non-code artifacts, to guide the generation process and ensure comprehensive, accurate, and readable release notes. We applied VerLog to the 42 releases of 248 unique Android applications and conducted extensive evaluations. Our results demonstrate that VerLog significantly (up to 18%–21% higher precision, recall, and F1) outperforms state-of-the-art baselines in terms of completeness, accuracy, readability, and overall quality of the generated release notes, in both controlled experiments with high-quality reference release notes and in-the-wild evaluations.
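
A minimal sketch of few-shot prompt assembly for release notes follows; the example pairs and template are hypothetical and much simpler than VerLog's adaptive, multi-granularity prompting.

```python
# Hypothetical few-shot exemplars: a summarized change paired with a note.
FEW_SHOT = [
    ("bump targetSdkVersion 33 -> 34", "Updated for Android 14 compatibility."),
    ("add biometric login flow", "Added fingerprint and face unlock."),
]

def build_prompt(code_changes: str) -> str:
    """Assemble exemplars plus the new changes into an in-context prompt."""
    shots = "\n\n".join(f"Changes:\n{d}\nRelease note:\n{n}" for d, n in FEW_SHOT)
    return f"{shots}\n\nChanges:\n{code_changes}\nRelease note:\n"

print(build_prompt("fix NPE in upload retry handler"))
```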

  • Research Article
  • 10.33693/2313-223x-2025-12-3-67-79
Analysis of software code preprocessing methods to improve the effectiveness of using large language models in vulnerability detection tasks
  • Nov 2, 2025
  • Computational nanotechnology
  • Valery V Charugin + 3 more

As software systems grow in scale and complexity, the need for intelligent methods of vulnerability detection increases. One such method involves the use of large language models trained on source code, which are capable of analyzing and classifying vulnerable code segments at early stages of development. The effectiveness of these models depends on how the code is represented and how the input data is prepared; preprocessing methods can significantly impact the accuracy and robustness of the model. The purpose of the study is to analyze the impact of various code preprocessing methods on the accuracy and robustness of large language models (CodeBERT, GraphCodeBERT, UniXcoder) in vulnerability detection tasks. The analysis is conducted using source code changes extracted from commits associated with vulnerabilities documented in the CVE database. The research methodology is an experimental analysis evaluating the effectiveness and robustness of CodeBERT, GraphCodeBERT, and UniXcoder in the task of vulnerability classification, with performance assessed using accuracy and F1-score metrics. The research results are estimates of the effectiveness of different code preprocessing methods when applying large language models to vulnerability classification tasks.
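
A minimal sketch of one plausible preprocessing step, stripping comments and normalizing whitespace before tokenization, follows; the study's actual set of preprocessing methods is not detailed in the abstract.

```python
import re

def preprocess_code(code: str) -> str:
    """Remove C-style comments and collapse whitespace before model input."""
    code = re.sub(r"/\*.*?\*/", " ", code, flags=re.S)  # block comments
    code = re.sub(r"//[^\n]*", " ", code)               # line comments
    return re.sub(r"\s+", " ", code).strip()            # collapse whitespace

snippet = "int f(int n) { /* TODO */ return n * 2; // double\n}"
print(preprocess_code(snippet))  # int f(int n) { return n * 2; }
```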

  • Research Article
  • Cited by 54
  • 10.1109/tse.2022.3201209
Flakify: A Black-Box, Language Model-Based Predictor for Flaky Tests
  • Apr 1, 2023
  • IEEE Transactions on Software Engineering
  • Sakina Fatima + 2 more

Software testing assures that code changes do not adversely affect existing functionality. However, a test case can be flaky, i.e., passing and failing across executions, even for the same version of the source code. Flaky test cases introduce overhead to software development as they can lead to unnecessary attempts to debug production or testing code. Besides rerunning test cases multiple times, which is time-consuming and computationally expensive, flaky test cases can be predicted using machine learning (ML) models, thus reducing the wasted cost of re-running and debugging these test cases. However, the state-of-the-art ML-based flaky test case predictors rely on pre-defined sets of features that are either project-specific, i.e., inapplicable to other projects, or require access to production code, which is not always available to software test engineers. Moreover, given the non-deterministic behavior of flaky test cases, it can be challenging to determine a complete set of features that could potentially be associated with test flakiness. Therefore, in this article, we propose Flakify, a black-box, language model-based predictor for flaky test cases. Flakify relies exclusively on the source code of test cases, thus requiring neither (a) access to production code (black-box), nor (b) rerunning test cases, nor (c) pre-defined features. To this end, we employed CodeBERT, a pre-trained language model, and fine-tuned it to predict flaky test cases using the source code of test cases. We evaluated Flakify on two publicly available datasets (FlakeFlagger and IDoFT) for flaky test cases and compared our technique with the FlakeFlagger approach, the best state-of-the-art ML-based, white-box predictor for flaky test cases, using two different evaluation procedures: (1) cross-validation and (2) per-project validation, i.e., prediction on new projects. Flakify achieved F1-scores of 79% and 73% on the FlakeFlagger dataset using cross-validation and per-project validation, respectively. Similarly, Flakify achieved F1-scores of 98% and 89% on the IDoFT dataset using the two validation procedures, respectively. Further, Flakify surpassed FlakeFlagger by 10 and 18 percentage points (pp) in terms of precision and recall, respectively, when evaluated on the FlakeFlagger dataset, thus reducing the cost wasted on unnecessarily debugging test cases and production code by the same percentages (corresponding to reduction rates of 25% and 64%). Flakify also achieved significantly higher prediction results when used to predict test cases on new projects, suggesting better generalizability over FlakeFlagger. Our results further show that a black-box version of FlakeFlagger is not a viable option for predicting flaky test cases.
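
A minimal sketch of the black-box setup follows: a sequence classifier over raw test source code, starting from the public microsoft/codebert-base checkpoint (not the authors' fine-tuned weights), whose classification head must first be fine-tuned on labeled flaky/non-flaky tests.

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

tok = AutoTokenizer.from_pretrained("microsoft/codebert-base")
model = AutoModelForSequenceClassification.from_pretrained(
    "microsoft/codebert-base", num_labels=2)  # 0 = stable, 1 = flaky

# Only the test's own source code is needed: no production code, no reruns.
test_source = "def test_fetch():\n    assert fetch(url).status == 200"
inputs = tok(test_source, truncation=True, max_length=512, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits
print(logits.softmax(-1))  # head is untrained here; fine-tune on labels first
```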
