Detecting privacy-sensitive code changes with language modeling
At Meta, we work to incorporate privacy-by-design into all of our products and keep user information secure. We have created an ML model that detects code changes ("diffs") that have privacy-sensitive implications. At our scale of tens of thousands of engineers creating hundreds of thousands of diffs each month, we use automated tools for detecting such diffs. Inspired by recent studies on detecting defects [2, 3, 5] and security vulnerabilities [4, 6, 7], we use techniques from natural language processing to build a deep learning system for detecting privacy-sensitive code.
- Conference Article
4
- 10.1145/3691620.3694999
- Oct 27, 2024
Recent studies indicate that traditional techniques for understanding code changes are not as effective as techniques that directly prompt language models (LMs). However, current LM-based techniques heavily rely on expensive, large LMs (LLMs) such as GPT-4 and Llama-13b, which are either commercial or prohibitively costly to deploy on a wide scale, thereby restricting their practical applicability. This paper explores the feasibility of deploying small LMs (SLMs) while maintaining comparable or superior performance to LLMs in code change understanding. To achieve this, we created a small yet high-quality dataset called HQCM which was meticulously reviewed, revised, and validated by five human experts. We fine-tuned state-of-the-art 7b and 220m SLMs using HQCM and compared them with traditional techniques and LLMs with ≥70b parameters. Our evaluation confirmed HQCM's benefits and demonstrated that SLMs, after finetuning by HQCM, can achieve superior performance in three change understanding tasks: change summarization, change classification, and code refinement. This study supports the use of SLMs in environments with security, computational, and financial constraints, such as in industry scenarios and on edge devices, distinguishing our work from the others.
- Research Article
43
- 10.1016/j.infsof.2021.106566
- Mar 10, 2021
- Information and Software Technology
Augmenting commit classification by using fine-grained source code changes and a pre-trained deep neural language model
- Research Article
- 10.1145/3728944
- Jun 22, 2025
- Proceedings of the ACM on Software Engineering
Stable patch classification plays a crucial role in vulnerability management for the Linux kernel, significantly contributing to the stability and security of long-term support (LTS) versions. Although existing tools have effectively assisted in assessing whether patches should be merged into stable versions, they cannot determine which stable patches should be merged into which LTS versions. This process still requires the maintainers of the distribution community to manually screen based on the requirements of their respective versions. To address this issue, we propose PatchScope, which is designed to predict the specific merge status of patches. PatchScope consists of two components: patch analysis and patch classification. Patch analysis leverages large language models (LLMs) to generate detailed patch descriptions from the commit message and code changes, thereby deepening the model's semantic understanding of patches. Patch classification utilizes a pre-trained language model to extract semantic features of the patches and employs a two-stage classifier to predict the merge status of the patches. The model is optimized using a dynamic weighted loss function to handle data imbalance and improve overall performance. Given that the primary focus is maintaining Linux kernel versions 5.10 and 6.6, we have conducted comparative experiments based on these two versions. Experimental results demonstrate that PatchScope can effectively predict the merge status of patches.
- Conference Article
688
- 10.1145/3394486.3406703
- Aug 20, 2020
Explore new techniques in Microsoft's open source library called DeepSpeed, which advances large model training by improving scale, speed, cost, and usability, unlocking the ability to train 100-billion-parameter models. DeepSpeed is compatible with PyTorch. One piece of our library, called ZeRO, is a new parallelized optimizer that greatly reduces the resources needed for model and data parallelism while massively increasing the number of parameters that can be trained. Researchers have used these breakthroughs to create Turing Natural Language Generation (Turing-NLG), which at the time of its release was the largest publicly known language model at 17 billion parameters. In addition, we will go over our latest transformer kernel advancements that led the DeepSpeed team to achieve the world's fastest BERT pretraining record. The Zero Redundancy Optimizer (ZeRO) is a novel memory optimization technology for large-scale distributed deep learning. ZeRO can train deep learning models with over 100 billion parameters on the current generation of GPU clusters at three to five times the throughput of the current best system. It also presents a clear path to training models with trillions of parameters, demonstrating an unprecedented leap in deep learning system technology. DeepSpeed brings state-of-the-art training techniques, such as ZeRO, optimized kernels, distributed training, mixed precision, and checkpointing, through lightweight APIs compatible with PyTorch. With just a few lines of code changes to your PyTorch model, you can leverage DeepSpeed to address underlying performance challenges and boost the speed and scale of your training.
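The memory reduction described above can be illustrated with the per-GPU model-state accounting used in the ZeRO paper. This is a rough sketch that assumes mixed-precision Adam training (2 bytes per parameter for fp16 weights, 2 for fp16 gradients, 12 for optimizer states) and ignores activations and temporary buffers. The function name and the 7.5B-parameter, 64-GPU example are illustrative choices, not part of the DeepSpeed API.

```python
def zero_model_state_gb(params: float, gpus: int, stage: int, k: int = 12) -> float:
    """Approximate per-GPU memory for model states, in GB (1 GB = 1e9 bytes).

    stage 0: plain data parallelism, everything replicated on every GPU
    stage 1: optimizer states partitioned across GPUs
    stage 2: optimizer states and gradients partitioned
    stage 3: optimizer states, gradients, and parameters partitioned
    """
    if stage not in (0, 1, 2, 3):
        raise ValueError("stage must be 0, 1, 2, or 3")
    if gpus < 1:
        raise ValueError("gpus must be at least 1")

    param_bytes = 2.0 * params   # fp16 parameters
    grad_bytes = 2.0 * params    # fp16 gradients
    optim_bytes = k * params     # assumed: fp32 weights, momentum, variance (Adam)

    if stage >= 1:
        optim_bytes /= gpus
    if stage >= 2:
        grad_bytes /= gpus
    if stage >= 3:
        param_bytes /= gpus

    return (param_bytes + grad_bytes + optim_bytes) / 1e9


if __name__ == "__main__":
    # Hypothetical setup: 7.5B parameters on 64 GPUs.
    for stage in range(4):
        gb = zero_model_state_gb(7.5e9, 64, stage)
        print(f"stage {stage}: {gb:.1f} GB per GPU")
    # stage 0: 120.0, stage 1: 31.4, stage 2: 16.6, stage 3: 1.9
```

Under these assumptions, only the fully partitioned setting makes the per-GPU cost shrink in proportion to the number of GPUs, which is why the approach is presented as a path toward much larger models.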
- Conference Article
- 10.1109/ase63991.2025.00245
- Nov 16, 2025
Code review is critical for ensuring software quality and maintainability. With the rapid growth in software scale and complexity, code review has become a bottleneck in the development process because of its time-consuming and knowledge-intensive nature and the shortage of experienced developers willing to review code. Several approaches have been proposed for automatically generating code reviews based on retrieval, neural machine translation, pre-trained models, or large language models (LLMs). These approaches mainly leverage historical code changes and review comments. However, a large amount of crucial information for code review, such as the context of code changes and prior review knowledge, has been overlooked. This paper proposes an LLM-based review knowledge-augmented, context-aware framework for code review generation, named LAURA. The framework integrates review exemplar retrieval, context augmentation, and systematic guidance to enhance the performance of ChatGPT-4o and DeepSeek v3 in generating code review comments. Besides, given the extensive low-quality reviews in existing datasets, we also constructed a high-quality dataset. Experimental results show that for both models, LAURA generates review comments that are either completely correct or at least helpful to developers in 42.2% and 40.4% of cases, respectively, significantly outperforming SOTA baselines. Furthermore, our ablation studies demonstrate that all components of LAURA contribute positively to improving comment quality.
- Research Article
- 10.52710/cfs.978
- Mar 13, 2026
- Computer Fraud and Security
Detecting performance regressions in production Java microservices is far from adequate without proper attribution and resolution, especially in large, rapidly changing codebases without wide-ranging human involvement. This is challenging in modern continuous delivery environments, where multiple commits are bundled into a release and the root cause must be inferred from heterogeneous streams of performance telemetry, version control, and incident history. This article builds an end-to-end system for code change analysis that combines multi-modal feature engineering, gradient-boosted tree classification, SHAP-based explanations, and large language model code generation. An ensemble XGBoost model learns the non-linear mapping from code change to runtime impact. SHAP values provide theoretically principled, plain-language relevance explanations that help engineers trust that the models are calibrated. A LoRA fine-tuned GPT-4 model then writes production-ready code changes through an AI-orchestrated pull request workflow, with human approval and staged deployment verification remaining mandatory as gates. The automation becomes an accelerant to engineering judgment rather than a substitute for it. The system is continuously retrained based on feedback from engineers to accommodate codebase changes.
- Research Article
7
- 10.1109/jiot.2025.3531512
- Aug 15, 2025
- IEEE Internet of Things Journal
With the rapid advancement of flexible manufacturing in the Industrial Internet of Things (IIoT), there has been a significant increase in the number of IIoT devices and application software aimed at meeting various needs. Software defects may lead to delays or crashes in flexible manufacturing systems, thereby affecting the production schedule. Automated software defect localization based on code changes can significantly reduce development and maintenance time costs, thereby maintaining the competitive edge of flexible manufacturing in the IIoT. Current efforts in software defect localization are primarily based on deep learning models or information retrieval models. This article investigates the performance of large language models (LLMs) in software defect localization and optimizes localization accuracy by combining them with an information retrieval model. Our empirical study reveals that GPT, given a software defect description, is unable to determine whether specific code changes are relevant. The model is unable to provide accurate answers, which aligns with the generative nature of LLMs, where responses are generated according to probability distributions. However, the combined framework of LLMs and information retrieval models proposed in this article outperforms the current state-of-the-art models on public datasets. We conclude that LLMs can enhance localization performance when used as side information in conjunction with existing information retrieval models. The effectiveness of the framework has been validated through experiments conducted on publicly available datasets and in practical applications within IIoT projects. This offers valuable insights into the application and development of LLMs for defect localization in the software development and maintenance processes of IIoT flexible manufacturing.
- Research Article
4
- 10.1145/3735129
- Jan 21, 2026
- ACM Transactions on Software Engineering and Methodology
Within the realm of software engineering, specialized tasks on code, such as program repair, present unique challenges, necessitating fine-tuning large language models (LLMs) to unlock state-of-the-art performance. Fine-tuning approaches proposed in the literature for LLMs on program repair tasks generally overlook the need to reason about the logic behind code changes, beyond syntactic patterns in the data. High-performing fine-tuning experiments also usually come at very high computational costs. With MORepair, we propose a novel perspective on the learning focus of LLM fine-tuning for program repair: we not only adapt the LLM parameters to the syntactic nuances of the task of code transformation (objective ➊), but we also specifically fine-tune the LLM with respect to the logical reason behind the code change in the training data (objective ➋). Such a multi-objective fine-tuning will instruct LLMs to generate high-quality patches. We apply MORepair to fine-tune four open-source LLMs with different sizes and architectures. Experimental results on function-level and repository-level repair benchmarks show that the implemented fine-tuning effectively boosts LLM repair performance by 11.4% to 56.0%. We further show that our fine-tuning strategy yields superior performance compared to the state-of-the-art approaches, including standard fine-tuning, Fine-tune-CoT, and RepairLLaMA.
- Research Article
27
- 10.1145/3709358
- Jul 1, 2025
- ACM Transactions on Software Engineering and Methodology
Developers deal with code-change-related tasks daily, e.g., reviewing code. Pre-trained code and code-change-oriented models have been adapted to help developers with such tasks. Recently, large language models (LLMs) have shown their effectiveness in code-related tasks. However, existing LLMs for code focus on general code syntax and semantics rather than the differences between two code versions. Thus, it is an open question how LLMs perform on code-change-related tasks. To answer this question, we conduct an empirical study using LLMs with more than 1B parameters on three code-change-related tasks, i.e., code review generation, commit message generation, and just-in-time comment update, with in-context learning (ICL) and parameter-efficient fine-tuning (PEFT, including LoRA and prefix-tuning). We observe that the performance of LLMs is poor without examples and generally improves with examples, but more examples do not always lead to better performance. LLMs tuned with LoRA have comparable performance to the state-of-the-art small pre-trained models. Larger models are not always better, but Llama 2 and Code Llama families are always the best. The best LLMs outperform small pre-trained models on the code changes that only modify comments and perform comparably on other code changes. We suggest future work should focus more on guiding LLMs to learn the knowledge specific to the changes related to code rather than comments for code-change-related tasks.
- Conference Article
1
- 10.18653/v1/2024.emnlp-main.749
- Jan 1, 2024
Writing comprehensive commit messages is tedious yet important, because these messages describe changes of code, such as fixing bugs or adding new features. However, most existing methods focus on either only the changed lines or nearest context lines, without considering the effectiveness of selecting useful contexts. On the other hand, it is possible that introducing excessive contexts can lead to noise. To this end, we propose a code model COMMIT (Context-aware prOMpting based comMIt-message generaTion) in conjunction with a code dataset CODEC (COntext and metaData Enhanced Code dataset). Leveraging program slicing, CODEC consolidates code changes along with related contexts via property graph analysis. Further, utilizing CodeT5+ as the backbone model, we train COMMIT via context-aware prompt on CODEC. Experiments show that COMMIT can surpass all compared models including pre-trained language models for code (code-PLMs) such as CommitBART and large language models for code (code-LLMs) such as Code-LlaMa. Besides, we investigate several research questions (RQs), further verifying the effectiveness of our approach. We release the data and code at:
- Research Article
1
- 10.5281/zenodo.4266643
- May 7, 2020
- Zenodo (CERN European Organization for Nuclear Research)
Replication Package of Augmenting Commit Classification by using Fine-Grained Source Code Changes and a Pre-trained Deep Neural Language Model
- Research Article
1
- 10.1145/3728961
- Jun 22, 2025
- Proceedings of the ACM on Software Engineering
Release notes are essential documents that communicate the details of software updates to users and developers, yet their generation remains a time-consuming and error-prone process. In this paper, we present VerLog, a novel technique that enhances the generation of software release notes using Large Language Models (LLMs). VerLog leverages few-shot in-context learning with adaptive prompting to facilitate the graph reasoning capabilities of LLMs, enabling them to accurately interpret and document the semantic information of code changes. Additionally, VerLog incorporates multi-granularity information, including fine-grained code modifications and high-level non-code artifacts, to guide the generation process and ensure comprehensive, accurate, and readable release notes. We applied VerLog to the 42 releases of 248 unique Android applications and conducted extensive evaluations. Our results demonstrate that VerLog significantly (up to 18%–21% higher precision, recall, and F1) outperforms state-of-the-art baselines in terms of completeness, accuracy, readability, and overall quality of the generated release notes, in both controlled experiments with high-quality reference release notes and in-the-wild evaluations.
- Research Article
- 10.33693/2313-223x-2025-12-3-67-79
- Nov 2, 2025
- Computational nanotechnology
As software systems grow in scale and complexity, the need for intelligent methods of vulnerability detection increases. One such method involves the use of large language models trained on source code, which are capable of analyzing and classifying vulnerable code segments at early stages of development. The effectiveness of these models depends on how the code is represented and how the input data is prepared. Preprocessing methods can significantly impact the accuracy and robustness of the model. The purpose of the study: to analyze the impact of various code preprocessing methods on the accuracy and robustness of large language models (CodeBERT, GraphCodeBERT, UniXcoder) in vulnerability detection tasks. The analysis is conducted using source code changes extracted from commits associated with vulnerabilities documented in the CVE database. The research methodology is an experimental analysis based on evaluation of the effectiveness and robustness of CodeBERT, GraphCodeBERT, and UniXcoder in the task of vulnerability classification. The models are assessed based on their performance using Accuracy and F1 score metrics. Research results: estimates of the effectiveness of different code preprocessing methods when applying large language models to vulnerability classification tasks.
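The two metrics named in the abstract above can be computed as in the following minimal sketch for a binary vulnerable / not-vulnerable classifier. The function name and the toy labels are assumptions made for illustration and do not come from the study.

```python
from typing import Sequence, Tuple


def accuracy_and_f1(y_true: Sequence[int], y_pred: Sequence[int],
                    positive: int = 1) -> Tuple[float, float]:
    """Return (accuracy, F1) for binary labels, treating `positive` as the target class."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("inputs must be non-empty and of equal length")

    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    correct = sum(1 for t, p in pairs if t == p)

    accuracy = correct / len(pairs)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return accuracy, f1


if __name__ == "__main__":
    # Toy labels: 1 = vulnerable change, 0 = benign change.
    truth = [1, 1, 1, 0, 0, 0, 0, 0]
    preds = [1, 1, 0, 1, 0, 0, 0, 0]
    acc, f1 = accuracy_and_f1(truth, preds)
    print(f"accuracy={acc:.3f} f1={f1:.3f}")
```

Reporting both matters when vulnerable commits are rare: a model that labels everything benign can reach high accuracy while its F1 on the vulnerable class is zero.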
- Research Article
- 10.54364/aaiml.2024.44171
- Jan 1, 2024
- Advances in Artificial Intelligence and Machine Learning
Version Control Systems (VCS) manage source code changes by storing modifications in a database. A key feature of VCS is the commit function, which saves the project’s current state and summarizes changes through Commit Message (CM). These messages are vital for collaboration, particularly in open-source artificial intelligence (AI) projects on platforms, where contributors work on rapidly evolving codebases. This paper presents an empirical analysis of CM within open-source AI repositories on GitHub, focusing on their content, the effectiveness of categorization by Large Language Models (LLMs), and the impact of message quality on categorization accuracy. A sample of 384 CMs from 34 repositories was manually categorized to establish a taxonomy. Python was then used for automated keyword extraction, refined with regex patterns. Also, an experiment involved assessing the performance of ChatGPT-4 in categorizing CMs, first without guidance and later using our developed taxonomy. Our findings indicate that the quality of CMs varies greatly, which has a clear impact on how efficiently they can be categorized. This study contributes to the field by providing a structured taxonomy of CMs and exploring how tools like ChatGPT-4 can be used to analyze them. The insights from this research are intended to benefit both academic studies and real-world software development, particularly by helping teams better understand and automate the handling of CM in AI projects.
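The regex-refined keyword extraction mentioned above might look like the following sketch. The category names and patterns are hypothetical placeholders, not the taxonomy developed in the paper, and a message may match several categories at once.

```python
import re
from typing import Dict, List, Pattern

# Hypothetical taxonomy: category name -> keyword pattern (illustrative only).
TAXONOMY: Dict[str, Pattern[str]] = {
    "bug fix": re.compile(r"\b(fix(e[sd])?|bug|patch)\b", re.IGNORECASE),
    "feature": re.compile(r"\b(add(s|ed)?|implement(s|ed)?|feature)\b", re.IGNORECASE),
    "documentation": re.compile(r"\b(docs?|readme|typo)\b", re.IGNORECASE),
    "refactoring": re.compile(r"\b(refactor(s|ed|ing)?|rename[sd]?)\b", re.IGNORECASE),
}


def categorize(message: str) -> List[str]:
    """Return every category whose pattern occurs in the commit message."""
    matches = [label for label, pattern in TAXONOMY.items()
               if pattern.search(message)]
    # Messages with no recognizable keyword fall into a catch-all bucket.
    return matches or ["other"]


if __name__ == "__main__":
    for msg in ("Fix crash when loading tokenizer",
                "Add README section on training",
                "Bump version"):
        print(f"{msg!r} -> {categorize(msg)}")
```

Terse or uninformative messages tend to land in the catch-all bucket under a scheme like this, which is consistent with the abstract's observation that message quality affects how well commits can be categorized.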
- Conference Article
- 10.1145/3719027.3760720
- Nov 19, 2025
Vulnerability patches are essential for managing vulnerabilities in Open-source software (OSS). However, accurately identifying them remains difficult. Existing methods mainly rely on rule-based matching and are not well-suited to ecosystems like WordPress plugins, due to the lack of a unified development standard. In contrast, methods that combine vulnerability descriptions with code changes demonstrate greater potential. However, current prediction models lack deep semantic understanding and thus cannot fully understand the meaning behind the changes.