ML-Powered Root Cause Analysis and Automated Remediation for Java Microservices

Abstract

Detecting performance regressions in production Java microservices is not enough on its own: each regression must also be attributed to a cause and resolved, and in large, rapidly changing codebases this is hard to do without extensive human involvement. The problem is especially challenging in modern continuous delivery environments, where multiple commits are bundled into a release and the root cause must be inferred from heterogeneous streams of performance telemetry, version control data, and incident history. In this article, we build an end-to-end system for code change analysis that combines multi-modal feature engineering, gradient-boosted tree classification, SHAP-based explanations, and large language model code generation. We design an ensemble XGBoost model that learns the non-linear mapping from code change to runtime impact. SHAP values supply theoretically principled, plain-language relevance explanations for each prediction, intended to give engineers confidence that the models are calibrated. A LoRA fine-tuned GPT-4 model then writes production-ready code changes through an AI-orchestrated pull request workflow, with human approval and staged deployment verification remaining mandatory gates. The automation thus becomes an accelerant to engineering judgment rather than a substitute for it. The system is continuously retrained on feedback from engineers to keep pace with codebase changes.
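
To make the SHAP-based attribution step concrete, here is a minimal sketch of the quantity SHAP values represent. It is not the article's implementation. In place of a trained XGBoost model it uses a hypothetical hand-written scoring function (`toy_risk`), and in place of the SHAP library's tree-specific algorithms it uses brute-force enumeration of feature coalitions, which is only feasible for a handful of features. All feature names and coefficients are assumptions chosen for illustration.

```python
from itertools import combinations
from math import factorial


def toy_risk(f):
    """Hypothetical stand-in for a trained model: a regression-risk score."""
    score = 0.02 * f["lines_changed"] + 0.5 * f["new_db_calls"]
    # Non-linear interaction: a lock added on a hot path is what hurts.
    if f["touches_hot_path"] and f["adds_sync_block"]:
        score += 3.0
    return score


def shapley_values(model, x, baseline):
    """Exact Shapley values of model(x) relative to model(baseline).

    Features outside a coalition are set to their baseline value.
    """
    names = list(x)
    n = len(names)
    phi = {}
    for name in names:
        others = [f for f in names if f != name]
        total = 0.0
        for k in range(n):
            # Shapley weight for coalitions of size k that exclude `name`.
            weight = factorial(k) * factorial(n - k - 1) / factorial(n)
            for subset in combinations(others, k):
                without = {f: (x[f] if f in subset else baseline[f]) for f in names}
                with_f = dict(without)
                with_f[name] = x[name]
                total += weight * (model(with_f) - model(without))
        phi[name] = total
    return phi


# Hypothetical features extracted from one commit, and a "no change" baseline.
change = {"lines_changed": 100, "touches_hot_path": 1,
          "adds_sync_block": 1, "new_db_calls": 2}
baseline = {name: 0 for name in change}

phi = shapley_values(toy_risk, change, baseline)
for name, value in sorted(phi.items(), key=lambda kv: -kv[1]):
    print(f"{name}: {value:+.2f}")
```

The attributions sum to the difference between the score for the change and the score for the baseline, and the interaction term is split evenly between the two features that jointly trigger it. That additive property is what allows each feature's contribution to be reported to an engineer in plain language.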

Similar Papers
  • Research Article
  • 10.1007/s00330-026-12445-3
Comparison of proprietary and fine-tuned large language models for multi-label classification of billing codes from radiology reports.
  • Mar 14, 2026
  • European radiology
  • Kamyar Arzideh + 12 more

While large language models (LLMs) have shown promise in medical text analysis, their application in automated medical billing code extraction remains underexplored, particularly for the German medical fee schedule system (GOÄ). Therefore, an LLM was fine-tuned to perform multi-label classification of GOÄ codes from radiology reports automatically, and its performance was compared with state-of-the-art commercial and open-source LLMs. Following ethics committee approval, we analyzed 499,601 radiology reports from 124,497 patients, containing 1,799,971 manually identified GOÄ codes as ground truth. The MediPhi-Instruct 4B model was fine-tuned using five-fold cross-validation. Performance was evaluated on the hold-out test set and compared against GPT-5, GPT-4.1, GPT-oss, Kimi-K2, Deepseek-R1, Deepseek-V3, Gemini 2.5, Llama-70B, and Qwen-3 LLMs on a subset of 500 anonymized and 350 cleaned reports using zero-shot and few-shot prompting techniques. The fine-tuned model achieved an accuracy of 77.15% ± 0.47% and a micro-average F1-score of 87.79% ± 0.31% on the hold-out test set. On a subset of 500 real-world samples, our models outperformed the best-performing LLM, Gemini 2.5 Flash, with an F1-score of 70.32% ± 1.54% compared to 58.22% ± 1.50% (p < 0.001). For the cleaned dataset of 350 samples, GPT-5 achieved the best F1-score of 89.51 ± 1.52% and outperformed the fine-tuned models (p < 0.001). Fine-tuned LLMs can effectively automate GOÄ code classification from radiology reports, with the potential of outperforming commercial LLMs. This approach shows promise for improving billing efficiency and accuracy in healthcare settings, though manual verification is still recommended.
Question: LLMs with high parameter counts possess medical knowledge, but how effective are they at predicting billing codes from radiology reports compared to smaller, fine-tuned models?
Findings: A fine-tuned ensemble model achieved competitive results and can outperform larger, proprietary LLMs.
Clinical relevance: Smaller, fine-tuned models offer an efficient alternative to proprietary LLMs in generating billing codes and can be integrated to assist clinical coding. This technology has the potential to transform clinical billing procedures, but its use should be overseen by qualified professional personnel.

  • Conference Article
  • Citations: 15
  • 10.1109/apsec.2016.028
Splitting Commits via Past Code Changes
  • Jan 1, 2016
  • Hiroyuki Kirinuki + 3 more

It is generally said that we should not perform code changes for multiple tasks in a single commit. Such code changes are called tangled ones. Committing tangled changes is harmful to developers. For example, it is costly to merge a part of tangled changes with other commits. Moreover, the presence of such tangled changes hinders analyzing code repositories. That is because most of the mining software repository approaches are designed under the assumption that every commit includes only changes for a single task. In this paper, we propose a technique which informs developers that they are about to commit tangled changes. The technique also suggests how to split a given commit into multiple commits by using past code changes. The proposed technique allows developers to determine whether they accept the suggestion or commit as it stands. By providing such support to developers, they can avoid committing tangled changes.

  • Conference Article
  • Citations: 1
  • 10.22323/1.378.0025
Using Natural Language Processing to Extract Information from Unstructured code-change version control data: lessons learned
  • Oct 22, 2021
  • Elisabetta Ronchieri + 2 more

Context: Natural Language Processing (NLP) is a branch of artificial intelligence that extracts information from language. In the field of software engineering, NLP has been employed to extract key information from free-form text, to generate models from the analysis of text or to categorize code changes according to their commit messages. In the literature, most NLP-based approaches have focused on the impact of code changes on program execution or software architecture. Objective: In this study, we have applied NLP to code-change data to identify patterns of software code modifications and used Machine Learning techniques to build a model that determines how software has evolved over time and identifies areas of code that present problems. Method: Considering that software projects use version control systems, such as GitHub, to manage their code, we have collected software information by using git commands. These data contain different unstructured information about the various files in a project. Each modification entry includes a message that explains the reasons for the change. According to the content of the message, it is possible to identify key terms that can be used during the classification of the entries. Results: In this study, we have considered the change history of software available on GitHub to the High Energy Physics community. With the use of NLP techniques we have cleaned the messages and extracted some key terms to categorize both software problems and some other changes performed by developers, like the addition of a third party dependency or a script that starts a given service. We have built a code change dictionary combining the terms already in existing literature with the ones gathered directly from the software and its GitHub repository.
Finally, we have applied some Machine Learning (ML) techniques to determine any connection between code changes and software problems: we have removed redundant entries to avoid any bias in the outcomes of the ML techniques. Conclusion: We describe in detail the approach adopted to construct historical code change datasets of categorized commit messages by following a multi-label classification methodology. Our model performance seems promising in terms of accuracy, precision, recall and F1-score.

  • Research Article
  • Citations: 4
  • 10.3389/fdata.2024.1501154
Enhancing sentiment and intent analysis in public health via fine-tuned Large Language Models on tobacco and e-cigarette-related tweets
  • Nov 28, 2024
  • Frontiers in Big Data
  • Sherif Elmitwalli + 3 more

Background: Accurate sentiment analysis and intent categorization of tobacco and e-cigarette-related social media content are critical for public health research, yet they necessitate specialized natural language processing approaches. Objective: To compare pre-trained and fine-tuned Flan-T5 models for intent classification and sentiment analysis of tobacco and e-cigarette tweets, demonstrating the effectiveness of pre-training a lightweight large language model for domain specific tasks. Methods: Three Flan-T5 classification models were developed: (1) tobacco intent, (2) e-cigarette intent, and (3) sentiment analysis. Domain-specific datasets with tobacco and e-cigarette tweets were created using GPT-4 and validated by tobacco control specialists using a rigorous evaluation process. A standardized rubric and consensus mechanism involving domain specialists ensured high-quality datasets. The Flan-T5 Large Language Models were fine-tuned using Low-Rank Adaptation and evaluated against pre-trained baselines on the datasets using accuracy performance metrics. To further assess model generalizability and robustness, the fine-tuned models were evaluated on real-world tweets collected around the COP9 event. Results: In every task, fine-tuned models performed much better than pre-trained models. Compared to the pre-trained model's accuracy of 0.33, the fine-tuned model achieved an overall accuracy of 0.91 for tobacco intent classification. The fine-tuned model achieved an accuracy of 0.93 for e-cigarette intent, which is higher than the accuracy of 0.36 for the pre-trained model.
The fine-tuned model significantly outperformed the pre-trained model's accuracy of 0.65 in sentiment analysis, achieving an accuracy of 0.94 for sentiments. Conclusion: The effectiveness of lightweight Flan-T5 models in analyzing tweets associated with tobacco and e-cigarette is significantly improved by domain-specific fine-tuning, providing highly accurate instruments for tracking public conversation on tobacco and e-cigarette. The involvement of domain specialists in dataset validation ensured that the generated content accurately represented real-world discussions, thereby enhancing the quality and reliability of the results. Research on tobacco control and the formulation of public policy could be informed by these findings.

  • Research Article
  • Citations: 4
  • 10.1145/3735129
MORepair: Teaching LLMs to Repair Code via Multi-Objective Fine-Tuning
  • Jan 21, 2026
  • ACM Transactions on Software Engineering and Methodology
  • Boyang Yang + 7 more

Within the realm of software engineering, specialized tasks on code, such as program repair, present unique challenges, necessitating fine-tuning large language models (LLMs) to unlock state-of-the-art performance. Fine-tuning approaches proposed in the literature for LLMs on program repair tasks generally overlook the need to reason about the logic behind code changes, beyond syntactic patterns in the data. High-performing fine-tuning experiments also usually come at very high computational costs. With MORepair, we propose a novel perspective on the learning focus of LLM fine-tuning for program repair: we not only adapt the LLM parameters to the syntactic nuances of the task of code transformation (objective ➊), but we also specifically fine-tune the LLM with respect to the logical reason behind the code change in the training data (objective ➋). Such a multi-objective fine-tuning will instruct LLMs to generate high-quality patches. We apply MORepair to fine-tune four open-source LLMs with different sizes and architectures. Experimental results on function-level and repository-level repair benchmarks show that the implemented fine-tuning effectively boosts LLM repair performance by 11.4% to 56.0%. We further show that our fine-tuning strategy yields superior performance compared to the state-of-the-art approaches, including standard fine-tuning, Fine-tune-CoT, and RepairLLaMA.

  • Research Article
  • Citations: 7
  • 10.1111/coin.12651
Cost‐sensitive tree SHAP for explaining cost‐sensitive tree‐based models
  • Jun 1, 2024
  • Computational Intelligence
  • Marija Kopanja + 3 more

Cost‐sensitive ensemble learning as a combination of two approaches, ensemble learning and cost‐sensitive learning, enables generation of cost‐sensitive tree‐based ensemble models using the cost‐sensitive decision tree (CSDT) learning algorithm. In general, tree‐based models characterize nice graphical representation that can explain a model's decision‐making process. However, the depth of the tree and the number of base models in the ensemble can be a limiting factor in comprehending the model's decision for each sample. The CSDT models are widely used in finance (e.g., credit scoring and fraud detection) but lack effective explanation methods. We previously addressed this gap with cost‐sensitive tree Shapley Additive Explanation Method (CSTreeSHAP), a cost‐sensitive tree explanation method for the single‐tree CSDT model. Here, we extend the introduced methodology to cost‐sensitive ensemble models, particularly cost‐sensitive random forest models. The paper details the theoretical foundation and implementation details of CSTreeSHAP for both single CSDT and ensemble models. The usefulness of the proposed method is demonstrated by providing explanations for single and ensemble CSDT models trained on well‐known benchmark credit scoring datasets. Finally, we apply our methodology and analyze the stability of explanations for those models compared to the cost‐insensitive tree‐based models. Our analysis reveals statistically significant differences between SHAP values despite seemingly similar global feature importance plots of the models. This highlights the value of our methodology as a comprehensive tool for explaining CSDT models.

  • Conference Article
  • Citations: 2
  • 10.1145/3524610.3528386
Impact of change granularity in refactoring detection
  • May 16, 2022
  • Lei Chen + 1 more

Detecting refactorings in commit history is essential to improve the comprehension of code changes in code reviews and to provide valuable information for empirical studies on software evolution. Several techniques have been proposed to detect refactorings accurately at the granularity level of a single commit. However, refactorings may be performed over multiple commits because of code complexity or other real development problems, which is why attempting to detect refactorings at single-commit granularity is insufficient. We observe that some refactorings can be detected only at coarser granularity, that is, changes spread across multiple commits. Herein, this type of refactoring is referred to as coarse-grained refactoring (CGR). We compared the refactorings detected on different granularities of commits from 19 open-source repositories. The results show that CGRs are common, and their frequency increases as the granularity becomes coarser. In addition, we found that Move-related refactorings tended to be the most frequent CGRs. We also analyzed the causes of CGR and suggested that CGRs will be valuable in refactoring research.

  • Conference Article
  • Citations: 5
  • 10.1145/3196398.3196406
A study on inappropriately partitioned commits
  • May 28, 2018
  • Ryo Arima + 2 more

When we use code repositories, each commit should include code changes for only a single task, and code changes for a single task should not be scattered over multiple commits. There are many studies on the former violation, often referred to as tangled commits, but the latter violation has been out of scope for MSR research. In this paper, we first investigate how many and what kinds of inappropriately partitioned commits exist in Java projects. Then, we propose a simple technique to detect such commits automatically. We also report evaluation results of the proposed technique.

  • Conference Article
  • Citations: 4
  • 10.1109/msr.2019.00018
Empirical Study in using Version Histories for Change Risk Classification
  • May 1, 2019
  • Max Kiehn + 2 more

Many techniques have been proposed for mining software repositories, predicting code quality and evaluating code changes. Prior work has established links between code ownership and churn metrics, and software quality at file and directory level based on changes that fix bugs. Other metrics have been used to evaluate individual code changes based on preceding changes that induce fixes. This paper combines the two approaches in an empirical study of assessing risk of code changes using established code ownership and churn metrics with fix inducing changes on a large proprietary code repository. We establish a machine learning model for change risk classification which achieves average precision of 0.76 using metrics from prior works and 0.90 using a wider array of metrics. Our results suggest that code ownership metrics can be applied in change risk classification models based on fix inducing changes.

  • Conference Article
  • Citations: 96
  • 10.1145/3127005.3127016
Boosting Automatic Commit Classification Into Maintenance Activities By Utilizing Source Code Changes
  • Nov 8, 2017
  • Stanislav Levin + 1 more

Background: Understanding maintenance activities performed in a source code repository could help practitioners reduce uncertainty and improve cost-effectiveness by planning ahead and pre-allocating resources towards source code maintenance. The research community uses 3 main classification categories for maintenance activities: Corrective: fault fixing; Perfective: system improvements; Adaptive: new feature introduction. Previous work in this area has mostly concentrated on evaluating commit classification (into maintenance activities) models in the scope of a single software project.

  • Conference Article
  • 10.1109/compsac.2017.155
Investigation and Detection of Split Commit
  • Jul 1, 2017
  • Ryo Arima + 2 more

Each commit in repositories of version control systems should include code changes for only a single task. However, in real repositories, there are many commits for multiple tasks and tasks split into multiple commits. We call the latter split commits. In this research, we firstly investigate how many and what kinds of split commits are included in repositories. Then, we classify the found split commits into three categories. Based on the classification, we propose a new technique to detect split commits automatically. This is the first research that proposes a technique to detect split commits. To evaluate the proposed technique, we apply it to repositories of two open source software. The results show that the proposed technique detects split commits with high accuracy (precision is 0.8 and F-measure is 0.7).

  • Supplementary Content
  • Citations: 2
  • 10.5167/uzh-61703
Fine-grained code changes and bugs: Improving bug prediction
  • Jan 1, 2012
  • Zurich Open Repository and Archive (University of Zurich)
  • Emanuel Giger

Software development and, in particular, software maintenance are time consuming and require detailed knowledge of the structure and the past development activities of a software system. Limited resources and time constraints make the situation even more difficult. Therefore, a significant amount of research effort has been dedicated to learning software prediction models that allow project members to allocate and spend the limited resources efficiently on the (most) critical parts of their software system. Prominent examples are bug prediction models and change prediction models: Bug prediction models identify the bug-prone modules of a software system that should be tested with care; change prediction models identify modules that change frequently and in combination with other modules, i.e., they are change coupled. By combining statistical methods, data mining approaches, and machine learning techniques software prediction models provide a structured and analytical basis to make decisions. Researchers proposed a wide range of approaches to build effective prediction models that take into account multiple aspects of the software development process. They achieved especially good prediction performance, guiding developers towards those parts of their system where a large share of bugs can be expected. For that, they rely on change data provided by version control systems (VCS). However, due to the fact that current VCS track code changes only on file-level and textual basis most of those approaches suffer from coarse-grained and rather generic change information. More fine-grained change information, for instance, at the level of source code statements, and the type of changes, e.g., whether a method was renamed or a condition expression was changed, are often not taken into account.
Therefore, investigating the development process and the evolution of software at a fine-grained change level has recently experienced an increasing attention in research. The key contribution of this thesis is to improve software prediction models by using fine-grained source code changes. Those changes are based on the abstract syntax tree structure of source code and allow us to track code changes at the fine-grained level of individual statements. We show with a series of empirical studies using the change history of open-source projects how prediction models can benefit in terms of prediction performance and prediction granularity from the more detailed change information. First, we compare fine-grained source code changes and code churn, i.e., lines modified, for bug prediction. The results with data from the Eclipse platform show that fine-grained source code changes significantly outperform code churn when classifying source files into bug- and not bug-prone, as well as when predicting the number of bugs in source files. Moreover, these results give more insights about the relation of individual types of code changes, e.g., method declaration changes and bugs. For instance, in our dataset method declaration changes exhibit a stronger correlation with the number of bugs than class declaration changes. Second, we leverage fine-grained source code changes to predict bugs at method-level. This is beneficial as files can grow arbitrarily large. Hence, if bugs are predicted at the level of files a developer needs to manually inspect all methods of a file one by one until a particular bug is located. Third, we build models using source code properties, e.g., complexity, to predict whether a source file will be affected by a certain type of code change.
Predicting the type of changes is of practical interest, for instance, in the context of software testing as different change types require different levels of testing: While for small statement changes local unit-tests are mostly sufficient, API changes, e.g., method declaration changes, might require system-wide integration-tests which are more expensive. Hence, knowing (in advance) which types of changes will most likely occur in a source file can help to better plan and develop tests, and, in case of limited resources, prioritize among different types of testing. Finally, to assist developers in bug triaging we compute prediction models based on the attributes of a bug report that can be used to estimate whether a bug will be fixed fast or whether it will take more time for resolution. The results and findings of this thesis give evidence that fine-grained source code changes can improve software prediction models to provide more accurate results.

  • Research Article
  • Citations: 57
  • 10.1038/s41598-022-15231-5
Advantages of deep learning with convolutional neural network in detecting disc displacement of the temporomandibular joint in magnetic resonance imaging
  • Jul 5, 2022
  • Scientific Reports
  • Yeon-Hee Lee + 4 more

This study investigated the usefulness of deep learning-based automatic detection of anterior disc displacement (ADD) from magnetic resonance imaging (MRI) of patients with temporomandibular joint disorder (TMD). Sagittal MRI images of 2520 TMJs were collected from 861 men and 399 women (average age 37.33 ± 18.83 years). A deep learning algorithm with a convolutional neural network was developed. Data augmentation and the Adam optimizer were applied to reduce the risk of overfitting the deep-learning model. The prediction performances were compared between the models and human experts based on areas under the curve (AUCs). The fine-tuning model showed excellent prediction performance (AUC = 0.8775) and acceptable accuracy (approximately 77%). Comparing the AUC values of the from-scratch (0.8269) and freeze models (0.5858) showed lower performances of the other models compared to the fine-tuning model. In Grad-CAM visualizations, the fine-tuning scheme focused more on the TMJ disc when judging ADD, and the sparsity was higher than that of the from-scratch scheme (84.69% vs. 55.61%, p < 0.05). The three fine-tuned ensemble models using different data augmentation techniques showed a prediction accuracy of 83%. Moreover, the AUC values of ADD were higher when patients with TMD were divided by age (0.8549–0.9275) and sex (male: 0.8483, female: 0.9276). While the accuracy of the ensemble model was higher than that of human experts, the difference was not significant (p = 0.1987–0.0671). Learning from pre-trained weights allowed the fine-tuning model to outperform the from-scratch model. Another benefit of the fine-tuning model for diagnosing ADD of TMJ in Grad-CAM analysis was the deactivation of unwanted gradient values to provide clearer visualizations compared to the from-scratch model. The Grad-CAM visualizations also agreed with the model learned through important features in the joint disc area. 
The accuracy was further improved by an ensemble of three fine-tuning models using diversified data. The main benefits of this model were the higher specificity compared to human experts, which may be useful for preventing true negative cases, and the maintenance of its prediction accuracy across sexes and ages, suggesting a generalized prediction.

  • Research Article
  • Citations: 35
  • 10.1142/s2424922x21410023
CatBoost — An Ensemble Machine Learning Model for Prediction and Classification of Student Academic Performance
  • Jul 1, 2021
  • Advances in Data Science and Adaptive Analysis
  • Abhisht Joshi + 5 more

In every educational institution, predicting pupils’ performance is a vital responsibility. Due to this, a variety of data mining techniques, such as clustering, classification, and regression, are applied to anticipate the learner’s study behavior. By Machine Learning’s arrival, it has become vital to forecast students’ academic achievement, and this study attracts significant attention within the scientific community. In addition, the findings from this work have tremendous socio-economic consequences. One area of major research in the world of education today is educational data mining, which is the study of techniques to reveal hidden patterns in educational data. Data mining strategies succeed or fail to depend on the type and quality of the data that is being mined. Here, we provide a novel method that enhances the accuracy of prior student performance prediction by identifying and providing an explanation as to why it is rising. Using our robust machine learning ensemble models, we propose and evaluate a prediction model. The findings demonstrate that our CatBoost — an ensemble machine learning model — is superior to standard machine learning models with an accuracy of 92.27%. This new model was able to show itself to be dependable by the use of smote and hyperparameter optimization, which proved to be valuable methods and approaches. Additional features are significant as well. More critically, a unique method is utilized to increase model transparency. The SHAP values are a valuable part of the student performance prediction system, which we think should be integrated. For those educators tasked with using prediction models in education, we have found that there is a preference for models that offer both insightful insights and easy to understand predictions, as by utilizing our experiment the educator will be able to identify those students who are at early risk and inspire and encourage these students in a positive way.

  • Research Article
  • Citations: 41
  • 10.1007/s10664-018-9676-8
Associating working memory capacity and code change ordering with code review performance
  • Jan 2, 2019
  • Empirical Software Engineering
  • Tobias Baum + 2 more

Change-based code review is a software quality assurance technique that is widely used in practice. Therefore, better understanding what influences performance in code reviews and finding ways to improve it can have a large impact. In this study, we examine the association of working memory capacity and cognitive load with code review performance and we test the predictions of a recent theory regarding improved code review efficiency with certain code change part orders. We perform a confirmatory experiment with 50 participants, mostly professional software developers. The participants performed code reviews on one small and two larger code changes from an open source software system to which we had seeded additional defects. We measured their efficiency and effectiveness in defect detection, their working memory capacity, and several potential confounding factors. We find that there is a moderate association between working memory capacity and the effectiveness of finding delocalized defects, influenced by other factors, whereas the association with other defect types is almost non-existing. We also confirm that the effectiveness of reviews is significantly larger for small code changes. We cannot conclude reliably whether the order of presenting the code change parts influences the efficiency of code review.
