ML-Powered Root Cause Analysis and Automated Remediation for Java Microservices
Detecting performance regressions in production Java microservices is not enough on its own: in large, rapidly changing codebases, regressions must also be attributed and resolved without wide-ranging human involvement. This is challenging in modern continuous delivery environments, where multiple commits are bundled into a release and the root cause has to be inferred from heterogeneous streams of performance telemetry, version control, and incident history. In this article, let’s build an end-to-end system for code change analysis that combines multi-modal feature engineering, gradient-boosted tree classification, SHAP-based explanations, and large language model code generation. We design an ensemble XGBoost model that learns the non-linear mapping from code change to runtime impact. SHAP values then give theoretically principled, plain-language relevance explanations that help engineers trust that these models are calibrated. A LoRA fine-tuned GPT-4 model writes production-ready code changes through an AI-orchestrated pull request workflow, with human approval and staged deployment verification remaining as mandatory gates. The automation becomes an accelerant to engineering judgment rather than a substitute for it. The system is continuously retrained on feedback from engineers to accommodate codebase changes.
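To make the SHAP step concrete, here is a minimal sketch of the idea behind it: exact Shapley values computed by brute force for a single prediction. It deliberately does not use the `shap` or `xgboost` libraries, and it is not the article's implementation. The function `toy_risk_model`, the feature names, and the baseline values are all hypothetical stand-ins for a trained model and its inputs.

```python
from itertools import combinations
from math import factorial


def shapley_values(predict, instance, baseline):
    """Exact Shapley values for one prediction.

    Features outside a coalition are replaced by their baseline value.
    The cost is exponential in the number of features, so this is only
    suitable for tiny illustrative examples.
    """
    names = list(instance)
    n = len(names)

    def value(coalition):
        # Evaluate the model with only the coalition's features "switched on".
        x = {k: (instance[k] if k in coalition else baseline[k]) for k in names}
        return predict(x)

    phi = {}
    for feature in names:
        others = [k for k in names if k != feature]
        total = 0.0
        for size in range(n):
            # Standard Shapley weight for coalitions of this size.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = set(subset)
                total += weight * (value(s | {feature}) - value(s))
        phi[feature] = total
    return phi


def toy_risk_model(x):
    """Hypothetical regression-risk score (stand-in for a trained model)."""
    score = 0.02 * x["lines_changed"] + 1.5 * x["touches_hot_path"]
    # Interaction term: a new allocation matters more on a hot path.
    score += 2.0 * x["touches_hot_path"] * x["adds_allocation"]
    return score


# Hypothetical commit features and a hypothetical "typical commit" baseline.
commit = {"lines_changed": 120, "touches_hot_path": 1, "adds_allocation": 1}
baseline = {"lines_changed": 20, "touches_hot_path": 0, "adds_allocation": 0}

phi = shapley_values(toy_risk_model, commit, baseline)
for name, contribution in phi.items():
    print(f"{name}: {contribution:+.2f}")
```

The contributions sum to the difference between the commit's score and the baseline score, which is the property that lets each explanation be read as "this feature moved the prediction by this much". Note how the interaction term is split evenly between the two features involved.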
- Research Article
- 10.1007/s00330-026-12445-3
- Mar 14, 2026
- European radiology
While large language models (LLMs) have shown promise in medical text analysis, their application in automated medical billing code extraction remains underexplored, particularly for the German medical fee schedule system (GOÄ). Therefore, an LLM was fine-tuned to perform multi-label classification of GOÄ codes from radiology reports automatically, and its performance was compared with state-of-the-art commercial and open-source LLMs. Following ethics committee approval, we analyzed 499,601 radiology reports from 124,497 patients, containing 1,799,971 manually identified GOÄ codes as ground truth. The MediPhi-Instruct 4B model was fine-tuned using five-fold cross-validation. Performance was evaluated on the hold-out test set and compared against GPT-5, GPT-4.1, GPT-oss, Kimi-K2, Deepseek-R1, Deepseek-V3, Gemini 2.5, Llama-70B, and Qwen-3 LLMs on a subset of 500 anonymized and 350 cleaned reports using zero-shot and few-shot prompting techniques. The fine-tuned model achieved an accuracy of 77.15% ± 0.47% and a micro-average F1-score of 87.79% ± 0.31% on the hold-out test set. On a subset of 500 real-world samples, our models outperformed the best-performing LLM, Gemini 2.5 Flash, with an F1-score of 70.32% ± 1.54% compared to 58.22% ± 1.50% (p < 0.001). For the cleaned dataset of 350 samples, GPT-5 achieved the best F1-score of 89.51 ± 1.52% and outperformed the fine-tuned models (p < 0.001). Fine-tuned LLMs can effectively automate GOÄ code classification from radiology reports, with the potential of outperforming commercial LLMs. This approach shows promise for improving billing efficiency and accuracy in healthcare settings, though manual verification is still recommended. Question LLMs with high parameter counts possess medical knowledge, but how effective are they at predicting billing codes from radiology reports compared to smaller, fine-tuned models? Findings A fine-tuned ensemble model achieved competitive results and can outperform larger, proprietary LLMs. 
Clinical relevance Smaller, fine-tuned models offer an efficient alternative to proprietary LLMs in generating billing codes and can be integrated to assist clinical coding. This technology has the potential to transform clinical billing procedures, but its use should be overseen by qualified professional personnel.
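The gap between the reported accuracy and the micro-average F1-score is easier to read with a small example. The sketch below assumes that "accuracy" means exact-match accuracy over the full set of codes per report, which the abstract does not state. The label names `A` to `D` are placeholders, not real GOÄ codes.

```python
def micro_f1(y_true, y_pred):
    """Micro-averaged F1: pool true/false positives over all reports and labels."""
    tp = fp = fn = 0
    for true, pred in zip(y_true, y_pred):
        tp += len(true & pred)   # codes predicted and correct
        fp += len(pred - true)   # codes predicted but not in the ground truth
        fn += len(true - pred)   # ground-truth codes that were missed
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def exact_match_accuracy(y_true, y_pred):
    """Fraction of reports whose predicted code set matches the truth exactly."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


# Placeholder label sets for three hypothetical reports.
y_true = [{"A", "B"}, {"C"}, {"A", "C", "D"}]
y_pred = [{"A", "B"}, {"C", "D"}, {"A", "C"}]

print(f"micro F1:             {micro_f1(y_true, y_pred):.3f}")
print(f"exact-match accuracy: {exact_match_accuracy(y_true, y_pred):.3f}")
```

Under this assumption, a single wrong or missing code makes a whole report count as incorrect for accuracy, while micro F1 still credits the codes that were right. That is one plausible reason the two figures can differ substantially.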
- Conference Article
15
- 10.1109/apsec.2016.028
- Jan 1, 2016
It is generally said that we should not perform code changes for multiple tasks in a single commit. Such code changes are called tangled ones. Committing tangled changes is harmful to developers. For example, it is costly to merge a part of tangled changes with other commits. Moreover, the presence of such tangled changes hinders analyzing code repositories. That is because most of the mining software repository approaches are designed under the assumption that every commit includes only changes for a single task. In this paper, we propose a technique which informs developers that they are about to commit tangled changes. The technique also suggests how to split a given commit into multiple commits by using past code changes. The proposed technique allows developers to determine whether they accept the suggestion or commit as it stands. By providing such support to developers, they can avoid committing tangled changes.
- Conference Article
1
- 10.22323/1.378.0025
- Oct 22, 2021
Context: Natural Language Processing (NLP) is a branch of artificial intelligence that extracts information from language. In the field of software engineering, NLP has been employed to extract key information from free-form text, to generate models from the analysis of text or to categorize code changes according to their commit messages. In the literature, most NLP-based approaches have focused on the impact of code changes on program execution or software architecture. Objective: In this study, we have applied NLP to code-change data to identify patterns of software code modifications and used Machine Learning techniques to build a model that determines how software has evolved over time and identifies areas of code that present problems. Method: Considering that software projects use version control systems, such as GitHub, to manage their code, we have collected software information by using git commands. These data contain different unstructured information about the various files in a project. Each modification entry includes a message that explains the reasons for the change. According to the content of the message, it is possible to identify key terms that can be used during the classification of the entries. Results: In this study, we have considered the change history of software available on GitHub to the High Energy Physics community. With the use of NLP techniques we have cleaned the messages and extracted some key terms to categorize both software problems and some other changes performed by developers, like the addition of a third-party dependency or a script that starts a given service. We have built a code change dictionary combining the terms already in existing literature with the ones gathered directly from the software and its GitHub repository. 
Finally, we have applied some Machine Learning (ML) techniques to determine any connection between code changes and software problems: we have removed redundant entries to avoid any bias in the outcomes of the ML techniques. Conclusion: We show in detail our approach adopted to construct historical code change datasets of categorized commit messages by following a multi-label classification methodology. Our model performance seems promising in terms of accuracy, precision, recall and F1-score.
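The dictionary-based categorization described above can be sketched roughly as follows. This is an illustration only, not the authors' pipeline. The category names and key terms in `CHANGE_DICTIONARY` are hypothetical, the tokenizer is a plain regular expression with no stemming, and duplicates are detected by normalized message text.

```python
import re

# Hypothetical code change dictionary: category -> key terms.
CHANGE_DICTIONARY = {
    "bug_fix": {"fix", "bug", "crash", "error"},
    "dependency": {"dependency", "upgrade", "bump"},
    "service_script": {"script", "service", "startup"},
}


def tokenize(message):
    """Lowercase the message and keep alphabetic tokens only."""
    return set(re.findall(r"[a-z]+", message.lower()))


def categorize(message, dictionary=CHANGE_DICTIONARY):
    """Return every category whose key terms appear in the message (multi-label)."""
    tokens = tokenize(message)
    return sorted(label for label, terms in dictionary.items() if tokens & terms)


def deduplicate(messages):
    """Drop repeated messages (case- and whitespace-insensitive), keeping order."""
    seen = set()
    unique = []
    for message in messages:
        key = message.strip().lower()
        if key not in seen:
            seen.add(key)
            unique.append(message)
    return unique


# Hypothetical commit messages.
log = [
    "Fix crash after dependency upgrade",
    "Add startup script for the service",
    "fix crash after dependency upgrade ",
    "Update README",
]
for message in deduplicate(log):
    print(f"{message!r} -> {categorize(message)}")
```

Because there is no stemming, a message containing "fixes" would not match the term "fix"; a real pipeline would normally normalize word forms before lookup. A message that matches no term receives an empty label list.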
- Research Article
4
- 10.3389/fdata.2024.1501154
- Nov 28, 2024
- Frontiers in Big Data
Background: Accurate sentiment analysis and intent categorization of tobacco and e-cigarette-related social media content are critical for public health research, yet they necessitate specialized natural language processing approaches. Objective: To compare pre-trained and fine-tuned Flan-T5 models for intent classification and sentiment analysis of tobacco and e-cigarette tweets, demonstrating the effectiveness of pre-training a lightweight large language model for domain specific tasks. Methods: Three Flan-T5 classification models were developed: (1) tobacco intent, (2) e-cigarette intent, and (3) sentiment analysis. Domain-specific datasets with tobacco and e-cigarette tweets were created using GPT-4 and validated by tobacco control specialists using a rigorous evaluation process. A standardized rubric and consensus mechanism involving domain specialists ensured high-quality datasets. The Flan-T5 Large Language Models were fine-tuned using Low-Rank Adaptation and evaluated against pre-trained baselines on the datasets using accuracy performance metrics. To further assess model generalizability and robustness, the fine-tuned models were evaluated on real-world tweets collected around the COP9 event. Results: In every task, fine-tuned models performed much better than pre-trained models. Compared to the pre-trained model's accuracy of 0.33, the fine-tuned model achieved an overall accuracy of 0.91 for tobacco intent classification. The fine-tuned model achieved an accuracy of 0.93 for e-cigarette intent, which is higher than the accuracy of 0.36 for the pre-trained model. 
The fine-tuned model significantly outperformed the pre-trained model's accuracy of 0.65 in sentiment analysis, achieving an accuracy of 0.94 for sentiments. Conclusion: The effectiveness of lightweight Flan-T5 models in analyzing tweets associated with tobacco and e-cigarette is significantly improved by domain-specific fine-tuning, providing highly accurate instruments for tracking public conversation on tobacco and e-cigarette. The involvement of domain specialists in dataset validation ensured that the generated content accurately represented real-world discussions, thereby enhancing the quality and reliability of the results. Research on tobacco control and the formulation of public policy could be informed by these findings.
- Research Article
4
- 10.1145/3735129
- Jan 21, 2026
- ACM Transactions on Software Engineering and Methodology
Within the realm of software engineering, specialized tasks on code, such as program repair, present unique challenges, necessitating fine-tuning large language models (LLMs) to unlock state-of-the-art performance. Fine-tuning approaches proposed in the literature for LLMs on program repair tasks generally overlook the need to reason about the logic behind code changes, beyond syntactic patterns in the data. High-performing fine-tuning experiments also usually come at very high computational costs. With MORepair, we propose a novel perspective on the learning focus of LLM fine-tuning for program repair: we not only adapt the LLM parameters to the syntactic nuances of the task of code transformation (objective ➊), but we also specifically fine-tune the LLM with respect to the logical reason behind the code change in the training data (objective ➋). Such a multi-objective fine-tuning will instruct LLMs to generate high-quality patches. We apply MORepair to fine-tune four open-source LLMs with different sizes and architectures. Experimental results on function-level and repository-level repair benchmarks show that the implemented fine-tuning effectively boosts LLM repair performance by 11.4% to 56.0%. We further show that our fine-tuning strategy yields superior performance compared to the state-of-the-art approaches, including standard fine-tuning, Fine-tune-CoT, and RepairLLaMA.
- Research Article
7
- 10.1111/coin.12651
- Jun 1, 2024
- Computational Intelligence
Cost‐sensitive ensemble learning as a combination of two approaches, ensemble learning and cost‐sensitive learning, enables generation of cost‐sensitive tree‐based ensemble models using the cost‐sensitive decision tree (CSDT) learning algorithm. In general, tree‐based models characterize nice graphical representation that can explain a model's decision‐making process. However, the depth of the tree and the number of base models in the ensemble can be a limiting factor in comprehending the model's decision for each sample. The CSDT models are widely used in finance (e.g., credit scoring and fraud detection) but lack effective explanation methods. We previously addressed this gap with cost‐sensitive tree Shapley Additive Explanation Method (CSTreeSHAP), a cost‐sensitive tree explanation method for the single‐tree CSDT model. Here, we extend the introduced methodology to cost‐sensitive ensemble models, particularly cost‐sensitive random forest models. The paper details the theoretical foundation and implementation details of CSTreeSHAP for both single CSDT and ensemble models. The usefulness of the proposed method is demonstrated by providing explanations for single and ensemble CSDT models trained on well‐known benchmark credit scoring datasets. Finally, we apply our methodology and analyze the stability of explanations for those models compared to the cost‐insensitive tree‐based models. Our analysis reveals statistically significant differences between SHAP values despite seemingly similar global feature importance plots of the models. This highlights the value of our methodology as a comprehensive tool for explaining CSDT models.
- Conference Article
2
- 10.1145/3524610.3528386
- May 16, 2022
Detecting refactorings in commit history is essential to improve the comprehension of code changes in code reviews and to provide valuable information for empirical studies on software evolution. Several techniques have been proposed to detect refactorings accurately at the granularity level of a single commit. However, refactorings may be performed over multiple commits because of code complexity or other real development problems, which is why attempting to detect refactorings at single-commit granularity is insufficient. We observe that some refactorings can be detected only at coarser granularity, that is, changes spread across multiple commits. Herein, this type of refactoring is referred to as coarse-grained refactoring (CGR). We compared the refactorings detected on different granularities of commits from 19 open-source repositories. The results show that CGRs are common, and their frequency increases as the granularity becomes coarser. In addition, we found that Move-related refactorings tended to be the most frequent CGRs. We also analyzed the causes of CGR and suggested that CGRs will be valuable in refactoring research.
- Conference Article
5
- 10.1145/3196398.3196406
- May 28, 2018
When we use code repositories, each commit should include code changes for only a single task, and code changes for a single task should not be scattered over multiple commits. There are many studies on the former violation, often referred to as tangled commits, but the latter violation has been out of scope for MSR research. In this paper, we first investigate how many and what kinds of inappropriately partitioned commits exist in Java projects. Then, we propose a simple technique to detect such commits automatically. We also report evaluation results of the proposed technique.
- Conference Article
4
- 10.1109/msr.2019.00018
- May 1, 2019
Many techniques have been proposed for mining software repositories, predicting code quality and evaluating code changes. Prior work has established links between code ownership and churn metrics, and software quality at file and directory level based on changes that fix bugs. Other metrics have been used to evaluate individual code changes based on preceding changes that induce fixes. This paper combines the two approaches in an empirical study of assessing risk of code changes using established code ownership and churn metrics with fix inducing changes on a large proprietary code repository. We establish a machine learning model for change risk classification which achieves average precision of 0.76 using metrics from prior works and 0.90 using a wider array of metrics. Our results suggest that code ownership metrics can be applied in change risk classification models based on fix inducing changes.
- Conference Article
96
- 10.1145/3127005.3127016
- Nov 8, 2017
Background: Understanding maintenance activities performed in a source code repository could help practitioners reduce uncertainty and improve cost-effectiveness by planning ahead and pre-allocating resources towards source code maintenance. The research community uses 3 main classification categories for maintenance activities: Corrective: fault fixing; Perfective: system improvements; Adaptive: new feature introduction. Previous work in this area has mostly concentrated on evaluating commit classification (into maintenance activities) models in the scope of a single software project.
- Conference Article
- 10.1109/compsac.2017.155
- Jul 1, 2017
Each commit in repositories of version control systems should include code changes for only a single task. However, in real repositories, there are many commits for multiple tasks and tasks split into multiple commits. We call the latter split commits. In this research, we first investigate how many and what kinds of split commits are included in repositories. Then, we classify the found split commits into three categories. Based on the classification, we propose a new technique to detect split commits automatically. This is the first research that proposes a technique to detect split commits. To evaluate the proposed technique, we apply it to the repositories of two open-source software projects. The results show that the proposed technique detects split commits with high accuracy (precision is 0.8 and F-measure is 0.7).
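The abstract reports precision and F-measure but not recall. As a rough worked example, assuming that "F-measure" means the balanced F1 score and taking the rounded figures at face value, the implied recall can be recovered by solving the F1 definition for R:

```latex
F = \frac{2PR}{P + R}
\quad\Longrightarrow\quad
R = \frac{F\,P}{2P - F}
  = \frac{0.7 \times 0.8}{2 \times 0.8 - 0.7}
  = \frac{0.56}{0.9}
  \approx 0.62
```

Because both inputs are rounded to one decimal place, the true recall could differ noticeably from this estimate. The calculation only indicates that recall is lower than precision, meaning the technique would miss more split commits than it falsely reports.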
- Supplementary Content
2
- 10.5167/uzh-61703
- Jan 1, 2012
- Zurich Open Repository and Archive (University of Zurich)
Software development and, in particular, software maintenance are time consuming and require detailed knowledge of the structure and the past development activities of a software system. Limited resources and time constraints make the situation even more difficult. Therefore, a significant amount of research effort has been dedicated to learning software prediction models that allow project members to allocate and spend the limited resources efficiently on the (most) critical parts of their software system. Prominent examples are bug prediction models and change prediction models: Bug prediction models identify the bug-prone modules of a software system that should be tested with care; change prediction models identify modules that change frequently and in combination with other modules, i.e., they are change coupled. By combining statistical methods, data mining approaches, and machine learning techniques software prediction models provide a structured and analytical basis to make decisions. Researchers proposed a wide range of approaches to build effective prediction models that take into account multiple aspects of the software development process. They achieved especially good prediction performance, guiding developers towards those parts of their system where a large share of bugs can be expected. For that, they rely on change data provided by version control systems (VCS). However, due to the fact that current VCS track code changes only on file-level and textual basis, most of those approaches suffer from coarse-grained and rather generic change information. More fine-grained change information, for instance, at the level of source code statements, and the type of changes, e.g., whether a method was renamed or a condition expression was changed, are often not taken into account. 
Therefore, investigating the development process and the evolution of software at a fine-grained change level has recently experienced an increasing attention in research. The key contribution of this thesis is to improve software prediction models by using fine-grained source code changes. Those changes are based on the abstract syntax tree structure of source code and allow us to track code changes at the fine-grained level of individual statements. We show with a series of empirical studies using the change history of open-source projects how prediction models can benefit in terms of prediction performance and prediction granularity from the more detailed change information. First, we compare fine-grained source code changes and code churn, i.e., lines modified, for bug prediction. The results with data from the Eclipse platform show that fine-grained source code changes significantly outperform code churn when classifying source files into bug- and not bug-prone, as well as when predicting the number of bugs in source files. Moreover, these results give more insights about the relation of individual types of code changes, e.g., method declaration changes and bugs. For instance, in our dataset method declaration changes exhibit a stronger correlation with the number of bugs than class declaration changes. Second, we leverage fine-grained source code changes to predict bugs at method-level. This is beneficial as files can grow arbitrarily large. Hence, if bugs are predicted at the level of files a developer needs to manually inspect all methods of a file one by one until a particular bug is located. Third, we build models using source code properties, e.g., complexity, to predict whether a source file will be affected by a certain type of code change. 
Predicting the type of changes is of practical interest, for instance, in the context of software testing as different change types require different levels of testing: While for small statement changes local unit-tests are mostly sufficient, API changes, e.g., method declaration changes, might require system-wide integration-tests which are more expensive. Hence, knowing (in advance) which types of changes will most likely occur in a source file can help to better plan and develop tests, and, in case of limited resources, prioritize among different types of testing. Finally, to assist developers in bug triaging we compute prediction models based on the attributes of a bug report that can be used to estimate whether a bug will be fixed fast or whether it will take more time for resolution. The results and findings of this thesis give evidence that fine-grained source code changes can improve software prediction models to provide more accurate results.
- Research Article
57
- 10.1038/s41598-022-15231-5
- Jul 5, 2022
- Scientific Reports
This study investigated the usefulness of deep learning-based automatic detection of anterior disc displacement (ADD) from magnetic resonance imaging (MRI) of patients with temporomandibular joint disorder (TMD). Sagittal MRI images of 2520 TMJs were collected from 861 men and 399 women (average age 37.33 ± 18.83 years). A deep learning algorithm with a convolutional neural network was developed. Data augmentation and the Adam optimizer were applied to reduce the risk of overfitting the deep-learning model. The prediction performances were compared between the models and human experts based on areas under the curve (AUCs). The fine-tuning model showed excellent prediction performance (AUC = 0.8775) and acceptable accuracy (approximately 77%). Comparing the AUC values of the from-scratch (0.8269) and freeze models (0.5858) showed lower performances of the other models compared to the fine-tuning model. In Grad-CAM visualizations, the fine-tuning scheme focused more on the TMJ disc when judging ADD, and the sparsity was higher than that of the from-scratch scheme (84.69% vs. 55.61%, p < 0.05). The three fine-tuned ensemble models using different data augmentation techniques showed a prediction accuracy of 83%. Moreover, the AUC values of ADD were higher when patients with TMD were divided by age (0.8549–0.9275) and sex (male: 0.8483, female: 0.9276). While the accuracy of the ensemble model was higher than that of human experts, the difference was not significant (p = 0.1987–0.0671). Learning from pre-trained weights allowed the fine-tuning model to outperform the from-scratch model. Another benefit of the fine-tuning model for diagnosing ADD of TMJ in Grad-CAM analysis was the deactivation of unwanted gradient values to provide clearer visualizations compared to the from-scratch model. The Grad-CAM visualizations also agreed with the model learned through important features in the joint disc area. 
The accuracy was further improved by an ensemble of three fine-tuning models using diversified data. The main benefits of this model were the higher specificity compared to human experts, which may be useful for preventing true negative cases, and the maintenance of its prediction accuracy across sexes and ages, suggesting a generalized prediction.
- Research Article
35
- 10.1142/s2424922x21410023
- Jul 1, 2021
- Advances in Data Science and Adaptive Analysis
In every educational institution, predicting pupils’ performance is a vital responsibility. For this reason, a variety of data mining techniques, such as clustering, classification, and regression, are applied to anticipate learners’ study behavior. With the arrival of machine learning, forecasting students’ academic achievement has become vital, and the topic attracts significant attention within the scientific community. In addition, the findings from this work have tremendous socio-economic consequences. One area of major research in the world of education today is educational data mining, which is the study of techniques to reveal hidden patterns in educational data. Data mining strategies succeed or fail depending on the type and quality of the data being mined. Here, we provide a novel method that enhances the accuracy of prior student performance prediction by identifying and providing an explanation as to why it is rising. Using our robust machine learning ensemble models, we propose and evaluate a prediction model. The findings demonstrate that our CatBoost model, an ensemble machine learning model, is superior to standard machine learning models, with an accuracy of 92.27%. The new model proved dependable through the use of SMOTE and hyperparameter optimization, both of which were valuable. Additional features are significant as well. More critically, a unique method is utilized to increase model transparency. The SHAP values are a valuable part of the student performance prediction system, which we think should be integrated. For educators tasked with using prediction models in education, we have found a preference for models that offer both useful insights and easy-to-understand predictions; with our approach, educators can identify students who are at risk early and encourage them in a positive way.
- Research Article
41
- 10.1007/s10664-018-9676-8
- Jan 2, 2019
- Empirical Software Engineering
Change-based code review is a software quality assurance technique that is widely used in practice. Therefore, better understanding what influences performance in code reviews and finding ways to improve it can have a large impact. In this study, we examine the association of working memory capacity and cognitive load with code review performance and we test the predictions of a recent theory regarding improved code review efficiency with certain code change part orders. We perform a confirmatory experiment with 50 participants, mostly professional software developers. The participants performed code reviews on one small and two larger code changes from an open source software system to which we had seeded additional defects. We measured their efficiency and effectiveness in defect detection, their working memory capacity, and several potential confounding factors. We find that there is a moderate association between working memory capacity and the effectiveness of finding delocalized defects, influenced by other factors, whereas the association with other defect types is almost non-existing. We also confirm that the effectiveness of reviews is significantly larger for small code changes. We cannot conclude reliably whether the order of presenting the code change parts influences the efficiency of code review.