Large Language Models for Code Translation: An In-Depth Analysis of Code Smells and Functional Correctness

Abstract

The conversion of program code from a source programming language (PL) to a target PL is known as code translation and has wide applicability. Since Large Language Models (LLMs) have shown remarkable performance across many application fields, research has turned to LLMs to mitigate the shortcomings of traditional code translation approaches. However, the existing literature mainly focuses on code correctness and falls short of investigating the resulting code quality. Hence, we conduct an in-depth analysis of the code smells and code correctness of LLM-based code translations to fill this gap. We consider numerous LLMs, datasets, PLs, and prompts, and reveal that prompt selection may have a statistically significant impact on an LLM’s performance. Our analyses further indicate that code quality can be considered a performance dimension largely independent of code correctness. Moreover, exploiting an LLM’s non-determinism, an iterative repair approach, and the collaboration of LLMs may enhance performance if used accordingly. Surprisingly, we find that a backtranslation approach poses a viable way to mitigate code quality issues in the source, and that LLMs appear to reproduce code smells learned during training.
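For illustration, the backtranslation idea mentioned in the abstract can be sketched as a simple round trip; `translate` is a hypothetical LLM-backed callable, not the setup used in the paper:

```python
# Minimal backtranslation sketch: translate out of the source PL and
# back again; the round trip gives the LLM a chance to rewrite away
# quality issues in the original source. `translate` is a placeholder
# standing in for an LLM translation request.
def backtranslate(source: str, translate, src_pl: str, tgt_pl: str) -> str:
    target = translate(source, src_pl, tgt_pl)   # source PL -> target PL
    return translate(target, tgt_pl, src_pl)     # target PL -> source PL
```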

Similar Papers
  • Research Article
  • 10.2478/acss-2025-0013
Analysing Software Quality of AI-Translated Code: A Comparative Study of Large Language Models Using Static Analysis
  • Jan 1, 2025
  • Applied Computer Systems
  • Vikram Bhutani + 2 more

Context: Source code translation enables cross-platform compatibility, code reusability, legacy system migration, and developer collaboration. Numerous state-of-the-art techniques have emerged to address the demand for efficient and accurate translation methodologies. Objective: This study compares the code translation capabilities of Large Language Models (LLMs), specifically DeepSeek R1 and ChatGPT 4.1, evaluating their proficiency in translating code between programming languages. We systematically assess model outputs through quantitative and qualitative measures, focusing on translation accuracy, execution efficiency, and coding standard conformity. By examining each model’s strengths and limitations, this work provides insights into their applicability for various translation scenarios and contributes to the discourse on LLM potential in software engineering. Method: We evaluated translation quality from ChatGPT 4.1 and DeepSeek R1 using SonarQube Analyzer to identify strengths and weaknesses through comprehensive software metrics, including translation accuracy, code quality, and clean code attributes. SonarQube’s framework enables objective quantification of maintainability, reliability, technical debt, and code smells, which are critical factors in software quality measurement. The protocol involved randomly sampling 500 code instances from 1695 Java programming problems. Java samples were translated to Python by both models, then analysed quantitatively using SonarQube metrics to evaluate adherence to software engineering best practices. Results: This comparative analysis reveals the capabilities and limitations of state-of-the-art LLM-based translation systems, providing developers, researchers, and practitioners with actionable guidance for model selection. Identified gaps highlight future research directions in automated code translation. Results demonstrate that DeepSeek R1 consistently generates superior software quality compared to ChatGPT 4.1 across SonarQube metrics.

  • Research Article
  • Cited by 3
  • 10.1145/3728963
Can LLMs Replace Human Evaluators? An Empirical Study of LLM-as-a-Judge in Software Engineering
  • Jun 22, 2025
  • Proceedings of the ACM on Software Engineering
  • Ruiqi Wang + 5 more

Recently, large language models (LLMs) have been deployed to tackle various software engineering (SE) tasks like code generation, significantly advancing the automation of SE tasks. However, assessing the quality of LLM-generated code and text remains challenging. The commonly used Pass@k metric necessitates extensive unit tests and configured environments, incurs high labor costs, and is not suitable for evaluating LLM-generated text. Conventional metrics like BLEU, which measure only lexical rather than semantic similarity, have also come under scrutiny. In response, a new trend has emerged to employ LLMs for automated evaluation, known as LLM-as-a-judge. These LLM-as-a-judge methods are claimed to better mimic human assessment than conventional metrics without relying on high-quality reference answers. Nevertheless, their exact human alignment in SE tasks remains unexplored. In this paper, we empirically explore LLM-as-a-judge methods for evaluating SE tasks, focusing on their alignment with human judgments. We select seven LLM-as-a-judge methods that utilize general-purpose LLMs, alongside two LLMs specifically fine-tuned for evaluation. After generating and manually scoring LLM responses on three recent SE datasets of code translation, code generation, and code summarization, we then prompt these methods to evaluate each response. Finally, we compare the scores generated by these methods with human evaluation. The results indicate that output-based methods reach the highest Pearson correlations of 81.32 and 68.51 with human scores in code translation and generation, achieving near-human evaluation and noticeably outperforming ChrF++, one of the best conventional metrics, at 34.23 and 64.92. Such output-based methods prompt LLMs to output judgments directly, and exhibit more balanced score distributions that resemble human score patterns.
Finally, we provide insights and implications, concluding that current state-of-the-art LLM-as-a-judge methods can potentially replace human evaluations in certain SE tasks.
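The Pass@k metric discussed above is typically computed with the unbiased estimator popularized by the HumanEval benchmark; a minimal sketch:

```python
import math

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: probability that at least one of k samples,
    drawn without replacement from n generations of which c pass the
    unit tests, is correct."""
    if n - c < k:  # too few failing samples to fill k draws: success is certain
        return 1.0
    return 1.0 - math.comb(n - c, k) / math.comb(n, k)

# 10 generations, 3 passing: pass@1 equals the raw pass rate
print(round(pass_at_k(10, 3, 1), 4))  # 0.3
```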

  • Research Article
  • Cited by 5
  • 10.1145/3715908
Large Language Model-Aware In-Context Learning for Code Generation
  • Feb 28, 2025
  • ACM Transactions on Software Engineering and Methodology
  • Chongyang Tao + 5 more

Large Language Models (LLMs) have shown impressive In-Context Learning (ICL) ability in code generation. LLMs take a prompt context consisting of a few demonstration examples and a new requirement as input, and output new programs without any parameter update. Existing studies have found that the performance of ICL-based code generation heavily depends on the quality of the demonstration examples, which has spurred research on selecting them: given a new requirement, a few demonstration examples are selected from a candidate pool, and LLMs are expected to learn the pattern hidden in these selected examples. Existing approaches are mostly based on heuristics or random selection. However, the distribution of randomly selected examples usually varies greatly, making the performance of LLMs less robust. The heuristics retrieve examples by considering only the textual similarity of requirements, leading to sub-optimal performance. To fill this gap, we propose a Large language model-Aware selection approach for In-context-Learning-based code generation, named LAIL. LAIL uses LLMs themselves to select examples. It requires LLMs themselves to label a candidate example as a positive or negative example for a requirement. Positive examples help LLMs generate correct programs, while negative examples are trivial and should be ignored. Based on the labeled positive and negative data, LAIL trains a model-aware retriever to learn the preference of LLMs and select the demonstration examples that LLMs need. During inference, given a new requirement, LAIL uses the trained retriever to select a few examples and feeds them into LLMs to generate the desired programs. We apply LAIL to four widely used LLMs and evaluate it on five code generation datasets.
Extensive experiments demonstrate that LAIL outperforms the state-of-the-art (SOTA) baselines by 11.58%, 3.33%, and 5.07% on CodeGen-Multi-16B, 1.32%, 2.29%, and 1.20% on CodeLlama-34B, and achieves 4.38%, 2.85%, and 2.74% improvements on Text-davinci-003 in terms of Pass@1 at MBJP, MBPP, and MBCPP, respectively. In addition to function-level code generation, LAIL improves the performance of LLMs on DevEval, a repository-level code generation dataset, which achieves 10.04%, 8.12%, and 4.63% improvements compared to the SOTA baselines at Pass@1, 3, and 5 on CodeLlama-7B. Human evaluation further verifies that the generated programs of LAIL are superior in correctness, code quality, and maintainability. Besides, LAIL has satisfactory transferability across different LLMs and datasets, where the retriever learned on one LLM (dataset) can be transferred to other LLMs (datasets).

  • Research Article
  • 10.54254/2755-2721/2025.ld29138
A Systematic Study of LLM-Based Code Translation from Multiple Perspectives
  • Nov 5, 2025
  • Applied and Computational Engineering
  • Jingxuan Yu

Large Language Models (LLMs) have demonstrated powerful capabilities in code analysis, exhibiting a deep understanding of code semantics and functionality. Programming code lies at the heart of software development, and the automation and intelligent generation of code can effectively shorten development cycles and reduce labor costs. Current research on code transformation using Large Language Models is gradually emerging. However, these works vary in research perspective, object, and goal, making it difficult to comprehensively evaluate the advantages and characteristics of Large Language Models in code translation tasks. Moreover, existing code translation primarily focuses on simple code translation for explicit tasks and remains incomplete for the code translation of complex software systems. Therefore, this paper analyzes the inherent characteristics of Large Language Models in code translation based on their working mechanisms and characteristics. It comprehensively investigates research on intelligent code generation using Large Language Models over the past two years, particularly examining effectiveness for complex generation tasks and relevant technologies from both task-oriented and technical perspectives. During this process, the impact of prompt engineering methods in code translation is specifically examined. Through systematic analysis and research, it has been found that Large Language Model systems like ChatGPT [1] are effective for code translation tasks with clear objectives. However, they still exhibit room for improvement in handling complex tasks, such as the inability to accurately translate code with tightly coupled contextual logic and the failure to generate code for complex software-system translation.

  • Research Article
  • Cited by 1
  • 10.1145/3728940
ClassEval-T: Evaluating Large Language Models in Class-Level Code Translation
  • Jun 22, 2025
  • Proceedings of the ACM on Software Engineering
  • Pengyu Xue + 11 more

In recent years, Large Language Models (LLMs) have dramatically advanced the performance of automated code translation, pushing computational accuracy scores above 80% on many previous benchmarks. However, most code samples in these benchmarks are short, standalone, statement/method-level, and algorithmic, which is not aligned with practical coding tasks. Therefore, the actual capability of LLMs in translating code samples written for daily development is still unknown. To this end, we construct a class-level code translation benchmark, ClassEval-T, and make the first attempt to extensively assess recent LLMs' performance on class-level code translation. ClassEval-T is extended from ClassEval, a well-known class-level Python code generation benchmark consisting of multiple practical coding topics, such as database operation and game design, and diverse contextual dependencies (e.g., fields, methods, and libraries). It cost us 360 person-hours to accomplish the manual migration to Java and C++ with complete code samples and associated test suites. Subsequently, we design three translation strategies (i.e., holistic, min-dependency, and standalone) for class-level code translation and evaluate eight recent LLMs of commercial, general, and code kinds in diverse families and sizes on ClassEval-T. Experimental results demonstrate a remarkable performance drop compared with the most widely studied method-level code translation benchmark, and obvious discrepancies among LLMs appear, showing the effectiveness of ClassEval-T in measuring recent LLMs. Afterwards, we further discuss the usage scenarios for diverse translation strategies and LLMs' dependency awareness when translating class samples. Finally, 1,243 failure cases made by the best-performing LLM under test are thoroughly analyzed and categorized in this paper for practical guidance and future enlightenment.

  • Research Article
  • Cited by 24
  • 10.1145/3660778
Exploring and Unleashing the Power of Large Language Models in Automated Code Translation
  • Jul 12, 2024
  • Proceedings of the ACM on Software Engineering
  • Zhen Yang + 9 more

Code translation tools, namely transpilers, are developed for automatic source-to-source translation. Latest learning-based transpilers have shown impressive enhancement against rule-based counterparts in both translation accuracy and readability, owing to their task-specific pre-training on extensive monolingual corpora. Nevertheless, their current performance still remains unsatisfactory for practical deployment, and the associated training resources are also prohibitively expensive. Large Language Models (LLMs), pre-trained on huge amounts of human-written code/text, have shown remarkable performance in many code intelligence tasks due to their powerful generality, even without task-specific re-training/fine-tuning. Thus, LLMs can potentially circumvent the above limitations, but they have not been exhaustively explored yet. This paper investigates diverse LLMs and learning-based transpilers for automated code translation tasks, finding that: although certain LLMs have outperformed current transpilers, they still have some accuracy issues, where most of the failures are induced by a lack of comprehension of source programs (38.51%), missing clear instructions on I/O types in translation (14.94%), and ignoring discrepancies between source and target programs (41.38%). Enlightened by the above findings, we further propose UniTrans, a Unified code Translation framework, applicable to various LLMs, for unleashing their power in this field. Specifically, UniTrans first crafts a series of test cases for target programs with the assistance of source programs. Next, it harnesses the above auto-generated test cases to augment the code translation and then evaluate their correctness via execution. Afterward, UniTrans further (iteratively) repairs incorrectly translated programs prompted by test case execution results. Extensive experiments are conducted on six settings of translation datasets between Python, Java, and C++. Three recent LLMs of diverse sizes, including GPT-3.5 and LLaMA-13B/7B, are tested with UniTrans, and all achieve substantial improvements in terms of computational accuracy and exact match accuracy among almost all translation settings, showing the universal effectiveness of UniTrans in practice.
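As described, UniTrans crafts test cases, checks translations by execution, and iteratively repairs failures. That loop can be sketched schematically as follows; the callables `llm`, `gen_tests`, and `run_tests` are hypothetical stand-ins for the paper's components, not its actual interfaces:

```python
# Schematic translate/test/repair loop in the spirit of UniTrans.
# All callables are placeholders supplied by the caller.
def translate_with_repair(source: str, llm, gen_tests, run_tests,
                          max_rounds: int = 3) -> str:
    tests = gen_tests(source)                 # test cases derived from the source program
    target = llm(f"Translate to the target PL:\n{source}\nTests:\n{tests}")
    for _ in range(max_rounds):
        ok, report = run_tests(target, tests)  # evaluate correctness via execution
        if ok:
            break
        # feed the failure report back for iterative repair
        target = llm(f"Fix this translation:\n{target}\nErrors:\n{report}")
    return target
```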

  • Research Article
  • Cited by 3
  • 10.1145/3770084
A Survey on LLM-based Code Generation for Low-Resource and Domain-Specific Programming Languages
  • Oct 7, 2025
  • ACM Transactions on Software Engineering and Methodology
  • Sathvik Joel + 2 more

Large Language Models (LLMs) have shown remarkable capabilities in code generation for popular programming languages. However, their performance in Low-Resource Programming Languages (LRPLs) and Domain-Specific Languages (DSLs) remains a critical challenge. This gap affects millions of developers (Rust alone has 3.5 million users) who are currently unable to fully leverage LLM capabilities. LRPLs and DSLs face unique challenges, including severe data scarcity and, for DSLs, highly specialized syntax and semantics that are poorly represented in general-purpose datasets. Addressing these challenges is crucial, as LRPLs and DSLs significantly enhance development efficiency in specialized domains and applications, including financial and scientific work. While several surveys on LLMs for software engineering and code exist, none comprehensively address the challenges and opportunities specific to LRPLs and DSLs. Our survey fills this gap by providing a systematic review of the current state, methodologies, and challenges in leveraging LLMs for code generation in LRPLs and DSLs. We filtered 111 papers from over 27,000 studies published from 2020 to 2024 to understand the capabilities and limitations of LLMs in these specialized domains. We also expanded our literature search to include 5 recent papers from 2024 to 2025. We report LLMs used, benchmarks, and metrics to evaluate code generation in LRPLs and DSLs, as well as strategies used to enhance LLM performance, and the collected datasets and curation methods in this context. We identified four main evaluation techniques used in the literature, along with several metrics to assess code generation in LRPLs and DSLs. We categorized the methods used for LLM improvement into six main groups and summarized the novel methods and architectures proposed by the researchers. We also classified different approaches used for data collection and preparation.
While different techniques, metrics, and datasets are used, there is a lack of a standard approach and a benchmark dataset to evaluate code generation in several LRPLs and DSLs. We discuss several distinctions of the studied approaches with the ones used in high-resource programming languages (HRPLs), as well as several challenges unique to these languages, especially DSLs. The challenges stem from the scarcity of data, the unique requirements, and specialized domains, which often need expertise guidelines or domain-specific tools. Accordingly, we provide insights into different research opportunities for the studied aspects. This survey serves as a comprehensive resource for researchers and practitioners working at the intersection of LLMs, software engineering, and specialized programming languages, providing a foundation for future advancements in LRPL and DSL code generation. A GitHub repository was created to organize the papers of this survey at https://github.com/jie-jw-wu/Survey-CodeLLM4LowResource-DSL .

  • Research Article
  • Cited by 12
  • 10.1145/3643762
CORE: Resolving Code Quality Issues using LLMs
  • Jul 12, 2024
  • Proceedings of the ACM on Software Engineering
  • Nalin Wadhwa + 7 more

As software projects progress, the quality of code assumes paramount importance, as it affects the reliability, maintainability and security of software. For this reason, static analysis tools are used in developer workflows to flag code quality issues. However, developers need to spend extra effort to revise their code based on the tool findings. In this work, we investigate the use of (instruction-following) large language models (LLMs) to assist developers in revising code to resolve code quality issues. We present a tool, CORE (short for COde REvisions), architected as a pair of LLMs comprising a proposer and a ranker. Providers of static analysis tools recommend ways to mitigate the tool warnings, and developers follow them to revise their code. The proposer LLM of CORE takes the same set of recommendations and applies them to generate candidate code revisions. The candidates which pass the static quality checks are retained. However, the LLM may introduce subtle, unintended functionality changes which may go undetected by the static analysis. The ranker LLM evaluates the changes made by the proposer using a rubric that closely follows the acceptance criteria a developer would enforce. CORE uses the scores assigned by the ranker LLM to rank the candidate revisions before presenting them to the developer.
We conduct a variety of experiments on two public benchmarks to show the ability of CORE: (1) to generate code revisions acceptable to both static analysis tools and human reviewers (the latter evaluated with a user study on a subset of the Python benchmark), (2) to reduce human review effort by detecting and eliminating revisions with unintended changes, (3) to readily work across multiple languages (Python and Java), static analysis tools (CodeQL and SonarQube) and quality checks (52 and 10 checks, respectively), and (4) to achieve a fix rate comparable to a rule-based automated program repair tool but with much smaller engineering effort (on the Java benchmark). CORE could revise 59.2% of Python files (across 52 quality checks) so that they pass scrutiny by both a tool and a human reviewer. The ranker LLM reduced false positives by 25.8% in these cases. CORE produced revisions that passed the static analysis tool in 76.8% of Java files (across 10 quality checks), comparable to the 78.3% of a specialized program repair tool, with significantly less engineering effort. We release code, data, and supplementary material publicly at http://aka.ms/COREMSRI.
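The proposer/ranker duo described above amounts to a generate-filter-rank pipeline; a minimal sketch, where `propose`, `static_check`, and `rank` are hypothetical stand-ins for CORE's LLM and static-analysis components:

```python
# Generate-filter-rank pipeline in the spirit of CORE; all callables
# are placeholders, not the tool's actual API.
def revise(code: str, recommendation: str, propose, static_check, rank,
           n_candidates: int = 5) -> list[str]:
    """The proposer LLM drafts candidate revisions from the static-analysis
    recommendation; candidates failing the static quality check are dropped;
    the ranker LLM orders the survivors by an acceptance rubric."""
    candidates = [propose(code, recommendation) for _ in range(n_candidates)]
    passing = [c for c in candidates if static_check(c)]
    return sorted(passing, key=rank, reverse=True)
```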

  • Research Article
  • Cited by 2
  • 10.34190/icair.4.1.3128
LLM Supply Chain Provenance: A Blockchain-based Approach
  • Dec 4, 2024
  • International Conference on AI Research
  • Shridhar Singh + 1 more

The burgeoning size and complexity of Large Language Models (LLMs) introduce significant challenges in ensuring data integrity. The proliferation of "deep fakes" and manipulated information raises concerns about the vulnerability of LLMs to misinformation. Traditional LLM architectures often lack robust mechanisms for tracking the origin and history of training data. This opaqueness can leave LLMs susceptible to manipulation by malicious actors who inject biased or inaccurate data. This research proposes a novel approach integrating Blockchain Technology (BCT) within the LLM data supply chain. With its core principle of a distributed and immutable ledger, BCT offers a compelling solution to address this challenge. By storing the LLM's data supply chain on a blockchain, we establish a verifiable record of data provenance. This allows for tracing the origin of each data point used to train the LLM, fostering greater transparency and trust in the model's outputs. This decentralised approach minimises the risk of single points of failure and manipulation. Additionally, the immutability of blockchain records ensures that the data provenance remains tamper-proof, further enhancing the trustworthiness of the LLM. Our approach leverages three critical features of BCT to strengthen LLM security: 1) Transaction Anonymity: While data provenance is recorded on the blockchain, identities of data contributors can be anonymised, protecting their privacy while ensuring data integrity. 2) Decentralised Repository: Enhances the system's resilience against potential attacks by distributing the data provenance record across the blockchain network. 3) Block Validation: Rigorous consensus mechanisms ensure the validity of each data point added to the LLM's data supply chain - minimising the risk of incorporating inaccurate or manipulated data into the training process. 
Initial evaluations using simulated LLM training data on a blockchain platform demonstrate the feasibility and effectiveness of the proposed approach in enhancing data integrity. This approach has far-reaching implications for ensuring the trustworthiness of LLMs in various applications.

  • Research Article
  • Cited by 1
  • 10.47363/jaicc/2023(2)442
AI-Powered Code Generation Evaluating the Effectiveness of Large Language Models (LLMs) in Automated Software Development
  • Mar 31, 2023
  • Journal of Artificial Intelligence & Cloud Computing
  • Ravikanth Konda

The rapid evolution of Artificial Intelligence (AI) has brought about significant advancements in multiple domains, including software development. One of the most promising innovations is AI-powered code generation through Large Language Models (LLMs), such as OpenAI’s GPT-3 and GPT-4. These models, trained on large amounts of programming data, can produce human-readable code from natural language inputs, offering significant potential for simplifying and optimizing software development processes. This paper analyzes the performance of LLMs in automated software development by testing them on a variety of tasks such as code generation, debugging, and software optimization. The research explores both the strengths and weaknesses of these models in terms of key indicators such as code quality, generation time, and code maintainability. We observe that although LLMs hold immense potential to automate mundane programming tasks and enhance developer productivity, they still struggle with more intricate, domain-specific programming tasks that require a higher level of understanding, for example, designing architectures and top-level decision-making. In spite of such shortcomings, LLMs can substantially enhance software development processes, particularly for small-scale projects, or act as assistants for more senior developers. The paper concludes by reflecting on the potential of LLMs to transform software development in the future, while stressing that their reliability, code quality, and security must improve before they can be applied to larger, more critical uses.

  • Research Article
  • 10.1158/1557-3265.aimachine-b021
Abstract B021: Current oncological large language model research lacks reproducibility, transparency, and long term support
  • Jul 10, 2025
  • Clinical Cancer Research
  • Tolou Shadbahr + 3 more

Large Language Models (LLMs) have been adopted increasingly in oncology, for example, in structuring data from clinical notes, inferring diagnoses from free text or imaging data, and anonymizing data. Due to the rapid development pace of LLMs, best practices for conducting and reporting oncological research in these applications have yet to be fully established. We queried PubMed for oncology-related LLM research with the last cutoff set at Dec 31st 2024. We investigated 179 papers. Of these, 131 were removed due to exclusion criteria, and 48 were structured and reported here. Inclusion criteria were oncology-related research and full research articles. Structured fields included date of submission, acceptance, and publishing, the granularity of model reporting (model family, model snapshot), reporting of key LLM model parameters, availability of source code and data, and programming language and API details. We noted an almost exponential growth of LLM-related publications in oncology, with a relatively short time from authors’ submission to publicly available publication (median 3.7 months, IQR 2.5-5.9 months). Interestingly, despite the relatively short processing time, in 25% of cases, the exact model essential to the publication had been deprecated by the model service providers or a newer version was available at the time of publishing. 35.4% of published research relied solely on a graphical user interface (GUI) of LLMs such as ChatGPT, while 37.5% reported programmatic API use, with Python as the most common language. While most publications either fully or partially reported the utilized prompts (75%), only 22.9% reported the exact key model parameters, such as temperature. Even when the temperature parameter was available, 45.4% of these publications used a temperature value larger than 0, resulting in more stochastic answers.
Source code was made publicly available in 18.7% of publications that reported using a programming language such as Python or R. While practically all publications (97.9%) reported the used model families such as GPT-4o, Claude 3.5 Sonnet or Llama 3-70B, only 27% reported the exact model snapshot usage such as GPT-4o with snapshot options available for May 13th, August 6th or November 20th in 2024. We exemplify and report shortcomings of recent LLM adoption in oncological research. To alleviate these issues, we propose a checklist to improve reproducibility, transparency, and longevity of LLM research directed at researchers and journals. We propose the following preliminary checklist: exact reporting of model snapshot and model parameter bound to a specific snapshot instead of latest release, API usage instead of GUI chatbots, temperature-parameter equal to 0, assessment of variability across runs, session restarts to avoid biases, and caution in researching models that are bound to be deprecated due to the short turn-around time in LLMs. Additionally, rigorous prompt engineering and especially few-shot learning show potential in optimizing interactions with LLMs, also in oncology. Citation Format: Tolou Shadbahr, Antti S. Rannikko, Tuomas Mirtti, Teemu D. Laajala. Current oncological large language model research lacks reproducibility, transparency, and long term support [abstract]. In: Proceedings of the AACR Special Conference in Cancer Research: Artificial Intelligence and Machine Learning; 2025 Jul 10-12; Montreal, QC, Canada. Philadelphia (PA): AACR; Clin Cancer Res 2025;31(13_Suppl):Abstract nr B021.

  • Research Article
  • Cited by 2
  • 10.1145/3729379
AlphaTrans: A Neuro-Symbolic Compositional Approach for Repository-Level Code Translation and Validation
  • Jun 19, 2025
  • Proceedings of the ACM on Software Engineering
  • Ali Reza Ibrahimzada + 6 more

Code translation transforms programs from one programming language (PL) to another. One prominent use case is application modernization to enhance maintainability and reliability. Several rule-based transpilers have been designed to automate code translation between different pairs of PLs. However, the rules can become obsolete as the PLs evolve and cannot generalize to other PLs. Recent studies have explored the automation of code translation using Large Language Models (LLMs). One key observation is that such techniques may work well for crafted benchmarks but fail to generalize to the scale and complexity of real-world projects with inter- and intra-class dependencies, custom types, PL-specific features, etc. We propose AlphaTrans, a neuro-symbolic approach to automate repository-level code translation. AlphaTrans translates both source and test code, and employs multiple levels of validation to ensure the translation preserves the functionality of the source program. To break down the problem for LLMs, AlphaTrans leverages program analysis to decompose the program into fragments and translates them in reverse call order. We leveraged AlphaTrans to translate ten real-world open-source projects consisting of 836 (application and test) classes, 8575 (application and test) methods, and 2719 unit tests. AlphaTrans breaks down these projects into 17874 fragments and translates the entire repository. 96.40% of the translated fragments are syntactically correct, and AlphaTrans validates the translations’ runtime behavior and functional correctness for 27.03% and 25.14% of the application method fragments. On average, integrated translation and validation takes 34 hours (min=3, max=121) to translate a project, showing its scalability in practice. For the syntactically or semantically incorrect translations, AlphaTrans generates a report including existing translation, stack trace, test errors, or assertion failures.
We provided these artifacts to two developers to fix the translation bugs in four projects. They fixed the issues in 20.1 hours on average (5.5 hours for the smallest project and 34 hours for the largest) and achieved all passing tests. Without AlphaTrans, translating and validating such large projects could take weeks, if not months.
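The reverse-call-order idea described above can be sketched in a few lines (an illustrative simplification, not AlphaTrans's actual implementation; the function and graph names are hypothetical): if each fragment's callees are translated before the fragment itself, every dependency already has a translation when the LLM processes its caller. This is a topological sort of the call graph with edges pointing from caller to callee.

```python
from graphlib import TopologicalSorter

def reverse_call_order(call_graph):
    """Order fragments so that callees come before their callers.

    call_graph maps each fragment to the set of fragments it calls.
    TopologicalSorter treats the mapped-to nodes as predecessors, so
    feeding it caller -> callees yields leaf fragments first.
    """
    return list(TopologicalSorter(call_graph).static_order())

# Hypothetical call graph: main calls parse and run; run calls parse.
graph = {"main": {"parse", "run"}, "run": {"parse"}, "parse": set()}
order = reverse_call_order(graph)
assert order.index("parse") < order.index("run") < order.index("main")
```

A real pipeline would additionally have to break cycles in the call graph (mutual recursion) before such an ordering exists; `TopologicalSorter` raises `CycleError` in that case.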

  • Research Article
  • 10.55041/ijsrem17792
Redefining Software Development: Fine-Tuning Generative AI and Large Language Models for Intelligent Automation
  • Feb 19, 2023
  • INTERANTIONAL JOURNAL OF SCIENTIFIC RESEARCH IN ENGINEERING AND MANAGEMENT
  • Subhasis Kundu

This study explores the transformative impact of Generative AI and Large Language Models (LLMs) on software development by leveraging intelligent automation. It delves into sophisticated methods for refining LLMs to enhance code generation, improve adaptive learning abilities, and support autonomous software engineering processes [1] [2]. This study investigates how these technologies can be integrated into current development workflows to tackle issues such as code quality, scalability, and ethical concerns. Innovative strategies to boost model performance have been introduced, such as targeted data augmentation and domain-specific pre-training. The results showed notable advancements in the accuracy, efficiency, and adaptability of code generation across various programming languages and frameworks. Finally, the study discusses the implications of these developments for future software development and outlines a roadmap for further research and industrial implementation. Keywords — Generative AI, Large Language Models, Intelligent Automation, Software Development, Code Generation, Adaptive Learning, Autonomous Engineering, Data Augmentation, Domain-Specific Pre-training, Transfer Learning, Code Quality, Ethical Considerations.

  • Research Article
  • 10.7256/2454-0714.2024.4.72022
The Role of LLM in Next-Generation Integrated Development Environments
  • Apr 1, 2024
  • Программные системы и вычислительные методы
  • Azizkhon Yunushon Ishankhonov + 3 more

This article examines the role of Large Language Models (LLMs) in new-generation integrated development environments (IDEs). Tools such as GitHub Copilot, IntelliCode, and Alice Code Assistant are explored in the context of their use in programming. The authors examine how LLMs enable the automation of key development tasks, including code autocompletion, error detection, refactoring, and code generation, resulting in increased development efficiency and improved code quality. Special emphasis is placed on how LLMs affect developers' cognitive processes, such as problem-solving abilities, creativity, and professional skills. The study reviews existing integrated development environments that utilize large language models, evaluating LLM functionality for code autocompletion, fragment generation, and error detection and correction. Comparative methods were applied to assess the effectiveness of LLMs relative to traditional development tools. Special attention was paid to analyzing the cognitive load caused by the use of LLMs and assessing their impact on the creative process. The novelty of the research lies in its comprehensive analysis of LLM application in modern IDEs and in revealing their potential for increasing developers' productivity and improving the quality of program code. It is concluded that integrating LLMs into IDEs not only speeds up code creation but also considerably improves code quality through intelligent support and automation of routine tasks. However, while the benefits of integrating LLMs into IDEs are clear, limitations related to cognitive load, ethical issues, data security, and the need to maintain a balance between automation and the development of programmers' skills are also identified.

  • Research Article
  • Cite Count Icon 3
  • 10.1109/tse.2024.3504286
On Inter-Dataset Code Duplication and Data Leakage in Large Language Models
  • Jan 1, 2025
  • IEEE Transactions on Software Engineering
  • José Antonio Hernández López + 4 more

Motivation. Large language models (LLMs) have exhibited remarkable proficiency in diverse software engineering (SE) tasks, such as code summarization, code translation, and code search. Handling such tasks typically involves acquiring foundational coding knowledge on large, general-purpose datasets during a pre-training phase, and subsequently refining on smaller, task-specific datasets as part of a fine-tuning phase. Problem statement. Data leakage, i.e., using information from the test set during model training, is a well-known issue in training machine learning models. A manifestation of this issue is the intersection of the training and testing splits. While intra-dataset code duplication examines this intersection within a given dataset and has been addressed in prior research, inter-dataset code duplication, which gauges the overlap between different datasets, remains largely unexplored. If this phenomenon exists, it could compromise the integrity of LLM evaluations because of the inclusion of fine-tuning test samples that were already encountered during pre-training, resulting in inflated performance metrics. Contribution. This paper explores the phenomenon of inter-dataset code duplication and its impact on evaluating LLMs across diverse SE tasks. Study design. We conduct an empirical study using the CODESEARCHNET dataset (CSN), a widely adopted pre-training dataset, and five fine-tuning datasets used for various SE tasks. We first identify the intersection between the pre-training and fine-tuning datasets using a deduplication process. Next, we pre-train two versions of LLMs using a subset of CSN: one leaky LLM, which includes the identified intersection in its pre-training set, and one non-leaky LLM that excludes these samples. Finally, we fine-tune both models and compare their performances using fine-tuning test samples that are part of the intersection. Results.
Our findings reveal a potential threat to the evaluation of LLMs across multiple SE tasks, stemming from the inter-dataset code duplication phenomenon. We also demonstrate that this threat is accentuated by the chosen fine-tuning technique. Furthermore, we provide evidence that open-source models such as CODEBERT, GRAPHCODEBERT, and UNIXCODER could be affected by inter-dataset duplication. Based on our findings, we delve into prior research that may be susceptible to this threat. Additionally, we offer guidance to SE researchers on strategies to prevent inter-dataset code duplication.
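The intersection-finding step that such a study relies on can be sketched as follows (a simplified illustration under assumed names, not the paper's actual deduplication pipeline): normalize each code sample so that trivial formatting differences disappear, hash the result, and flag fine-tuning test samples whose fingerprints also occur in the pre-training corpus.

```python
import hashlib
import re

def normalize(code: str) -> str:
    """Crude normalization: drop line comments and collapse whitespace,
    so trivially reformatted duplicates hash identically."""
    code = re.sub(r"#.*", "", code)
    return " ".join(code.split())

def fingerprint(code: str) -> str:
    return hashlib.sha256(normalize(code).encode()).hexdigest()

def find_leaky_samples(pretrain_corpus, finetune_test):
    """Return fine-tuning test samples already seen during pre-training."""
    seen = {fingerprint(c) for c in pretrain_corpus}
    return [c for c in finetune_test if fingerprint(c) in seen]

# Illustrative data: the second test sample duplicates a pre-training one
# up to comments and whitespace.
pretrain = ["def add(a, b):  # sum\n    return a + b",
            "def sub(a, b): return a - b"]
test = ["def mul(a, b): return a * b",
        "def add(a, b):\n    return a + b"]
leaky = find_leaky_samples(pretrain, test)  # flags the duplicated add()
```

Research-grade deduplication is typically fuzzier than exact hashing (e.g., token-level similarity over identifier-renamed clones); the sketch only shows where the leaky/non-leaky split of the pre-training set would come from.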
