AI-Driven Refactoring: A Pipeline for Identifying and Correcting Data Clumps in Git Repositories

Nils Baumgartner,Padma Iyenghar,Padma Iyenghar,Elke Pulvermüller,Timo Schoemaker

doi:10.3390/electronics13091644

Abstract

Data clumps, groups of variables that repeatedly appear together across different parts of a software system, are indicative of poor code structure and can lead to potential issues such as maintenance challenges, testing complexity, and scalability concerns, among others. Addressing this, our study introduces an innovative AI-driven pipeline specifically designed for the refactoring of data clumps in software repositories. This pipeline leverages the capabilities of Large Language Models (LLM), such as ChatGPT, to automate the detection and resolution of data clumps, thereby enhancing code quality and maintainability. In developing this pipeline, we have taken into consideration the new European Union (EU)-Artificial Intelligence (AI) Act, ensuring that our pipeline complies with the latest regulatory requirements and ethical standards for use of AI in software development by outsourcing decisions to a human in the loop. Preliminary experiments utilizing ChatGPT were conducted to validate the effectiveness and efficiency of our approach. These tests demonstrate promising results in identifying and refactoring data clumps, but also the challenges using LLMs.

Full Text