Exploring ChatGPT’s code refactoring capabilities: An empirical study

Kayla DePalma, Izabel Miminoshvili, Chiara Henselder, Kate Moss, Eman Abdullah AlOmar

doi:10.1016/j.eswa.2024.123602

Abstract

ChatGPT has shown great potential in the field of software engineering with its ability to generate code. Yet ChatGPT’s ability to interpret code has been deemed unreliable and faulty, which raises concern about the platform’s ability to refactor code properly. To address this concern, we carried out a study to assess ChatGPT’s abilities and limitations in refactoring code. We divided the study into three parts: whether ChatGPT can refactor the code, whether the refactored code preserves the behavior of the original code segments, and whether ChatGPT can provide documentation for the refactored code that gives insight into intent, instructions, and impact. We focused specifically on eight quality attributes when prompting ChatGPT to refactor our dataset of 40 Java code segments. After collecting the refactored code segments from ChatGPT, along with data on whether behavior was preserved, we ran the refactored code through PMD, a source code analyzer, to find programming flaws. We also tested ChatGPT’s accuracy in generating documentation for the refactored code and analyzed how the results differed across the quality attributes. We conclude that ChatGPT can provide many useful refactoring changes that improve code quality. ChatGPT offered improved versions of the provided code segments in 39 of 40 cases, even when the improvement was as simple as suggesting clearer variable names or better formatting. ChatGPT was able to recommend numerous options, ranging from minor changes such as renaming methods and variables to major changes such as modifying the data structure. ChatGPT was strongest and most accurate when suggesting minor changes; it had difficulty understanding and addressing complex errors and operations.
Although most of the changes were minor, they made significant improvements, because converting loops, simplifying calculations, and removing redundant statements have a crucial effect on runtime, memory, and readability. However, our results also show that ChatGPT can be unpredictable in its responses, which threatens its reliability. Giving ChatGPT the same prompt often yields different results, so some outputs were more accurate than others. This variation and inconsistency make it difficult to fully assess ChatGPT’s capabilities. Because ChatGPT is limited by its reliance on its data set, it lacks understanding of the broader context, so it may occasionally make errors and suggest alterations that are neither applicable nor necessary. Overall, ChatGPT has proved to be a beneficial tool for programming, as it is capable of providing advantageous suggestions, even if only on a small scale. However, human programmers are still needed to oversee these changes and determine their significance. ChatGPT should be used as an aid to programmers, since we cannot yet depend on it completely.
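
As a purely illustrative sketch, the hypothetical before/after pair below shows the kind of minor refactoring the abstract describes (converting a loop, simplifying a calculation, removing redundant statements), together with a simple check that behavior is preserved on a few sample inputs. The class name, method names, and inputs are assumptions made for illustration; they are not taken from the study's dataset of 40 Java code segments or from ChatGPT's actual output.

```java
import java.util.List;

// Hypothetical example; not one of the study's code segments.
public class RefactoringSketch {

    // "Before": index-based loop with a redundant copy and a no-op branch.
    static int sumOfEvenSquaresOriginal(List<Integer> values) {
        int total = 0;
        for (int i = 0; i < values.size(); i++) {
            int current = values.get(i);
            int copy = current;                 // redundant temporary
            if (copy % 2 == 0) {
                total = total + (copy * copy);
            } else {
                total = total + 0;              // redundant no-op statement
            }
        }
        return total;
    }

    // "After": enhanced for loop, compound assignment, redundancy removed.
    static int sumOfEvenSquaresRefactored(List<Integer> values) {
        int total = 0;
        for (int value : values) {
            if (value % 2 == 0) {
                total += value * value;
            }
        }
        return total;
    }

    public static void main(String[] args) {
        // Sample inputs chosen for illustration, including empty and negative cases.
        List<List<Integer>> inputs = List.of(
                List.<Integer>of(),
                List.of(1, 3, 5),
                List.of(2, 4, 6),
                List.of(-4, -3, 0, 7, 10));

        // Behavior preservation here means both versions agree on every input.
        for (List<Integer> input : inputs) {
            int before = sumOfEvenSquaresOriginal(input);
            int after = sumOfEvenSquaresRefactored(input);
            if (before != after) {
                throw new AssertionError("Behavior changed for input " + input);
            }
        }
        System.out.println("Behavior preserved on all sample inputs");
    }
}
```

Agreement on a handful of inputs is only weak evidence of behavior preservation, which is consistent with the abstract's point that human programmers still need to review suggested changes.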

Full Text