Abstract

Recently, large language models (LLMs) have shown an extraordinary ability to understand natural language and to generate programming code. It has become common practice for software engineers to consult LLMs when encountering coding questions. Although efforts have been made to avoid syntax errors and align generated code with the intended semantics, the reliability and robustness of LLM code generation have not yet been thoroughly studied. Executable code is not equivalent to reliable and robust code, especially in the context of real-world software development. For example, API misuse in generated code can lead to severe problems such as resource leaks and program crashes. Existing code evaluation benchmarks and datasets focus on small, crafted tasks such as programming questions from coding interviews, which deviate from the problems developers actually bring to LLMs when seeking real-world coding help. To fill this gap, we propose RobustAPI, a dataset for evaluating the reliability and robustness of code generated by LLMs. We collect 1208 coding questions from Stack Overflow covering 18 representative Java APIs. We summarize the common misuse patterns of these APIs and use them to evaluate current popular LLMs. The evaluation results show that even for GPT-4, 62% of the generated code contains API misuses, which would cause unexpected consequences if the code were introduced into real-world software.
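To make the failure mode concrete: a typical Java API misuse of the kind described above is a resource leak caused by a stream that is not closed on every execution path. The following sketch is our own illustrative example (the class and method names are hypothetical, not taken from the dataset); it shows code that compiles and usually appears to work, alongside the robust idiom.

    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.io.IOException;

    public class ReadFirstLine {
        // Misuse: if readLine() throws, close() is never reached and the
        // file handle leaks. The code compiles and often appears to work.
        static String leaky(String path) throws IOException {
            BufferedReader reader = new BufferedReader(new FileReader(path));
            String line = reader.readLine();
            reader.close();
            return line;
        }

        // Correct usage: try-with-resources closes the reader on all paths,
        // including when readLine() throws an exception.
        static String robust(String path) throws IOException {
            try (BufferedReader reader = new BufferedReader(new FileReader(path))) {
                return reader.readLine();
            }
        }
    }

Such misuses pass the execution-based checks used by most existing benchmarks, which is precisely why they call for a dedicated evaluation like the one proposed here.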
