AI programming has become a popular topic in recent years, with code suggestion being one of its key capabilities. Copilot, an “AI programmer” launched by GitHub and OpenAI, provides code suggestions from natural language descriptions. To date, Copilot has been used by millions of developers. However, little work has systematically evaluated the correctness of Copilot's suggestions. We conducted an empirical study on all 2,033 LeetCode problems to assess Copilot's code generation across four mainstream languages: C, Java, JavaScript, and Python. We found that: 1) 70.0% of problems received at least one correct suggestion, with language-specific rates of 29.7% (C), 57.7% (Java), 54.1% (JavaScript), and 41.0% (Python); 2) correctness decreases as problem difficulty increases, with acceptance rates of 89.3% (Easy), 72.1% (Medium), and 43.4% (Hard); 3) acceptance rates vary across problem domains from 49.5% to 90.1%, with Graph problems being the most challenging for C and Python, and Prefix Sum and Heap problems the most challenging for Java and JavaScript; 4) for the incorrect suggestions, we summarize 17 types of errors and analyze possible causes for why they occur. We believe our study can provide valuable insights into Copilot's capabilities and limitations.