ComplexCodeEval: A Benchmark for Evaluating Large Code Models on More Complex Code
In recent years, as academia and industry have paid increasing attention to applying large language models (LLMs) to code-related tasks, a growing number of large code models (LCMs) have been proposed, and corresponding evaluation benchmarks have continually emerged. Although existing benchmarks are helpful for comparing different LCMs, they may not reflect LCM performance across the variety of real development scenarios. Specifically, most benchmarks evaluate model performance in only one type of scenario (e.g., code generation or code completion), whereas real development contexts are diverse and may involve multiple tasks, such as code generation, code completion, API recommendation, and test function generation. Additionally, benchmark samples may not originate from actual development practice and thus fail to capture the programming challenges developers face during development.