AutoGEEval++: A multi-level and multi-geospatial-modality automated evaluation framework for large language models in geospatial code generation on Google Earth Engine

Abstract

Geospatial code generation is crucial for integrating AI with geoscientific analysis, but standardized evaluation tools are lacking. This study presents AutoGEEval++, an enhanced framework for evaluating large language models (LLMs) that generate geospatial code on the Google Earth Engine (GEE) platform. Built on the GEE Python API, AutoGEEval++ includes a benchmark dataset, AutoGEEval++-Bench, comprising 6,365 test cases across 26 GEE data types and three task categories: unit tests, combination tests, and theme tests. The framework offers a fully automated evaluation pipeline, from code generation to execution-based validation, using multi-dimensional metrics such as accuracy, resource consumption, runtime efficiency, and error types. It also supports boundary testing and error pattern analysis. We assess 24 leading LLMs (as of June 2025) spanning general-purpose, reasoning-enhanced, code-centric, and geoscience-specific models. Experimental results highlight distinct performance, stability, and error patterns, demonstrating the framework’s scalability for vertical-domain code generation. This study establishes the first standardized evaluation protocol and resource suite for GEE-based LLM code generation, providing a unified benchmark and a methodology for evaluating the transition from natural language to domain-specific code, advancing geospatial AI research.
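The execution-based validation step at the heart of such a pipeline can be sketched in a few lines. The sketch below is illustrative, not AutoGEEval++'s actual harness: it runs a model-generated script in a subprocess, classifies the outcome (pass, wrong output, runtime error, timeout), and aggregates accuracy. The `TestCase` fields and the expected-stdout check are assumptions made for this example.

```python
import subprocess
import sys
import tempfile
from dataclasses import dataclass

@dataclass
class TestCase:
    """One benchmark item: a task prompt plus a reference check (hypothetical schema)."""
    prompt: str
    candidate_code: str   # code produced by the model under evaluation
    expected_stdout: str  # reference output for execution-based validation

def run_case(case: TestCase, timeout_s: float = 30.0) -> dict:
    """Execute candidate code in a subprocess and classify the result."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(case.candidate_code)
        path = f.name
    try:
        proc = subprocess.run([sys.executable, path],
                              capture_output=True, text=True, timeout=timeout_s)
        if proc.returncode != 0:
            return {"status": "runtime_error", "detail": proc.stderr.strip()}
        ok = proc.stdout.strip() == case.expected_stdout.strip()
        return {"status": "pass" if ok else "wrong_output", "detail": ""}
    except subprocess.TimeoutExpired:
        return {"status": "timeout", "detail": ""}

def accuracy(results: list[dict]) -> float:
    """Fraction of cases whose execution output matched the reference."""
    return sum(r["status"] == "pass" for r in results) / len(results)
```

A real harness would additionally record resource consumption and runtime per case, since those are among the framework's reported metrics.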

Similar Papers
  • Research Article
  • 10.3390/ijgi14070256
AutoGEEval: A Multimodal and Automated Evaluation Framework for Geospatial Code Generation on GEE with Large Language Models
  • Jun 30, 2025
  • ISPRS International Journal of Geo-Information
  • Huayi Wu + 10 more

Geospatial code generation is emerging as a key direction in the integration of artificial intelligence and geoscientific analysis. However, there remains a lack of standardized tools for automatic evaluation in this domain. To address this gap, we propose AutoGEEval, the first multimodal, unit-level automated evaluation framework for geospatial code generation tasks on the Google Earth Engine (GEE) platform powered by large language models (LLMs). Built upon the GEE Python API, AutoGEEval establishes a benchmark suite (AutoGEEval-Bench) comprising 1,325 test cases that span 26 GEE data types. The framework integrates both question generation and answer verification components to enable an end-to-end automated evaluation pipeline, from function invocation to execution validation. AutoGEEval supports multidimensional quantitative analysis of model outputs in terms of accuracy, resource consumption, execution efficiency, and error types. We evaluate 18 state-of-the-art LLMs, including general-purpose, reasoning-augmented, code-centric, and geoscience-specialized models, revealing their performance characteristics and potential optimization pathways in GEE code generation. This work provides a unified protocol and foundational resource for the development and assessment of geospatial code generation models, advancing the frontier of automated natural language to domain-specific code translation.

  • Research Article
  • 10.1080/13658816.2025.2577252
Extraction of geoprocessing modeling knowledge from crowdsourced Google Earth Engine scripts by coordinating large and small language models
  • Nov 1, 2025
  • International Journal of Geographical Information Science
  • Anqi Zhao + 7 more

The widespread use of online geoinformation platforms, such as Google Earth Engine (GEE), has produced numerous scripts. Extracting domain knowledge from these crowdsourced scripts supports understanding of geoprocessing workflows. Small Language Models (SLMs) are effective for semantic embedding but struggle with complex code; Large Language Models (LLMs) can summarize scripts, yet lack consistent geoscience terminology to express knowledge. In this paper, we propose Geo-CLASS, a knowledge extraction framework for geospatial analysis scripts that coordinates large and small language models. Specifically, we designed domain-specific schemas and a schema-aware prompt strategy to guide LLMs to generate and associate entity descriptions, and employed SLMs to standardize the outputs by mapping these descriptions to a constructed geoscience knowledge base. Experiments on 237 GEE scripts, selected from 295,943 scripts in total, demonstrated that our framework outperformed LLM baselines, including Llama-3, GPT-3.5 and GPT-4o. In comparison, the proposed framework improved accuracy in recognizing entities and relations by up to 31.9% and 12.0%, respectively. Ablation studies and performance analysis further confirmed the effectiveness of key components and the robustness of the framework. Geo-CLASS has the potential to enable the construction of geoprocessing modeling knowledge graphs, facilitate domain-specific reasoning and advance script generation via Retrieval-Augmented Generation (RAG).

  • Research Article
  • Cited by 1
  • 10.3390/aerospace12060498
Using Large Language Models for Aerospace Code Generation: Methods, Benchmarks, and Potential Values
  • May 30, 2025
  • Aerospace
  • Rui He + 4 more

In recent years, Large Language Models (LLMs) have witnessed rapid advancements, revolutionizing various domains. Within software development, code generation powered by LLMs has emerged as a prominent research focus. Despite its potential, the application of this technology in the aerospace sector remains in a nascent, exploratory phase. This paper examines LLM-based code generation methods and explores their potential applications in aerospace contexts. It introduces RepoSpace, the first repository-level code generation benchmark for spaceborne equipment. Comprising 825 samples from five actual projects, this benchmark offers a more precise evaluation of LLMs’ capabilities in aerospace scenarios. Extensive evaluations of seven state-of-the-art LLMs on RepoSpace reveal that domain-specific differences significantly affect code generation performance: existing LLMs perform poorly on specialized repository-level code generation tasks for aerospace, markedly below their performance on general-domain tasks. The research further demonstrates that Retrieval-Augmented Generation (RAG) can effectively enhance LLMs’ code generation capabilities, and that appropriate prompt templates can guide the models to better results. Moreover, high-quality documentation strings prove crucial for improving LLMs’ performance on repository-level code generation tasks. This study provides a vital reference for leveraging LLMs for code generation in the aerospace field, fostering technological innovation and progress in this critical domain.

  • Research Article
  • Cited by 3
  • 10.1145/3728963
Can LLMs Replace Human Evaluators? An Empirical Study of LLM-as-a-Judge in Software Engineering
  • Jun 22, 2025
  • Proceedings of the ACM on Software Engineering
  • Ruiqi Wang + 5 more

Recently, large language models (LLMs) have been deployed to tackle various software engineering (SE) tasks such as code generation, significantly advancing the automation of SE. However, assessing the quality of LLM-generated code and text remains challenging. The commonly used Pass@k metric requires extensive unit tests and configured environments, incurs high labor cost, and is not suitable for evaluating LLM-generated text. Conventional metrics like BLEU, which measure only lexical rather than semantic similarity, have also come under scrutiny. In response, a new trend has emerged of employing LLMs for automated evaluation, known as LLM-as-a-judge. These LLM-as-a-judge methods are claimed to better mimic human assessment than conventional metrics without relying on high-quality reference answers. Nevertheless, how closely they align with human judgment in SE tasks remains unexplored. In this paper, we empirically explore LLM-as-a-judge methods for evaluating SE tasks, focusing on their alignment with human judgments. We select seven LLM-as-a-judge methods that utilize general-purpose LLMs, alongside two LLMs specifically fine-tuned for evaluation. After generating and manually scoring LLM responses on three recent SE datasets covering code translation, code generation, and code summarization, we prompt these methods to evaluate each response and compare their scores with the human evaluation. The results indicate that output-based methods reach the highest Pearson correlations with human scores, 81.32 in code translation and 68.51 in code generation, achieving near-human evaluation and noticeably outperforming ChrF++, one of the best conventional metrics, at 34.23 and 64.92. Such output-based methods prompt LLMs to output judgments directly and exhibit more balanced score distributions that resemble human score patterns.
Finally, we provide insights and implications, concluding that current state-of-the-art LLM-as-a-judge methods can potentially replace human evaluations in certain SE tasks.
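Pearson correlation, the alignment measure used above, can be computed directly from paired judge and human scores. A minimal sketch with made-up 1–5 scores (not the study's data):

```python
from math import sqrt

def pearson(xs: list[float], ys: list[float]) -> float:
    """Pearson correlation coefficient between two equal-length score lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical 1-5 quality scores for six responses.
human = [5, 4, 2, 3, 1, 4]
judge = [5, 3, 2, 3, 1, 5]
alignment = pearson(human, judge)
```

A correlation near 1.0 means the judge ranks responses essentially as humans do; the study's 81.32 figures are reported on a 0–100 scale of the same quantity.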

  • Research Article
  • Cited by 5
  • 10.1145/3715908
Large Language Model-Aware In-Context Learning for Code Generation
  • Feb 28, 2025
  • ACM Transactions on Software Engineering and Methodology
  • Chongyang Tao + 5 more

Large Language Models (LLMs) have shown impressive In-Context Learning (ICL) ability in code generation. LLMs take a prompt context consisting of a few demonstration examples and a new requirement as input, and output new programs without any parameter update. Existing studies have found that the performance of ICL-based code generation heavily depends on the quality of the demonstration examples, which has given rise to research on demonstration example selection: given a new requirement, a few demonstration examples are selected from a candidate pool, and LLMs are expected to learn the pattern hidden in these selected examples. Existing approaches mostly rely on heuristics or random selection. However, the distribution of randomly selected examples usually varies greatly, making the performance of LLMs less robust, and the heuristics retrieve examples by considering only the textual similarity of requirements, leading to sub-optimal performance. To fill this gap, we propose a Large language model-Aware selection approach for In-context-Learning-based code generation, named LAIL. LAIL uses LLMs themselves to select examples: it requires LLMs to label a candidate example as positive or negative for a given requirement. Positive examples help LLMs generate correct programs, while negative examples are unhelpful and should be ignored. Based on the labeled positive and negative data, LAIL trains a model-aware retriever to learn the preferences of LLMs and select the demonstration examples they need. During inference, given a new requirement, LAIL uses the trained retriever to select a few examples and feeds them into LLMs to generate the desired programs. We apply LAIL to four widely used LLMs and evaluate it on five code generation datasets.
Extensive experiments demonstrate that LAIL outperforms the state-of-the-art (SOTA) baselines by 11.58%, 3.33%, and 5.07% on CodeGen-Multi-16B, 1.32%, 2.29%, and 1.20% on CodeLlama-34B, and achieves 4.38%, 2.85%, and 2.74% improvements on Text-davinci-003 in terms of Pass@1 at MBJP, MBPP, and MBCPP, respectively. In addition to function-level code generation, LAIL improves the performance of LLMs on DevEval, a repository-level code generation dataset, which achieves 10.04%, 8.12%, and 4.63% improvements compared to the SOTA baselines at Pass@1, 3, and 5 on CodeLlama-7B. Human evaluation further verifies that the generated programs of LAIL are superior in correctness, code quality, and maintainability. Besides, LAIL has satisfactory transferability across different LLMs and datasets, where the retriever learned on one LLM (dataset) can be transferred to other LLMs (datasets).
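The textual-similarity heuristic that LAIL improves upon can be sketched as bag-of-words cosine retrieval over the candidate pool, followed by prompt assembly. All examples below are hypothetical; a real setup would use learned embeddings or, in LAIL's case, a retriever trained on the LLM's own positive/negative labels:

```python
from collections import Counter
from math import sqrt

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bag-of-words vectors."""
    dot = sum(a[t] * b[t] for t in a)
    na = sqrt(sum(v * v for v in a.values()))
    nb = sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def select_demonstrations(query: str, pool: list[tuple[str, str]], k: int = 2):
    """Pick the k (requirement, code) pairs most textually similar to the query."""
    q = Counter(query.lower().split())
    return sorted(pool,
                  key=lambda ex: cosine(q, Counter(ex[0].lower().split())),
                  reverse=True)[:k]

def build_prompt(query: str, demos: list[tuple[str, str]]) -> str:
    """Assemble an ICL prompt: demonstrations first, new requirement last."""
    parts = [f"# Requirement: {r}\n{c}" for r, c in demos]
    parts.append(f"# Requirement: {query}\n")
    return "\n\n".join(parts)
```

The contrast with LAIL is exactly the retrieval key: here similarity is over requirement text alone, whereas LAIL ranks candidates by how much they actually help the target LLM produce correct programs.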

  • Research Article
  • Cited by 3
  • 10.15388/lmitt.2024.20
Unit Test Generation Using Large Language Models: A Systematic Literature Review
  • May 13, 2024
  • Vilnius University Open Series
  • Dovydas Marius Zapkus + 1 more

Unit testing is a fundamental aspect of software development, ensuring the correctness and robustness of code implementations. Traditionally, unit tests are manually crafted by developers based on their understanding of the code and its requirements. However, this process can be time-consuming and error-prone, and may overlook certain edge cases. In recent years, there has been growing interest in leveraging large language models (LLMs) to automate the generation of unit tests. LLMs such as GPT (Generative Pre-trained Transformer), CodeT5, StarCoder, and LLaMA have demonstrated remarkable capabilities in natural language understanding and code generation tasks. By using LLMs, researchers aim to develop techniques that automatically generate unit tests from code snippets or specifications, thus optimizing the software testing process. This paper presents a literature review of articles that use LLMs for unit test generation tasks. It also discusses the history of the most commonly used large language models and their parameters, including when they were first used for code generation tasks. The study surveys the large language models applied to code and unit test generation and documents their increasing popularity in the code generation domain, indicating great promise for the future of unit test generation using LLMs.

  • Research Article
  • Cited by 1
  • 10.1080/10095020.2025.2505556
GEE-OPs: an operator knowledge base for geospatial code generation on the Google Earth Engine platform powered by large language models
  • May 21, 2025
  • Geo-spatial Information Science
  • Shuyang Hou + 3 more

As spatiotemporal data grows in complexity, utilizing geospatial modeling on the Google Earth Engine (GEE) platform poses challenges in improving coding efficiency for experts and enhancing the coding capabilities of interdisciplinary users. To address these challenges, we propose a framework for constructing a geospatial operator knowledge base tailored to the GEE JavaScript API. The framework includes an operator syntax knowledge table, an operator relationship frequency knowledge table, an operator frequent pattern knowledge table, and an operator relationship chain knowledge table. Leveraging Abstract Syntax Tree (AST) techniques and frequent itemset mining, we extract operator knowledge from 295,943 real GEE scripts and syntax documentation, forming a structured knowledge base. Experimental results demonstrate that the proposed framework achieves scores ranging from 87% to 93% on operator relationship extraction tasks, measured by accuracy, recall, and F1. In operator relationship chain extraction tasks, the framework achieves a performance range of 0.79 to 0.89 across LCS, N-gram, Siamese, and BERT-based evaluations. In geospatial code generation tasks, GEE-OPs improves the executability of mainstream Large Language Models (LLMs) by 38.0% to 44.9%, enhances correctness by 24.1% to 47.2%, and boosts readability by 4.7% to 7.6%. Ablation experiments further validate the essential role of each knowledge table in enhancing model performance. Additionally, key performance indicators – including response time, lines of code, token consumption, and memory usage – are documented to assist readers in replicating the experiments and gaining deeper insights into system performance. This work advances geospatial code modeling techniques and facilitates the application of LLMs in geoinformatics, contributing to the integration of generative AI into the field.
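The AST-based extraction of operator relationship chains can be illustrated with Python's built-in `ast` module (the paper targets the GEE JavaScript API, so this is a simplified Python analogue, not the paper's implementation). The sketch recovers the ordered method names of a fluent GEE-style call chain:

```python
import ast

def call_chain(node: ast.AST) -> list[str]:
    """Recover the ordered operator names in a fluent call chain."""
    names = []
    while isinstance(node, ast.Call):
        func = node.func
        if isinstance(func, ast.Attribute):   # e.g. .filterDate(...)
            names.append(func.attr)
            node = func.value
        elif isinstance(func, ast.Name):      # chain root, e.g. foo(...)
            names.append(func.id)
            break
        else:
            break
    return list(reversed(names))

def extract_operator_chains(source: str) -> list[list[str]]:
    """Parse a script and return every maximal operator chain (length > 1)."""
    tree = ast.parse(source)
    seen: set[int] = set()
    chains = []
    for node in ast.walk(tree):  # parents are visited before children
        if isinstance(node, ast.Call) and id(node) not in seen:
            chain = call_chain(node)
            if len(chain) > 1:
                chains.append(chain)
            # mark inner calls of this chain so sub-chains are not re-emitted
            inner = node
            while isinstance(inner, ast.Call):
                seen.add(id(inner))
                inner = inner.func.value if isinstance(inner.func, ast.Attribute) else None
    return chains
```

Mined over a large script corpus, such chains would feed the operator relationship frequency and frequent pattern tables described above.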

  • Research Article
  • Cited by 1
  • 10.1145/3715109
HumanEvalComm: Benchmarking the Communication Competence of Code Generation for LLMs and LLM Agent
  • Jan 27, 2025
  • ACM Transactions on Software Engineering and Methodology
  • Jie Jw Wu + 1 more

Large language models (LLMs) have significantly improved their ability to perform tasks in the field of code generation. However, there is still a gap between LLMs being capable coders and being top-tier software engineers. The most recent trend is using LLM-based agents to iterate the code generation process. Based on the observation that top-level software engineers often ask clarifying questions to reduce ambiguity in both requirements and coding solutions, we argue that the same should apply to LLMs for code generation tasks. For this purpose, we define the communication skills of LLMs as “being able to ask clarifying questions when the description of the code generation problem has issues”. In this study, we restrict these issues to three matters from the software requirements engineering field: inconsistent requirements, ambiguous requirements, and incomplete requirements. By asking probing questions about the requirements of problem descriptions before generating the final code, challenges of programming with LLMs, such as unclear intent specification, may be alleviated, resulting in correct code in the initial iterations. In this work, we conducted an empirical study on benchmarking and analyzing the communication skills of LLMs for code generation. We created a new benchmark, HumanEvalComm, by modifying problem descriptions according to the three issues mentioned above: Inconsistency, Ambiguity, and Incompleteness. We then experimented on HumanEvalComm with different Code LLMs and a new LLM agent approach, the Code Clarification and Generation Agent (Okanagan), which identifies and asks questions about ambiguous parts of code and descriptions to further refine the generated code. In the evaluation, we introduced an LLM-based evaluator and created Communication Rate and Good Question Rate as evaluation metrics to represent the ratio of questions asked and of good-quality questions in responses.
We found that more than 60% of responses from Code LLMs still generate code rather than ask questions when the problem descriptions are manually modified according to different clarification categories. The Pass@1 and Test Pass Rate of most Code LLMs drop by 35%–52% and by 17%–35%, respectively, with statistical significance in each category for over 75% of the measured values. Okanagan, as an LLM agent approach that uses an LLM such as ChatGPT 3.5, effectively increases the Communication Rate and Good Question Rate by an absolute 58% and 38%, respectively. Thus, Okanagan boosts Pass@1 and Test Pass Rate by an absolute 8% and 7%, respectively, when the problem descriptions are modified based on the given clarification categories. This result indicates the potential for achieving more effective communication capability using LLM agents. Our benchmark and full code are publicly available at https://github.com/jie-jw-wu/human-eval-comm .
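The two metrics can be computed as simple ratios. The sketch below is one plausible reading of the definitions above (Good Question Rate taken over the responses that asked a question), with hypothetical per-response labels such as an LLM-based evaluator might produce:

```python
def communication_metrics(responses: list[dict]) -> dict:
    """Compute Communication Rate and Good Question Rate from labeled responses.

    Each response dict carries two boolean labels (assumed schema):
      asked_question - the model asked a clarifying question instead of coding
      good_question  - that question was judged good quality
    """
    n = len(responses)
    asked = [r for r in responses if r["asked_question"]]
    comm_rate = len(asked) / n if n else 0.0
    good_rate = sum(r["good_question"] for r in asked) / len(asked) if asked else 0.0
    return {"communication_rate": comm_rate, "good_question_rate": good_rate}
```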

  • Research Article
  • Cited by 3
  • 10.1145/3770084
A Survey on LLM-based Code Generation for Low-Resource and Domain-Specific Programming Languages
  • Oct 7, 2025
  • ACM Transactions on Software Engineering and Methodology
  • Sathvik Joel + 2 more

Large Language Models (LLMs) have shown remarkable capabilities in code generation for popular programming languages. However, their performance on Low-Resource Programming Languages (LRPLs) and Domain-Specific Languages (DSLs) remains a critical challenge. This gap affects millions of developers (Rust alone has 3.5 million users) who are currently unable to fully leverage LLM capabilities. LRPLs and DSLs face unique challenges, including severe data scarcity and, for DSLs, highly specialized syntax and semantics that are poorly represented in general-purpose datasets. Addressing these challenges is crucial, as LRPLs and DSLs significantly enhance development efficiency in specialized domains and applications, including financial and scientific work. While several surveys on LLMs for software engineering and code exist, none comprehensively address the challenges and opportunities specific to LRPLs and DSLs. Our survey fills this gap by providing a systematic review of the current state, methodologies, and challenges in leveraging LLMs for code generation in LRPLs and DSLs. We filtered 111 papers from over 27,000 studies published between 2020 and 2024 to understand the capabilities and limitations of LLMs in these specialized domains, and expanded our literature search to include 5 recent papers from 2024–2025. We report the LLMs used, benchmarks, and metrics to evaluate code generation in LRPLs and DSLs, as well as strategies used to enhance LLM performance and the collected datasets and curation methods in this context. We identified four main evaluation techniques used in the literature, along with several metrics to assess code generation in LRPLs and DSLs. We categorized the methods used for LLM improvement into six main groups and summarized the novel methods and architectures proposed by the researchers. We also classified different approaches used for data collection and preparation.
While different techniques, metrics, and datasets are used, there is a lack of a standard approach and a benchmark dataset to evaluate code generation in several LRPLs and DSLs. We discuss several distinctions between the studied approaches and those used in high-resource programming languages (HRPLs), as well as several challenges unique to these languages, especially DSLs. The challenges stem from the scarcity of data, the unique requirements, and the specialized domains, which often need expert guidelines or domain-specific tools. Accordingly, we provide insights into different research opportunities for the studied aspects. This survey serves as a comprehensive resource for researchers and practitioners working at the intersection of LLMs, software engineering, and specialized programming languages, providing a foundation for future advancements in LRPL and DSL code generation. A GitHub repository was created to organize the papers of this survey at https://github.com/jie-jw-wu/Survey-CodeLLM4LowResource-DSL .

  • Research Article
  • 10.55041/ijsrem36242
ProgAI: Enhancing Code Generation with LLMs For Real World Challenges
  • Jul 4, 2024
  • INTERANTIONAL JOURNAL OF SCIENTIFIC RESEARCH IN ENGINEERING AND MANAGEMENT
  • Afsal Ahamad A + 2 more

Large Language Models (LLMs) have shown promise in automated code generation but still produce erroneous code, for reasons such as hallucination. Real-world software development, however, often involves complex requirements with intricate dependencies and extensive documentation. To fill this gap, our research pivots toward evaluating LLMs in a more realistic setting: real-world repository-level code generation. We assess nine leading LLMs on such code generation tasks and observe a decline in their performance. To tackle this, we present ProgAI, a novel LLM-based agent framework that employs external tools for effective code generation across four programming languages: C++, Java, Python, and C. ProgAI integrates four programming tools, enabling interaction with software artifacts for information retrieval, code symbol navigation, and code testing, and we implement four agent strategies to optimize these tools’ usage. Our experiments show that ProgAI enhances LLM performance significantly, with improvements ranging from 18.1% to 25%. Further tests on the HumanEval benchmark confirm ProgAI’s adaptability and efficacy across various code generation tasks. Notably, ProgAI outperforms commercial products such as GitHub Copilot, showcasing superior accuracy and efficiency. These results demonstrate ProgAI’s robust capabilities in code generation, highlighting its potential for real-world repository-level coding challenges.

  • Research Article
  • Cited by 1
  • 10.47363/jaicc/2023(2)442
AI-Powered Code Generation Evaluating the Effectiveness of Large Language Models (LLMs) in Automated Software Development
  • Mar 31, 2023
  • Journal of Artificial Intelligence & Cloud Computing
  • Ravikanth Konda

The rapid evolution of Artificial Intelligence (AI) has brought about significant advancements in multiple domains, including software development. One of the most promising innovations is AI-powered code generation through Large Language Models (LLMs), such as OpenAI’s GPT-3 and GPT-4. These models, trained on large amounts of programming data, can produce human-readable code from natural language inputs, offering significant potential to simplify and optimize software development processes. This paper analyzes the performance of LLMs in automated software development by testing them on a variety of tasks, including code generation, debugging, and software optimization. The research explores the strengths and weaknesses of these models against key indicators such as code quality, generation time, and maintainability. We observe that although LLMs hold immense potential to automate mundane programming tasks and enhance developer productivity, they still struggle with more intricate, domain-specific programming tasks requiring a higher level of understanding, such as architecture design and top-level decision-making. Despite these shortcomings, LLMs can greatly enhance software development processes, particularly for small-scale projects, or serve as assistants to senior developers. The paper concludes by reflecting on the potential of LLMs to transform future software development, while noting that reliability, code quality, and security must improve before they can be applied to larger, more critical uses.

  • Research Article
  • 10.1088/1742-6596/1486/2/022020
Model of the impact of traffic congestion based on Google Earth Engine: take Zhongguancun Street as an example
  • Apr 1, 2020
  • Journal of Physics: Conference Series
  • Yukang Fan

Traffic congestion in Beijing has become a serious problem and is hindering the development of the city. It also causes inconvenience in traveling, especially commuting. Google Earth and Google Earth Engine (GEE) can provide much information about traffic and the surrounding environment. However, few studies utilize GEE and traffic data to analyze the effect of congestion on the city's development and people's lives. Therefore, to explore and analyze the causes of traffic congestion and ultimately put forward a viable solution, we model the impact of traffic congestion based on Google Earth and Google Earth Engine, taking Zhongguancun Street as an example. The results show that congestion at the intersection of the main street has a radiating influence of about three kilometers on the main road, and according to the big data, traffic congestion peaks at 8:00 am and 6:00 pm every day. We conclude that the GEE platform is a powerful and promising tool for effectively analyzing traffic problems, and our subsequent research can be further developed on this platform.

  • Research Article
  • 10.1145/3728947
The First Prompt Counts the Most! An Evaluation of Large Language Models on Iterative Example-Based Code Generation
  • Jun 22, 2025
  • Proceedings of the ACM on Software Engineering
  • Yingjie Fu + 4 more

The capabilities of Large Language Models (LLMs) in code generation have been extensively studied, particularly for implementing target functionalities from natural-language descriptions. As an alternative to natural language, input-output (I/O) examples provide an accessible, unambiguous, and flexible way to describe functionalities. However, their inherent diversity, opaqueness, and incompleteness impose greater challenges for understanding and implementing the target requirements. Therefore, generating code from I/O examples (i.e., example-based code generation) provides a new perspective, allowing us to additionally evaluate LLMs’ capability to infer target functionalities from limited information and to process new-form requirements. However, example-based code generation with LLMs remains largely unexplored. To fill this gap, this paper presents the first comprehensive study on example-based code generation using LLMs. To address the incorrectness caused by the incompleteness of I/O examples, we adopt an iterative evaluation framework and formalize the objective of example-based code generation as two sequential sub-objectives: generating code conforming to the given examples and generating code that successfully implements the target functionalities from (iteratively) given examples. We assess six state-of-the-art LLMs using a new benchmark of 172 diverse target functionalities (derived from HumanEval and CodeHunt). The results demonstrate that when requirements are described using iterative I/O examples rather than natural language, the LLMs’ score decreases by over 60%, indicating that example-based code generation remains challenging for the evaluated LLMs. Notably, the vast majority (even over 95%) of successfully implemented functionalities are achieved in the first round of the iterations, suggesting that the LLMs struggle to effectively utilize the iteratively supplemented requirements.
Furthermore, we find that combining I/O examples with even imprecise and fragmental natural language descriptions greatly improves LLM performance, and the selection of initial I/O examples can also influence the score, suggesting opportunities for prompt optimization. These findings highlight the importance of early prompts during interactions and offer critical insights and implications for enhancing LLM-based code generation.

  • Research Article
  • 10.1145/3772721
Exploring Data-Efficient Adaptation of Large Language Models for Code Generation
  • Oct 27, 2025
  • ACM Transactions on Software Engineering and Methodology
  • Xue Jiang + 5 more

Although Large Language Models (LLMs) have made significant progress in code generation, they still struggle with code generation tasks in specific scenarios. These scenarios usually necessitate the adaptation of LLMs to fulfill specific needs, but the limited training data available in practice leads to poor code generation performance. Therefore, how to effectively adapt LLMs to new scenarios with little training data is a major challenge for current code generation. In this paper, we propose a novel adaptation approach named DEED, which stands for Data-Efficient adaptation with Error-Driven learning for code generation. DEED leverages the errors made by LLMs as learning opportunities, using error revision to overcome their own shortcomings and thus achieve efficient learning. Specifically, DEED involves identifying error code generated by LLMs, employing Self-Revise for code revision, optimizing the model with the revised code, and iterating the process for continuous improvement. Experimental results show that, compared to other mainstream fine-tuning approaches, DEED achieves superior performance with little training data, showing an average relative improvement of 46.2% in Pass@1 on multiple code generation benchmarks. We also validate the effectiveness of Self-Revise, which generates revised code that optimizes the model more efficiently than the code samples from datasets. Moreover, DEED consistently demonstrates strong performance across various LLMs, underscoring its applicability.

  • Research Article
  • Cited by 4
  • 10.1145/3660807
Do Large Language Models Pay Similar Attention Like Human Programmers When Generating Code?
  • Jul 12, 2024
  • Proceedings of the ACM on Software Engineering
  • Bonan Kou + 4 more

Large Language Models (LLMs) have recently been widely used for code generation. Due to the complexity and opacity of LLMs, little is known about how these models generate code. We made the first attempt to bridge this knowledge gap by investigating whether LLMs attend to the same parts of a task description as human programmers during code generation. An analysis of six LLMs, including GPT-4, on two popular code generation benchmarks revealed a consistent misalignment between LLMs' and programmers' attention. We manually analyzed 211 incorrect code snippets and found five attention patterns that can be used to explain many code generation errors. Finally, a user study showed that model attention computed by a perturbation-based method is often favored by human programmers. Our findings highlight the need for human-aligned LLMs for better interpretability and programmer trust.
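Perturbation-based attention, as referenced in the user study above, follows an occlusion idea: mask one input token at a time, re-score the model's output, and take the score drop as that token's importance. A minimal sketch with a stand-in scoring function (a real study would use the model's likelihood of its generated code; the keyword-based scorer here is purely illustrative):

```python
def perturbation_attention(tokens: list[str], score, mask: str = "<mask>") -> list[float]:
    """Importance of each token = score drop when that token is masked out."""
    base = score(tokens)
    return [base - score(tokens[:i] + [mask] + tokens[i + 1:])
            for i in range(len(tokens))]

# Stand-in scorer: counts how many task-critical keywords survive masking.
keywords = {"sort", "descending"}
toy_score = lambda toks: sum(t in keywords for t in toks)

tokens = "sort the list in descending order".split()
attn = perturbation_attention(tokens, toy_score)
```

Under this scorer, masking "sort" or "descending" lowers the score while masking filler words does not, so only those tokens receive positive importance, mirroring how the perturbation method highlights the task-description parts the model actually relies on.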
