Evaluating AI Models for Autograding Explain in Plain English Questions: Challenges and Considerations
Code-reading ability has traditionally been under-emphasized in assessments because it is difficult to assess at scale. Prior research has shown that code reading and code writing are closely related skills; being able to assess and train code-reading skills may therefore be necessary for student learning. One way to assess code-reading ability is with Explain in Plain English (EiPE) questions, which ask students to describe in natural language what a piece of code does. Previous research deployed a binary (correct/incorrect) autograder based on bigram models that performed comparably to human teaching assistants on student responses. With a dataset of 3,064 student responses to 17 EiPE questions, we investigated multiple autograders for EiPE questions, ranging from logistic regression trained on bigram features, to Support Vector Machines (SVMs) trained on embeddings from Large Language Models (LLMs), to GPT-4. We found multiple useful autograders, most with accuracies in the 86-88% range, each with different advantages: SVMs trained on LLM embeddings had the highest accuracy; few-shot chat completion with GPT-4 required minimal human effort; pipelines with multiple autograders for specific dimensions (what we call 3D autograders) can provide fine-grained feedback; and code generation with GPT-4 leverages automatic code testing as a grading mechanism, at the cost of slightly more lenient grading standards. While piloting these autograders in a non-majors introductory Python course, students held largely similar views of all the autograders, although they more often found the GPT-based and code-generation graders helpful, and liked the code-generation grader the most.
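The simplest grader evaluated above can be sketched in a few lines with scikit-learn; the toy responses and labels below are illustrative stand-ins for the paper's data, not its actual training set.

```python
# Hedged sketch: a binary EiPE autograder using bigram features and
# logistic regression, in the spirit of the simplest method evaluated.
# The tiny toy dataset is illustrative only.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

responses = [
    "returns the sum of all even numbers in the list",   # correct
    "computes the total of the even elements",           # correct
    "adds one to each element of the list",              # incorrect
    "prints every number in the list",                   # incorrect
]
labels = [1, 1, 0, 0]  # 1 = correct explanation, 0 = incorrect

# Unigram + bigram counts as features, then a linear classifier.
grader = make_pipeline(
    CountVectorizer(ngram_range=(1, 2)),
    LogisticRegression(max_iter=1000),
)
grader.fit(responses, labels)

print(grader.predict(["sums the even numbers in the list"])[0])
```

The same pipeline shape accommodates the stronger variant in the paper by swapping the vectorizer for LLM embeddings and the classifier for an SVM.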
- Research Article
- 10.1287/ijds.2023.0007
- Apr 1, 2023
- INFORMS Journal on Data Science
How Can IJDS Authors, Reviewers, and Editors Use (and Misuse) Generative AI?
- Research Article
- 10.37082/ijirmps.v13.i3.232555
- Jun 7, 2025
- International Journal of Innovative Research in Engineering & Multidisciplinary Physical Sciences
The Data Analysis Web Application is a full-stack platform that integrates Large Language Models (LLMs) to let users query, analyze, and visualize datasets in a local database using natural-language inputs rather than code or commands. A primary goal is to lower the barrier to entry for data exploration: by letting users ask questions in plain English or describe the analysis they need, the application reduces the need for advanced technical skills such as SQL programming or scripting, broadening access to data-driven insights across an organization. The backend uses DSPy as a framework to orchestrate interactions with the integrated LLMs, which translate natural-language requests into SQL queries executable against the database, perform trend analysis directly on the data, and interpret data patterns to produce clear, actionable insights. Connectivity to the local PostgreSQL database is handled via Psycopg2, providing the real-time data access needed for dynamic analysis and quick turnaround on queries. The user-facing side is built with the Streamlit framework, providing an interactive and user-friendly interface.
This frontend makes exploring data, visualizing findings, and interacting with the LLM-generated analytical outputs straightforward, letting users focus on understanding their data and its implications. By combining modern web development practices with LLMs and a local database setup (serving as a proof of concept), the application gives teams and individuals the tools to uncover valuable insights quickly, fostering a data-driven environment without the traditional technical overhead.
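The query flow described above can be sketched end to end. Since the system's actual DSPy prompts and PostgreSQL schema are not given, a toy rule-based translator and an in-memory SQLite table stand in for the LLM and database here; only the pipeline shape is intended to match.

```python
# Hedged sketch of the NL -> SQL -> result flow. A real deployment would
# call an LLM (e.g., via DSPy) and connect to PostgreSQL with Psycopg2;
# the stand-ins below keep the sketch self-contained and runnable.
import sqlite3

def translate_to_sql(question: str) -> str:
    """Stand-in for the LLM step: map a known question shape to SQL."""
    if "how many" in question.lower():
        return "SELECT COUNT(*) FROM sales"
    return "SELECT region, amount FROM sales"

conn = sqlite3.connect(":memory:")  # hypothetical toy schema
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany("INSERT INTO sales VALUES (?, ?)",
                 [("north", 120.0), ("south", 80.0)])

sql = translate_to_sql("How many sales records are there?")
count = conn.execute(sql).fetchone()[0]
print(sql, "->", count)
```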
- Research Article
- 10.2196/77561
- Jan 15, 2026
- JMIR medical informatics
Parkinson disease (PD) presents diagnostic challenges due to its heterogeneous motor and nonmotor manifestations. Traditional machine learning (ML) approaches have been evaluated on structured clinical variables. However, the diagnostic utility of large language models (LLMs) using natural language representations of structured clinical data remains underexplored. This study aimed to evaluate the diagnostic classification performance of multiple LLMs using natural language prompts derived from structured clinical data and to compare their performance with traditional ML baselines. We reformatted structured clinical variables from the Parkinson's Progression Markers Initiative (PPMI) dataset into natural language prompts and used them as inputs for several LLMs. Variables with high multicollinearity were removed, and the top 10 features were selected using Shapley additive explanations (SHAP)-based feature ranking. LLM performance was examined across few-shot prompting, dual-output prompting that additionally generated post hoc explanatory text as an exploratory component, and supervised fine-tuning. Logistic regression (LR) and support vector machine (SVM) classifiers served as ML baselines. Model performance was evaluated using F1-scores on both the test set and a temporally independent validation set (temporal validation set) of limited size, and repeated output generation was carried out to assess stability. On the test set of 122 participants, LR and SVM trained on the 10 SHAP-selected clinical variables each achieved a macro-averaged F1-score of 0.960 (accuracy 0.975). LLMs receiving natural language prompts derived from the same variables reached comparable performance, with the best few-shot configurations achieving macro-averaged F1-scores of 0.987 (accuracy 0.992). In the temporal validation set of 31 participants, LR maintained a macro-averaged F1-score of 0.903, whereas SVM showed substantial performance degradation. 
In contrast, multiple LLMs sustained high diagnostic performance, reaching macro-averaged F1-scores up to 0.968 and high recall for PD. Repeated output generation across LLM conditions produced generally stable predictions, with rare variability observed across runs. Under dual-output prompting, diagnostic performance showed a reduction relative to few-shot prompting while remaining generally stable. Supervised fine-tuning of lightweight models improved stability and enabled GPT-4o-mini to achieve a macro-averaged F1-score of 0.987 on the test set, with uniformly correct predictions observed in the small temporal validation set, which should be interpreted cautiously given the limited sample size and exploratory nature of the evaluation. This study provides an exploratory benchmark of how modern LLMs process structured clinical variables in natural language form. While several models achieved diagnostic performance comparable to LR across both the test and temporal validation datasets, their outputs were sensitive to prompting formats, model choice, and class distributions. Occasional variability across repeated output generations reflected the stochastic nature of LLMs, and lightweight models required supervised fine-tuning for stable generalization. These findings highlight the capabilities and limitations of current LLMs in handling tabular clinical information and underscore the need for cautious application and further investigation.
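The study's distinctive preprocessing step, reformatting structured clinical variables into a natural-language prompt, might look like the sketch below; the variable names and wording are hypothetical, not the PPMI schema.

```python
# Hedged sketch: serialize a record of structured clinical variables
# into a natural-language classification prompt for an LLM. Field names
# and phrasing are illustrative assumptions.
def variables_to_prompt(record: dict) -> str:
    lines = [f"- {name.replace('_', ' ')}: {value}"
             for name, value in record.items()]
    return ("Given the following clinical measurements, classify the "
            "participant as 'PD' or 'control'.\n" + "\n".join(lines))

record = {"age": 63, "UPDRS_part3_score": 22, "smell_test_percentile": 8}
prompt = variables_to_prompt(record)
print(prompt)
```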
- Research Article
- 10.1145/3728947
- Jun 22, 2025
- Proceedings of the ACM on Software Engineering
The capabilities of Large Language Models (LLMs) in code generation have been extensively studied, particularly for implementing target functionalities from natural-language descriptions. As an alternative to natural language, input-output (I/O) examples provide an accessible, unambiguous, and flexible way to describe functionalities. However, their inherent diversity, opaqueness, and incompleteness pose greater challenges for understanding and implementing the target requirements. Therefore, generating code from I/O examples (i.e., example-based code generation) provides a new perspective, allowing us to additionally evaluate LLMs’ capability to infer target functionalities from limited information and to process new-form requirements. However, example-based code generation with LLMs remains largely unexplored. To fill this gap, this paper presents the first comprehensive study on example-based code generation using LLMs. To address the incorrectness caused by the incompleteness of I/O examples, we adopt an iterative evaluation framework and formalize the objective of example-based code generation as two sequential sub-objectives: generating code conforming to the given examples and generating code that successfully implements the target functionalities from (iteratively) given examples. We assess six state-of-the-art LLMs using a new benchmark of 172 diverse target functionalities (derived from HumanEval and CodeHunt). The results demonstrate that when requirements are described using iterative I/O examples rather than natural language, the LLMs’ score decreases by over 60%, indicating that example-based code generation remains challenging for the evaluated LLMs. Notably, the vast majority (even over 95%) of successfully implemented functionalities are achieved in the first round of the iterations, suggesting that the LLMs struggle to effectively utilize the iteratively supplemented requirements.
Furthermore, we find that combining I/O examples with even imprecise and fragmentary natural language descriptions greatly improves LLM performance, and that the selection of initial I/O examples can also influence the score, suggesting opportunities for prompt optimization. These findings highlight the importance of early prompts during interactions and offer critical insights and implications for enhancing LLM-based code generation.
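The core check of the iterative framework described above, finding a failing I/O example to supplement into the next round's prompt, can be sketched as follows; the toy candidates stand in for LLM-generated code.

```python
# Hedged sketch of one iteration step in example-based code generation:
# verify a candidate against I/O examples and return the first failure,
# which would be appended to the next prompt.
def first_failure(func, examples):
    """Return the first (input, expected) pair func gets wrong, else None."""
    for x, expected in examples:
        if func(x) != expected:
            return (x, expected)
    return None

given = [(2, 4), (3, 9)]             # examples shown to the model
hidden = [(0, 0), (-2, 4), (5, 25)]  # held-out examples

wrong = lambda x: x + x  # conforms to (2, 4) but misreads the target
right = lambda x: x * x  # the intended functionality: squaring

print(first_failure(wrong, given))           # a failure to feed back
print(first_failure(right, given + hidden))  # all examples pass
```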
- Research Article
- 10.1145/3737884
- Jul 22, 2025
- ACM Transactions on Computing Education
Explain-in-Plain-English (EiPE) questions are used by some researchers and educators to assess code reading skills. EiPE questions require students to briefly explain (in plain English) the purpose of a given piece of code, without restating what the code does line-by-line. The premise is that novices who can explain the purpose of a piece of code have higher code reading skills than those who can trace the code but cannot see its high-level purpose. However, using natural language in EiPE questions poses challenges. Students (especially those whose first language is not English) may struggle to convey their understanding of the code unambiguously. Also, grading responses written in natural language is time-consuming, requires the design of a rubric, and is difficult to automate. We propose a new code reading question type that addresses these issues with EiPE questions. Given a piece of code involving repetition (in the form of iteration or recursion), the student is asked to provide the output for a set of inputs, where the output for some of these inputs cannot be determined using code tracing alone and requires higher-level code comprehension. In empirical evaluations, using CS1 exams, think-aloud interviews with students, and interviews with instructors, we found that assessments of code reading skills using the new question type are highly consistent with the assessments using EiPE questions, yet are more reliable. These results put forward the proposed question type as another way to assess high-level code reading skills without the issues associated with expressing in natural language or grading responses expressed in natural language.
- Research Article
- 10.1145/3715908
- Feb 28, 2025
- ACM Transactions on Software Engineering and Methodology
Large Language Models (LLMs) have shown impressive In-Context Learning (ICL) ability in code generation. LLMs take a prompt context consisting of a few demonstration examples and a new requirement as input, and output new programs without any parameter update. Existing studies have found that the performance of ICL-based code generation heavily depends on the quality of demonstration examples, motivating research on selecting demonstration examples: given a new requirement, a few demonstration examples are selected from a candidate pool, and LLMs are expected to learn the pattern hidden in these selected examples. Existing approaches mostly rely on heuristics or random selection. However, the distribution of randomly selected examples usually varies greatly, making the performance of LLMs less robust, while the heuristics retrieve examples by considering only the textual similarity of requirements, leading to sub-optimal performance. To fill this gap, we propose a Large language model-Aware selection approach for In-context-Learning-based code generation, named LAIL. LAIL uses LLMs themselves to select examples: it requires the LLMs to label a candidate example as a positive or negative example for a requirement. Positive examples help LLMs generate correct programs, while negative examples are trivial and should be ignored. Based on the labeled positive and negative data, LAIL trains a model-aware retriever to learn the preferences of LLMs and select the demonstration examples they need. During inference, given a new requirement, LAIL uses the trained retriever to select a few examples and feeds them into the LLM to generate the desired program. We apply LAIL to four widely used LLMs and evaluate it on five code generation datasets.
Extensive experiments demonstrate that LAIL outperforms the state-of-the-art (SOTA) baselines by 11.58%, 3.33%, and 5.07% on CodeGen-Multi-16B, 1.32%, 2.29%, and 1.20% on CodeLlama-34B, and achieves 4.38%, 2.85%, and 2.74% improvements on Text-davinci-003 in terms of Pass@1 on MBJP, MBPP, and MBCPP, respectively. Beyond function-level code generation, LAIL improves the performance of LLMs on DevEval, a repository-level code generation dataset, achieving 10.04%, 8.12%, and 4.63% improvements over the SOTA baselines at Pass@1, 3, and 5 on CodeLlama-7B. Human evaluation further verifies that the programs generated with LAIL are superior in correctness, code quality, and maintainability. Moreover, LAIL has satisfactory transferability across different LLMs and datasets: a retriever learned on one LLM (dataset) can be transferred to other LLMs (datasets).
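For reference, Pass@1 and its generalizations are usually computed with the standard unbiased pass@k estimator over n sampled programs of which c pass the tests; assuming this paper follows the common definition, a minimal sketch:

```python
# Standard unbiased pass@k estimator: 1 - C(n-c, k) / C(n, k),
# i.e., the probability that at least one of k drawn samples passes.
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """n = total samples, c = correct samples, k = draw size."""
    if n - c < k:  # fewer failures than draws: success guaranteed
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

print(pass_at_k(10, 3, 1))  # with k=1 this is simply c/n = 0.3
```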
- Research Article
- 10.3390/aerospace12060498
- May 30, 2025
- Aerospace
In recent years, Large Language Models (LLMs) have witnessed rapid advancements, revolutionizing various domains. Within software development, code generation technology powered by LLMs has emerged as a prominent research focus. Despite its potential, the application of this technology in the aerospace sector remains in a nascent, exploratory phase. This paper delves into LLM-based code generation methods and explores their potential applications in aerospace contexts. It introduces RepoSpace, the first repository-level code generation benchmark for spaceborne equipment. Comprising 825 samples from five actual projects, the benchmark offers a more precise evaluation of LLMs’ capabilities in aerospace scenarios. Through extensive evaluations of seven state-of-the-art LLMs on RepoSpace, the study reveals that domain-specific differences significantly impact the code generation performance of LLMs: existing LLMs perform poorly on specialized repository-level code generation tasks for aerospace, markedly worse than on general-domain tasks. The research further demonstrates that Retrieval-Augmented Generation (RAG) can effectively enhance LLMs’ code generation capabilities, and that appropriate prompt templates can guide the models to superior results. Moreover, high-quality documentation strings prove crucial for improving LLMs’ performance on repository-level code generation tasks. This study provides a vital reference for leveraging LLMs for code generation in the aerospace field, fostering technological innovation and progress in this critical domain.
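The retrieval step of a RAG pipeline like the one evaluated here can be sketched with a toy similarity function; real systems use learned embeddings, and the file names and document texts below are hypothetical.

```python
# Hedged sketch of RAG retrieval: rank repository documents against a
# query and pick the best match to prepend to the code-generation
# prompt. Bag-of-words cosine similarity stands in for embeddings.
from collections import Counter
from math import sqrt

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = sqrt(sum(v * v for v in a.values()))
    nb = sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

docs = {  # hypothetical repository documentation
    "attitude_control.md": "quaternion based attitude control loop for the spacecraft",
    "telemetry.md": "telemetry frame packing and downlink scheduling",
    "power.md": "solar panel power budget estimation",
}
query = "implement the attitude control loop"
q = Counter(query.split())
best = max(docs, key=lambda name: cosine(q, Counter(docs[name].split())))
print(best)
```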
- Research Article
- 10.1145/3715109
- Jan 27, 2025
- ACM Transactions on Software Engineering and Methodology
Large language models (LLMs) have significantly improved their ability to perform tasks in the field of code generation. However, there is still a gap between LLMs being capable coders and being top-tier software engineers. The most recent trend is using LLM-based agents to iterate the code generation process. Based on the observation that top-level software engineers often ask clarifying questions to reduce ambiguity in both requirements and coding solutions, we argue that the same should apply to LLMs for code generation tasks. For this purpose, we define the communication skills of LLMs as “being able to ask clarifying questions when the description of the code generation problem has issues”. In this study, we restrict these issues to three matters from the software requirements engineering field: inconsistent requirements, ambiguous requirements, and incomplete requirements. By asking probing questions about the requirements of problem descriptions before generating the final code, the challenges of programming with LLMs, such as unclear intent specification, may be alleviated, resulting in correct code in earlier iterations. In this work, we conducted an empirical study benchmarking and analyzing the communication skills of LLMs for code generation. We created a new benchmark, HumanEvalComm, by modifying problem descriptions according to the three issues above: inconsistency, ambiguity, and incompleteness. We then experimented on HumanEvalComm with different Code LLMs and a new LLM agent approach, the Code Clarification and Generation Agent (Okanagan), which identifies ambiguous parts of code and descriptions and asks clarifying questions to further refine the generated code. In the evaluation, we introduced an LLM-based evaluator and created Communication Rate and Good Question Rate as evaluation metrics, representing the ratio of responses that ask questions and the ratio of those questions with good quality.
We found that more than 60% of responses from Code LLMs still generate code rather than ask questions when the problem descriptions are manually modified according to different clarification categories. The Pass@1 and Test Pass Rate of most Code LLMs drop by 35%-52% and 17%-35%, respectively, with statistical significance in each category for over 75% of the measurements. Okanagan, an LLM agent approach that uses an LLM such as ChatGPT 3.5, effectively increases the Communication Rate and Good Question Rate by an absolute 58% and 38%, respectively. As a result, Okanagan boosts Pass@1 and Test Pass Rate by an absolute 8% and 7%, respectively, when the problem descriptions are modified based on the given clarification categories. This result indicates the potential for achieving more effective communication using LLM agents. Our benchmark and full code are publicly available at https://github.com/jie-jw-wu/human-eval-comm .
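The two metrics can be sketched directly from their definitions; the stub judge below stands in for the paper's LLM-based evaluator, and the toy responses are illustrative.

```python
# Hedged sketch: Communication Rate = fraction of responses that ask a
# clarifying question rather than emit code; Good Question Rate =
# fraction of those questions judged good (here by a toy stub judge).
def communication_rate(responses):
    asked = [r for r in responses if r["asked_question"]]
    return len(asked) / len(responses), asked

def good_question_rate(asked, judge):
    if not asked:
        return 0.0
    return sum(judge(r) for r in asked) / len(asked)

responses = [
    {"asked_question": True,  "question": "Should ties be broken by index?"},
    {"asked_question": False, "question": None},
    {"asked_question": True,  "question": "ok?"},
    {"asked_question": False, "question": None},
]
stub_judge = lambda r: len(r["question"].split()) > 3  # toy quality check
cr, asked = communication_rate(responses)
gqr = good_question_rate(asked, stub_judge)
print(cr, gqr)
```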
- Research Article
- 10.1227/neu.0000000000002809_227
- Apr 1, 2024
- Neurosurgery
INTRODUCTION: Surgical research demands the development of clinical registries, often through time-intensive manual chart review. Natural language processing (NLP) may accelerate registry development, and an ideal automatic registry (autoregistry) algorithm would be highly accurate while requiring minimal manual data annotation. NLP approaches, including bespoke Regular Expression (RegEx) classifiers and Large Language Models (LLMs), possess distinct strengths and weaknesses and have not been compared in the setting of autoregistry development. METHODS: We used an institutional data lake to retrieve 31,502 neurosurgical operative notes. A standardized set of spinal procedures was chosen for inclusion in the autoregistry. A set of 200 manually annotated notes was used for training and testing. RegEx classifiers were engineered to retrieve procedural information from unprocessed notes. A family of 110-million-parameter BERT models, including LLMs pre-trained on clinical text, was fine-tuned for the same tasks. We also tested an open-source 7-billion-parameter LLM chatbot, Vicuna, without fine-tuning. RESULTS: The RegEx classifiers identified spinal procedures and associated vertebral levels in nearly 99% of operative notes. Fine-tuned LLMs identified common procedures (e.g., spinal fusion and laminectomy) with greater than 95% accuracy but performed poorly on rarer procedures (e.g., XLIF, corpectomy) and on vertebral body identification. Qualitative evaluation of the Vicuna chatbot showed potential for the same tasks after iteratively refined prompting. CONCLUSIONS: The goal of autoregistry development is to minimize time- and labor-intensive manual chart review. We found that fine-tuned LLMs could not match the accuracy and efficiency of the RegEx classifiers. However, LLMs may be well-suited to expanding existing clinical databases that provide a robust training set.
Further work combining NLP approaches will attempt to develop a pipeline for autoregistry development from natural language (plain English) queries.
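A RegEx classifier of the kind described can be sketched in a few patterns; these patterns and the sample note are illustrative, not the study's actual expressions or data.

```python
# Hedged sketch: regular expressions that pull procedure names and
# vertebral levels (e.g., "L4-L5") from free-text operative notes.
import re

PROCEDURES = re.compile(r"\b(laminectomy|fusion|corpectomy|XLIF)\b", re.I)
LEVELS = re.compile(r"\b([CTL]\d{1,2})\s*-\s*([CTL]\d{1,2})\b")

note = "Patient underwent L4-L5 laminectomy and posterior fusion."
procedures = [m.group(1).lower() for m in PROCEDURES.finditer(note)]
levels = LEVELS.findall(note)
print(procedures, levels)
```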
- Research Article
- 10.1145/3715727
- Jun 19, 2025
- Proceedings of the ACM on Software Engineering
Code generation has largely improved development efficiency in the era of large language models (LLMs). With the ability to follow instructions, current LLMs can be prompted to generate code solutions given detailed descriptions in natural language. Many research efforts are devoted to improving the correctness of LLM-generated code, and many benchmarks have been proposed to evaluate correctness comprehensively. Despite the focus on correctness, the time efficiency of LLM-generated code solutions is under-explored. Current correctness benchmarks are not suitable for time efficiency evaluation because their test cases cannot distinguish well between the time efficiency of different code solutions. Moreover, current execution time measurement is neither stable nor comprehensive, threatening the validity of time efficiency evaluation. To address these challenges, we propose COFFE, a code generation benchmark for evaluating the time efficiency of LLM-generated code solutions. COFFE contains 398 and 358 problems for function-level and file-level code generation, respectively. To improve distinguishability, we design a novel stressful test case generation approach with contracts and two new formats of test cases to improve the accuracy of generation. For the time evaluation metric, we propose efficient@k, based on CPU instruction count, to ensure a stable and solid comparison between different solutions. We evaluate 14 popular LLMs on COFFE and identify four findings, from which we draw implications for LLM researchers and software practitioners to facilitate future research and usage of LLMs in code generation.
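Hardware CPU instruction counting is platform-specific, so as a self-contained stand-in for the idea behind efficient@k, the sketch below counts Python bytecode operations, which share the key property of instruction counts: they are deterministic across runs, unlike wall-clock time.

```python
# Hedged sketch: count executed bytecode operations via opcode tracing
# as a deterministic proxy for CPU instruction counts.
import sys

def count_ops(func, *args):
    counter = {"ops": 0}
    def tracer(frame, event, arg):
        frame.f_trace_opcodes = True  # request per-opcode events
        if event == "opcode":
            counter["ops"] += 1
        return tracer
    sys.settrace(tracer)
    try:
        func(*args)
    finally:
        sys.settrace(None)
    return counter["ops"]

slow = lambda n: sum(i for i in range(n))  # linear-time sum
fast = lambda n: n * (n - 1) // 2          # closed-form equivalent
print(count_ops(slow, 1000) > count_ops(fast, 1000))
```

The same harness run twice on the same solution yields identical counts, which is exactly the stability property the benchmark seeks from instruction-based measurement.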
- Research Article
- 10.1145/3728963
- Jun 22, 2025
- Proceedings of the ACM on Software Engineering
Recently, large language models (LLMs) have been deployed to tackle various software engineering (SE) tasks like code generation, significantly advancing the automation of SE tasks. However, assessing the quality of this LLM-generated code and text remains challenging. The commonly used Pass@k metric necessitates extensive unit tests and configured environments, demands high labor cost, and is not suitable for evaluating LLM-generated text. Conventional metrics like BLEU, which measure only lexical rather than semantic similarity, have also come under scrutiny. In response, a new trend has emerged to employ LLMs for automated evaluation, known as LLM-as-a-judge. These LLM-as-a-judge methods are claimed to mimic human assessment better than conventional metrics without relying on high-quality reference answers. Nevertheless, how closely they align with human judgment in SE tasks remains unexplored. In this paper, we empirically explore LLM-as-a-judge methods for evaluating SE tasks, focusing on their alignment with human judgments. We select seven LLM-as-a-judge methods that utilize general-purpose LLMs, alongside two LLMs specifically fine-tuned for evaluation. After generating and manually scoring LLM responses on three recent SE datasets covering code translation, code generation, and code summarization, we then prompt these methods to evaluate each response. Finally, we compare the scores generated by these methods with human evaluation. The results indicate that output-based methods reach the highest Pearson correlations with human scores, 81.32 in code translation and 68.51 in code generation, achieving near-human evaluation and noticeably outperforming ChrF++, one of the best conventional metrics, at 34.23 and 64.92. Such output-based methods prompt LLMs to output judgments directly and exhibit more balanced score distributions that resemble human score patterns.
Overall, we provide insights and implications, concluding that current state-of-the-art LLM-as-a-judge methods can potentially replace human evaluation in certain SE tasks.
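The alignment measurement itself is a Pearson correlation between judge-assigned and human-assigned scores; a minimal sketch with toy score vectors (not the paper's data):

```python
# Pearson correlation between two score vectors, computed from the
# standard definition (covariance over the product of standard
# deviations). Toy 1-5 scores below are illustrative.
from math import sqrt

def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

human = [5, 3, 4, 2, 1]
judge = [4, 3, 5, 2, 1]
print(round(pearson(human, judge), 4))  # 0.9
```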
- Research Article
- 10.3390/electronics13112002
- May 21, 2024
- Electronics
Large language models (LLMs) have demonstrated remarkable capabilities in natural language processing tasks, including conversation, in-context learning, reasoning, and code generation. This paper explores the potential application of LLMs in radiological information systems (RIS) and assesses the impact of integrating LLMs on RIS development and human–computer interaction. We present ChatUI-RIS, a prototype chat-based user interface that leverages LLM capabilities to enhance RIS functionality and user experience. Through an exploratory study involving 26 medical students, we investigate the efficacy of natural language dialogue for learning and operating RIS. Our findings suggest that LLM integration via a chat interface can significantly improve operational efficiency, reduce learning time, and facilitate rapid expansion of RIS capabilities. By interacting with ChatUI-RIS using natural language instructions, medical students can access and retrieve radiology information in a conversational manner. The LLM-powered chat interface not only streamlines user interactions, but also enables more intuitive and efficient navigation of complex RIS functionalities. Furthermore, the natural language processing capabilities of LLMs can be harnessed to automatically generate code snippets and database queries, accelerating RIS development and customization. Preliminary observations indicate that integrating LLMs in RIS has the potential to revolutionize user interface design, enhance system capabilities, and ultimately improve the overall user experience for radiologists and medical professionals.
- Research Article
- 10.1145/3770084
- Oct 7, 2025
- ACM Transactions on Software Engineering and Methodology
Large Language Models (LLMs) have shown remarkable capabilities in code generation for popular programming languages. However, their performance in Low-Resource Programming Languages (LRPLs) and Domain-Specific Languages (DSLs) remains a critical challenge. This gap affects millions of developers (Rust alone has 3.5 million users) who currently cannot fully leverage LLM capabilities. LRPLs and DSLs face unique challenges, including severe data scarcity and, for DSLs, highly specialized syntax and semantics that are poorly represented in general-purpose datasets. Addressing these challenges is crucial, as LRPLs and DSLs significantly enhance development efficiency in specialized domains and applications, including financial and scientific work. While several surveys on LLMs for software engineering and code exist, none comprehensively address the challenges and opportunities specific to LRPLs and DSLs. Our survey fills this gap with a systematic review of the current state, methodologies, and challenges in leveraging LLMs for code generation in LRPLs and DSLs. We filtered 111 papers from over 27,000 studies published from 2020-2024 to understand the capabilities and limitations of LLMs in these specialized domains, and expanded our literature search to include 5 recent papers from 2024-2025. We report the LLMs used, the benchmarks and metrics for evaluating code generation in LRPLs and DSLs, the strategies used to enhance LLM performance, and the collected datasets and curation methods in this context. We identified four main evaluation techniques used in the literature, along with several metrics for assessing code generation in LRPLs and DSLs. We categorized the methods used for LLM improvement into six main groups and summarized the novel methods and architectures proposed by researchers. We also classified the different approaches used for data collection and preparation.
While different techniques, metrics, and datasets are used, there is no standard approach or benchmark dataset for evaluating code generation across LRPLs and DSLs. We discuss several distinctions between the studied approaches and those used in high-resource programming languages (HRPLs), as well as several challenges unique to these languages, especially DSLs. The challenges stem from data scarcity, unique requirements, and specialized domains, which often require expert guidelines or domain-specific tools. Accordingly, we provide insights into research opportunities for each of the studied aspects. This survey serves as a comprehensive resource for researchers and practitioners working at the intersection of LLMs, software engineering, and specialized programming languages, providing a foundation for future advancements in LRPL and DSL code generation. A GitHub repository organizing the papers of this survey is available at https://github.com/jie-jw-wu/Survey-CodeLLM4LowResource-DSL .
- Research Article
- 10.1145/3722108
- Oct 4, 2025
- ACM Transactions on Software Engineering and Methodology
Large Language Models (LLMs) are gaining momentum in software development, with prompt-driven programming enabling developers to create code from Natural Language (NL) instructions. However, studies have questioned their ability to produce secure code and, thereby, the quality of prompt-generated software. In parallel, various prompting techniques that carefully tailor prompts have emerged to elicit optimal responses from LLMs. Still, the interplay between such prompting strategies and secure code generation remains under-explored and calls for further investigation. Objective: In this study, we investigate the impact of different prompting techniques on the security of code generated from NL instructions by LLMs. Method: First, we performed a systematic literature review to identify existing prompting techniques that can be used for code generation tasks. A subset of these techniques was then evaluated on the GPT-3, GPT-3.5, and GPT-4 models for secure code generation, using an existing dataset of 150 NL security-relevant code generation prompts. Results: Our work (i) classifies potential prompting techniques for code generation, (ii) adapts and evaluates a subset of the identified techniques for secure code generation tasks, and (iii) observes a reduction in security weaknesses across the tested LLMs, especially after using an existing technique called Recursive Criticism and Improvement (RCI), contributing valuable insights to the ongoing discourse on LLM-generated code security.
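RCI's generate-critique-improve loop can be sketched with stub functions standing in for the LLM calls; the toy security "critic" below flags a hard-coded secret, an illustrative assumption rather than the paper's actual prompts.

```python
# Hedged sketch of a Recursive Criticism and Improvement (RCI) loop:
# generate code, ask the model to critique it, then improve it, until
# the critique comes back empty or rounds are exhausted.
def rci(prompt, generate, critique, improve, rounds=2):
    code = generate(prompt)
    for _ in range(rounds):
        issues = critique(code)
        if not issues:
            break
        code = improve(code, issues)
    return code

# Toy stubs in place of LLM calls: the "critic" flags a hard-coded
# password; the "improver" swaps it for an environment lookup.
generate = lambda p: 'password = "hunter2"'
critique = lambda c: ["hard-coded secret"] if '"hunter2"' in c else []
improve = lambda c, issues: 'password = os.environ["APP_PASSWORD"]'

result = rci("connect to the database", generate, critique, improve)
print(result)
```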
- Research Article
- 10.47363/jaicc/2023(2)442
- Mar 31, 2023
- Journal of Artificial Intelligence & Cloud Computing
The rapid evolution of Artificial Intelligence (AI) has brought significant advancements to multiple domains, including software development. One of the most promising innovations is AI-powered code generation through Large Language Models (LLMs), such as OpenAI’s GPT-3 and GPT-4. Trained on large amounts of programming data, these models can produce human-readable code from natural language inputs, offering significant potential to simplify and optimize software development processes. This paper analyzes the performance of LLMs in automated software development by testing them on a variety of tasks, including code generation, debugging, and software optimization. The research explores both the strengths and weaknesses of these models in terms of key indicators such as code quality, generation time, and maintainability. We observe that although LLMs hold immense potential to automate mundane programming tasks and enhance developer productivity, they still struggle with more intricate, domain-specific programming tasks that require a higher level of understanding, such as designing architectures and high-level decision-making. Despite these shortcomings, LLMs can substantially enhance software development, particularly for small-scale projects, or serve as assistants to more senior developers. The paper concludes by reflecting on the potential of LLMs to transform software development in the future, while noting that model reliability, code quality, and security must improve before LLMs can be applied to larger, more critical uses.