VerLog: Enhancing Release Note Generation for Android Apps using Large Language Models
Release notes are essential documents that communicate the details of software updates to users and developers, yet their generation remains a time-consuming and error-prone process. In this paper, we present VerLog, a novel technique that enhances the generation of software release notes using Large Language Models (LLMs). VerLog leverages few-shot in-context learning with adaptive prompting to facilitate the graph reasoning capabilities of LLMs, enabling them to accurately interpret and document the semantic information of code changes. Additionally, VerLog incorporates multi-granularity information, including fine-grained code modifications and high-level non-code artifacts, to guide the generation process and ensure comprehensive, accurate, and readable release notes. We applied VerLog to the 42 releases of 248 unique Android applications and conducted extensive evaluations. Our results demonstrate that VerLog significantly (up to 18%–21% higher precision, recall, and F1) outperforms state-of-the-art baselines in terms of completeness, accuracy, readability, and overall quality of the generated release notes, in both controlled experiments with high-quality reference release notes and in-the-wild evaluations.
- Conference Article
- 10.1145/3711875.3729128
- Jun 23, 2025
While large language models (LLMs) are endowed with broad knowledge, their task-specific performance is often suboptimal. Fine-tuning LLMs with task-specific data from diverse nodes is necessary, but this data is typically safeguarded and not shared publicly due to privacy concerns. A common solution involves downstream nodes downloading the LLM locally and fine-tuning it with their proprietary data. However, owners often regard pre-trained LLMs as valuable assets and are reluctant to share them. Additionally, the significant computational resources required by LLMs make local fine-tuning impractical for many nodes. To mitigate these problems, this paper proposes CrossLM, a data-free collaborative fine-tuning framework for large and small language models. CrossLM enables resource-constrained nodes to train smaller language models (SLMs) using their private task-specific data. These SLMs are subsequently leveraged to promote the task-specific natural language generation and understanding capabilities of the LLMs. Simultaneously, the SLMs of nodes also benefit from enhancement by the fine-tuned LLMs. In this way, CrossLM avoids sharing private data and proprietary LLMs, and also reduces the resource requirements of nodes. Through extensive experiments across a range of benchmark tasks and popular language models, we demonstrate that CrossLM significantly boosts the task-specific performance of both LLMs and SLMs while preserving the generalization capabilities of LLMs.
- Research Article
106
- 10.1038/s41746-024-01024-9
- Feb 19, 2024
- NPJ Digital Medicine
Large language models (LLMs) have been shown to have significant potential in few-shot learning across various fields, even with minimal training data. However, their ability to generalize to unseen tasks in more complex fields, such as biology and medicine has yet to be fully evaluated. LLMs can offer a promising alternative approach for biological inference, particularly in cases where structured data and sample size are limited, by extracting prior knowledge from text corpora. Here we report our proposed few-shot learning approach, which uses LLMs to predict the synergy of drug pairs in rare tissues that lack structured data and features. Our experiments, which involved seven rare tissues from different cancer types, demonstrate that the LLM-based prediction model achieves significant accuracy with very few or zero samples. Our proposed model, the CancerGPT (with ~ 124M parameters), is comparable to the larger fine-tuned GPT-3 model (with ~ 175B parameters). Our research contributes to tackling drug pair synergy prediction in rare tissues with limited data, and also advancing the use of LLMs for biological and medical inference tasks.
- Research Article
11
- 10.1287/ijds.2023.0007
- Apr 1, 2023
- INFORMS Journal on Data Science
How Can <i>IJDS</i> Authors, Reviewers, and Editors Use (and Misuse) Generative AI?
- Research Article
43
- 10.1055/a-2264-5631
- Feb 26, 2024
- RoFo : Fortschritte auf dem Gebiete der Rontgenstrahlen und der Nuklearmedizin
Large language models (LLMs) such as ChatGPT have shown significant potential in radiology. Their effectiveness often depends on prompt engineering, which optimizes the interaction with the chatbot for accurate results. Here, we highlight the critical role of prompt engineering in tailoring the LLMs' responses to specific medical tasks. Using a clinical case, we elucidate different prompting strategies to adapt the LLM ChatGPT using GPT4 to new tasks without additional training of the base model. These approaches range from precision prompts to advanced in-context methods such as few-shot and zero-shot learning. Additionally, the significance of embeddings, which serve as a data representation technique, is discussed. Prompt engineering substantially improved and focused the chatbot's output. Moreover, embedding of specialized knowledge allows for more transparent insight into the model's decision-making and thus enhances trust. Despite certain challenges, prompt engineering plays a pivotal role in harnessing the potential of LLMs for specialized tasks in the medical domain, particularly radiology. As LLMs continue to evolve, techniques like few-shot learning, zero-shot learning, and embedding-based retrieval mechanisms will become indispensable in delivering tailored outputs. · Large language models might impact radiological practice and decision-masking.. · However, implementation and performance are dependent on the assigned task.. · Optimization of prompting strategies can substantially improve model performance.. · Strategies for prompt engineering range from precision prompts to zero-shot learning.. · Russe MF, Reisert M, Bamberg F et al. Improving the use of LLMs in radiology through prompt engineering: from precision prompts to zero-shot learning . Fortschr Röntgenstr 2024; 196: 1166 - 1170.
- Research Article
3
- 10.1109/access.2024.3419079
- Jan 1, 2024
- IEEE Access
Large language models’ exceptional all-purpose abilities have made human-computer conversations normal, but for particular industries and verticals, they fall short of enhancing the expertise of knowledge and the timeliness of information. In order to give current information, and provide improved search capabilities, large language models need to increasingly incorporate specialist resources and databases. In this research, a model for intelligent assisted decision-making was proposed that the model incorporates knowledge from domain-specific databases and real-time data and uses large language models to offer expert tax guidance. The research proposed to overcome the limits of general-purpose language models and deliver specialized advise for tax-related inquiries by complementing large language models with domain-specific information.The results we achieve demonstrate that by offering tax advice tailored to a given situation, and the model we proposed goes beyond the validity of general large language language models. Our contribution is that not only exploring the combination of tax area and large language model, but also proposing a new effective model for government tax department to use in real life. This study highlights the potential of big language models for use in real-world professional domains and advances the field of domain-specific human-computer interaction.
- Research Article
4
- 10.1145/3735129
- Jan 21, 2026
- ACM Transactions on Software Engineering and Methodology
Within the realm of software engineering, specialized tasks on code, such as program repair, present unique challenges, necessitating fine-tuning Large language models (LLMs) to unlock state-of-the-art performance. Fine-tuning approaches proposed in the literature for LLMs on program repair tasks generally overlook the need to reason about the logic behind code changes, beyond syntactic patterns in the data. High-performing fine-tuning experiments also usually come at very high computational costs. With MORepair , we propose a novel perspective on the learning focus of LLM fine-tuning for program repair: we not only adapt the LLM parameters to the syntactic nuances of the task of code transformation (objective ➊), but we also specifically fine-tune the LLM with respect to the logical reason behind the code change in the training data (objective ➋). Such a multi-objective fine-tuning will instruct LLMs to generate high-quality patches. We apply MORepair to fine-tune four open-source LLMs with different sizes and architectures. Experimental results on function-level and repository-level repair benchmarks show that the implemented fine-tuning effectively boosts LLM repair performance by 11.4% to 56.0%. We further show that our fine-tuning strategy yields superior performance compared to the state-of-the-art approaches, including standard fine-tuning, Fine-tune-CoT, and RepairLLaMA.
- Research Article
27
- 10.1145/3709358
- Jul 1, 2025
- ACM Transactions on Software Engineering and Methodology
Developers deal with code-change-related tasks daily, e.g., reviewing code. Pre-trained code and code-change-oriented models have been adapted to help developers with such tasks. Recently, large language models (LLMs) have shown their effectiveness in code-related tasks. However, existing LLMs for code focus on general code syntax and semantics rather than the differences between two code versions. Thus, it is an open question how LLMs perform on code-change-related tasks. To answer this question, we conduct an empirical study using \(>\) 1B parameters LLMs on three code-change-related tasks, i.e., code review generation, commit message generation, and just-in-time comment update, with in-context learning (ICL) and parameter-efficient fine-tuning (PEFT, including LoRA and prefix-tuning). We observe that the performance of LLMs is poor without examples and generally improves with examples, but more examples do not always lead to better performance. LLMs tuned with LoRA have comparable performance to the state-of-the-art small pre-trained models. Larger models are not always better, but Llama 2 and Code Llama families are always the best. The best LLMs outperform small pre-trained models on the code changes that only modify comments and perform comparably on other code changes. We suggest future work should focus more on guiding LLMs to learn the knowledge specific to the changes related to code rather than comments for code-change-related tasks.
- Research Article
- 10.1038/s41746-026-02588-4
- Apr 1, 2026
- NPJ digital medicine
Exploring large language models (LLMs) performance in the specific medical domain can help understand their generalizability in real-world application. We assessed the predictive and decision-support value of two state-of-the-art LLMs in predicting bone cement leakage (BCL) and new vertebral fractures (NVF) after percutaneous kyphoplasty (PKP) and to compare them with those of traditional machine learning (TML) and spine surgeon. This study utilized combined retrospective and prospective data at a single tertiary hospital. Two LLMs (GPT-5 and DeepSeek R1) with zero- and few-shot strategy, five TML models, and two spine surgeons with/without exposure to LLM responses, were asked to predict complications based on demographic, perioperative baseline, and radiographic data. We also tested LLMs' ability to predict complication subtype. For BCL prediction, both LLMs demonstrated acceptable performance (F1-score, 0.857-0.871; MCC, 0.164-0.332) under zero-shot conditions, comparable to TML models (F1-score, 0.758-0.867; MCC, 0.265-0.416), and slightly superior to surgeons alone (F1-score, 0.675-0.684; MCC, 0.074-0.185). Few-shot prompting enhanced specificity but yielded uncertain overall gains. For NVF prediction, the zero-shot LLM performance was poor (F1-score, 0.309; MCC, 0.044) but improved with few-shot learning. The RBF-SVM model showed the best performance for NVF prediction (F1-score, 0.536; MCC, 0.414). LLM explanations enhanced surgeon performance in BCL prediction but not in NVF. LLMs showed poor prediction of complication subtypes. The findings suggest that current LLMs hold diverse predictive performances for different complications after PKP, they are still immature for real clinical applicability and need further improvement.
- Research Article
61
- 10.1016/j.jbi.2024.104630
- Mar 26, 2024
- Journal of Biomedical Informatics
Model tuning or prompt Tuning? a study of large language models for clinical concept and relation extraction
- Research Article
21
- 10.1093/jamia/ocae090
- Jul 1, 2024
- Journal of the American Medical Informatics Association : JAMIA
Large language models (LLMs) have demonstrated remarkable generalization and across diverse tasks, leading individuals to increasingly use them as personal assistants due to their emerging reasoning capabilities. Nevertheless, a notable obstacle emerges when including numerical/temporal data into these prompts, such as data sourced from wearables or electronic health records. LLMs employ tokenizers in their input that break down text into smaller units. However, tokenizers are not designed to represent numerical values and might struggle to understand repetitive patterns and context, treating consecutive values as separate tokens and disregarding their temporal relationships. This article discusses the challenges of representing and tokenizing temporal data. It argues that naively passing timeseries to LLMs can be ineffective due to the modality gap between numbers and text. We conduct a case study by tokenizing a sample mobile sensing dataset using the OpenAI tokenizer. We also review recent works that feed timeseries data into LLMs for human-centric tasks, outlining common experimental setups like zero-shot prompting and few-shot learning. The case study shows that popular LLMs split timestamps and sensor values into multiple nonmeaningful tokens, indicating they struggle with temporal data. We find that preliminary works rely heavily on prompt engineering and timeseries aggregation to "ground" LLMs, hinting that the "modality gap" hampers progress. The literature was critically analyzed through the lens of models optimizing for expressiveness versus parameter efficiency. On one end of the spectrum, training large domain-specific models from scratch is expressive but not parameter-efficient. On the other end, zero-shot prompting of LLMs is parameter-efficient but lacks expressiveness for temporal data. We argue tokenizers are not optimized for numerical data, while the scarcity of timeseries examples in training corpora exacerbates difficulties. We advocate balancing model expressiveness and computational efficiency when integrating temporal data. Prompt tuning, model grafting, and improved tokenizers are highlighted as promising directions. We underscore that despite promising capabilities, LLMs cannot meaningfully process temporal data unless the input representation is addressed. We argue that this paradigm shift in how we leverage pretrained models will particularly affect the area of biomedical signals, given the lack of modality-specific foundation models.
- Research Article
- 10.28945/5693
- Jan 1, 2026
- Journal of Information Technology Education: Research
Aim/Purpose: The study investigates the factors influencing the acceptance and utilisation of large language models (LLMs) (predictor variables of LLM usage), such as ChatGPT, in Learning design by instructional designers and university-teaching academics from various countries. Background: Large language models (LLMs) have exploded onto the scene, transforming the landscape of learning design. Instructional designers and university teaching academics have been overburdened with content creation for their teaching programmes, and the arrival of LLM models will help in this regard by developing more interactive content that drives student engagement and, in turn, contributes to student success. Since LLMs are a relatively new phenomenon, little is known about the factors influencing their acceptance in learning design; therefore, this research is needed, as learning design principles are the bedrock of student engagement and success. Methodology: A cross-sectional correlational quantitative study was employed. Data was collected using an online questionnaire posted on social media, including LinkedIn, from 203 instructional designers and university teaching academics. Purposive and snowball sampling methods were used to target instructional designers and university teaching academics at colleges and universities worldwide. Participants were asked to share the survey link with fellow instructional designers and university-teaching academics in their communities. The factor structure of the data was determined using exploratory factor analysis. Nonetheless, the factor structure derived from the LLMs did not entirely reflect the original configuration of the Unified Theory of Acceptance and Use of Technology (UTAUT3), as certain predictors appeared to coalesce, indicating LLMs’ unique nature in learning design. Confirmatory factor analysis was used to verify the fit of the data on the measurement model. First-order and second-order structural modelling were used to identify the structural relationships among the variables. Contribution: The study determines significant factors for the acceptance of LLMs by instructional designers and academic teaching staff in learning design, enabling possible opportunities for best practices in the field through interventions to optimize LLM usage. The study applies the technology acceptance model to the emerging LLM technology and extends the technology acceptance model by adding the trust construct as a predictor variable. Findings: The structural analysis results indicated that the ingrained LLM practices, LLM peer-driven expectations, innovative propensity towards LLM adoption, reliability and provider trust in LLMs, and ease of use and support influenced perceived LLM benefits and usage, but community standards and infrastructure had no influence. The second-order structural equation modelling indicated that perceived LLM benefits and usage and ingrained LLM habits contributed most to the learning design. Recommendations for Practitioners: Teaching academics and instructional designers must use LLMs in designing content, assessments, and interactive learning activities, and attend LLM training workshops on prompting and best practices in integrating LLMs into learning and teaching to see their benefits; hence, regular use of LLMs will then lead to trust and innovation in LLMs usage, enhancing learning design and improving student learning outcomes. Recommendation for Researchers: Researchers must use mixed methods approaches to have a deeper understanding of the factors influencing LLMs. Since habit and perceived LLM benefits and usage contributed the most variance to learning design, researchers must investigate strategies that optimise these factors in learning design, such as effective intervention strategies that can help form positive LLM habits. In addition, the findings provide researchers with a starting point for future research. Further researchers must investigate interventions that optimise the influence of personal innovativeness and trust that contributed the least variance to learning design, hence unlocking the potential of LLMs in learning design through innovation, responsible, and ethical use. Impact on Society: The use of LLMs in learning design has a high possibility of transforming education, specifically the learning design landscape. Using LLMs will free up more time for teaching academics and instructional designers so that they spend more time on higher-order thinking skill demands. Consequently, the students will be exposed to more engaging and interactive content, resulting in improved learning outcomes. Future Research: Future research must include context-derived external variables in technology acceptance models, such as levels of prompting competencies, to provide a deeper understanding of LLMs. In addition, future research must be based on the application and impact of LLMs on student engagement and success, and their attainment of 21st-century skills.
- Research Article
1
- 10.2196/64723
- Oct 15, 2025
- JMIR Formative Research
BackgroundIn the digital age, social media has become a crucial platform for public discourse on diverse health-related topics, including vaccines. Efficient sentiment analysis and hesitancy detection are essential for understanding public opinions and concerns. Large language models (LLMs) offer advanced capabilities for processing complex linguistic patterns, potentially providing valuable insights into vaccine-related discourse.ObjectiveThis study aims to evaluate the performance of various LLMs in sentiment analysis and hesitancy detection related to vaccine discussions on social media and identify the most efficient, accurate, and cost-effective model for detecting vaccine-related public sentiment and hesitancy trends.MethodsWe used several LLMs—generative pretrained transformer (GPT-3.5), GPT-4, Claude-3 Sonnet, and Llama 2—to process and classify complex linguistic data related to human papillomavirus; measles, mumps, and rubella; and vaccines overall from X (formerly known as Twitter), Reddit, and YouTube. The models were tested across different learning paradigms: zero-shot, 1-shot, and few-shot to determine their adaptability and learning efficiency with varying amounts of training data. We evaluated the models’ performance using accuracy, F1-score, precision, and recall. In addition, we conducted a cost analysis focused on token usage to assess the computational efficiency of each approach.ResultsGPT-4 (F1-score=0.85 and accuracy=0.83) outperformed GPT-3.5, Llama 2, and Claude-3 Sonnet across various metrics, regardless of the sentiment type or learning paradigm. Few-shot learning did not significantly enhance performance compared with the zero-shot paradigm. Moreover, the increased computational costs and token usage associated with few-shot learning did not justify its application, given the marginal improvement in model performance. The analysis highlighted challenges in classifying neutral sentiments and convenience, correctly interpreting sarcasm, and accurately identifying indirect expressions of vaccine hesitancy, emphasizing the need for model refinement.ConclusionsGPT-4 emerged as the most accurate model, excelling in sentiment and hesitancy analysis. Performance differences between learning paradigms were minimal, making zero-shot learning preferable for its balance of accuracy and computational efficiency. However, the zero-shot GPT-4 model is not the most cost-effective compared with traditional machine learning. A hybrid approach, using LLMs for initial annotation and traditional models for training, could optimize cost and performance. Despite reliance on specific LLM versions and a limited focus on certain vaccine types and platforms, our findings underscore the capabilities and limitations of LLMs in vaccine sentiment and hesitancy analysis, highlighting the need for ongoing evaluation and adaptation in public health communication strategies.
- Research Article
- 10.1016/j.ijmedinf.2025.106230
- Mar 1, 2026
- International journal of medical informatics
Named entity recognition (NER) is critical in natural language processing (NLP), particularly in the medical field, where accurate identification of entities, such as patient information and clinical events, is essential. Traditional NER approaches rely heavily on large, annotated corpora, which are resource intensive. Large language models (LLMs) offer new NER approaches, particularly through in-context and few-shot learning. This study investigates the effects of incorporating annotation guidelines into prompts for NER via LLMs, with a specific focus on their impact on few-shot learning performance across various medical corpora. We designed eight different prompt patterns, combining few-shot examples with annotation guidelines of varying complexity, and evaluated their performance via three prominent LLMs: GPT-4o, Claude 3.5 Sonnet, and gpt-oss-120b. Additionally, we employed three diverse medical corpora: i2b2-2014, i2b2-2012, and MedTxt-CR. Accuracy was assessed via precision, recall, and the F1 score, with evaluation methods aligned with those used in relevant shared tasks to ensure the comparability of the results. Our findings indicate that adding detailed annotation guidelines to few-shot prompts improves the recall and F1 score in most cases. Including annotation guidelines in prompts enhances the performance of LLMs in NER tasks, making this a practical approach for developing accurate NLP systems in resource-constrained environments. Although annotation guidelines are essential for evaluation and example creation, their integration into LLM prompts can further optimize few-shot learning, especially within specialized domains such as medical NLP.
- Conference Article
1
- 10.1145/3696630.3728560
- Jun 23, 2025
Aerospace software presents significant challenges to requirements engineering due to its design complexity and stringent safety standards. When manually drafting requirement documents, engineers need strong domain knowledge while also navigating heterogeneous data, which leads to errors and inefficiencies. This paper evaluates the capabilities of large language models (LLMs) in understanding aerospace software requirements and their potential to assist in requirements question answering (QA). We develop an aerospace requirements QA benchmark based on industrial software assets, books, and research materials, creating a total of 6, 696 QA pairs across ten tasks and three heterogeneous data formats: text, tables, and formulas. We then evaluate the domain-specific performance of five mainstream open-source LLMs using zero-shot learning, few-shot learning, and retrieval-augmented generation (RAG) techniques. We further categorize hallucinations from LLMs and quantitatively analyze error distributions. Moreover, we conduct a user study to assess the LLM's practical usefulness when applying to requirements QA. The evaluation results show that (1) LLMs demonstrate limited performance in the aerospace software domain, (2) RAG techniques significantly enhance the capabilities of LLMs for text-based tasks, while few-shot learning improves the performance of most LLMs, (3) four distinct types of QA hallucinations are identified, and (4) LLM QA is particularly beneficial for junior engineers. This research provides valuable perspectives for the future application of LLMs in aerospace software.
- Research Article
- 10.3348/kjr.2025.1045
- Jan 1, 2026
- Korean journal of radiology
To evaluate the accuracy and reasoning capabilities of large multimodal language models compared with those of neuroradiology subspecialty-trained radiologists in neuroradiology case interpretation. This experimental study used custom-made 401 radiologic quizzes derived from articles published in RadioGraphics covering neuroradiology and head and neck topics (October 2020 to February 2024). We prompted the GPT-4 Turbo with Vision (GPT-4V), GPT-4 Omni, Gemini Flash, and Claude models to provide the top three differential diagnoses with a rationale and describe examination characteristics such as imaging modality, sequence, use of contrast, image plane, and body part. The temperature was adjusted to 0 and 1 (T1). Two neuroradiologists answered the same questions. The accuracies of the large language models (LLMs) and the neuroradiologists were compared using generalized estimating equations. Three neuroradiologists assessed the rationale provided by the LLMs for their differential diagnoses using four-point scales, separately for specific lesion locations and imaging findings, and evaluated the presence of hallucinations and the overall acceptability of the responses. Top-3 accuracy (i.e., correct answers present among top-3 differential diagnoses) of LLMs ranged from 29.9% (120 of 401) to 49.4% (198 of 401, obtained with GPT-4V in the T1 setting), while radiologists achieved 80.3% (322 of 401) and 68.3% (274 of 401), respectively (P < 0.001). Regarding the rationale for differential diagnoses, GPT-4V (T1) accurately identified both the specific lesion location and imaging findings in 30.7% (123 of 401) and 12.9% (16 of 124) of cases without textual clinical history. Hallucinations occurred in 4.5% (18 of 401), and only 29.4% (118 of 401) of the LLM-generated analyses were deemed acceptable. GPT-4V (T1) demonstrated high accuracy in identifying the imaging modality (97.4% [800 of 821]) and scanned body parts (92.2% [756 of 820]). LLMs remarkably underperformed compared with neuroradiologists and showed unsatisfactory reasoning for their differential diagnoses, with performance declining further in cases without textual input of clinical history. These findings highlight the limitations of current multimodal LLMs in neuroradiological interpretation and their reliance on text input.