HyperPlace: Harnessing a Large Language Model for Efficient Hyperparameter Optimization in GPU-Accelerated VLSI Placement
While GPU-based placers have demonstrated significant speed advantages over their CPU-based counterparts, hyperparameter tuning remains a bottleneck, often requiring substantial human intervention and expert knowledge. This challenge is particularly critical given the urgent need for rapid time-to-market solutions. Recently, Large Language Models (LLMs) have exhibited remarkable capabilities in zero-shot learning, context understanding, logical reasoning, and answer generation. In this work, we introduce HyperPlace, an innovative paradigm that leverages an off-the-shelf LLM to automate hyperparameter optimization using in-context learning techniques. Our approach transcends single-output black-box optimization methods by incorporating a batch optimization mechanism that evaluates multiple hyperparameter configurations simultaneously across several GPU computing platforms. We validated the effectiveness of our approach in placement quality, measured by Half-Perimeter Wire Length (HPWL), using DREAMPlace 2.0. To further demonstrate the capability of integrating our framework with other placers, we conducted additional experiments using Xplace 2.0. By employing the ISPD2005 benchmarks for our evaluation, HyperPlace enhances the placement tools with up to a 1.66% reduction in HPWL compared to their published results. Additionally, we evaluated HyperPlace on the ISPD2015 benchmarks, which incorporate fence region constraints not present in ISPD2005 benchmarks. Under these more complex constraints, HyperPlace achieves up to a 22.24% reduction in HPWL compared to the default settings of the placement tools, further demonstrating its adaptability across diverse placement scenarios and benchmark suites.
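The quality metric named above, Half-Perimeter Wire Length, is conventionally the sum over all nets of the half-perimeter of the bounding box that encloses each net's pins. The following is a minimal Python sketch of that definition and of a percentage reduction such as the figures quoted; the function names and the toy netlist are illustrative assumptions and are not taken from DREAMPlace or Xplace.

```python
from typing import Dict, List, Tuple

Point = Tuple[float, float]  # (x, y) pin location


def net_hpwl(pins: List[Point]) -> float:
    """Half-perimeter of the bounding box enclosing one net's pins."""
    xs = [x for x, _ in pins]
    ys = [y for _, y in pins]
    return (max(xs) - min(xs)) + (max(ys) - min(ys))


def total_hpwl(nets: Dict[str, List[Point]]) -> float:
    """Sum of per-net HPWL; nets with fewer than two pins contribute zero."""
    return sum(net_hpwl(pins) for pins in nets.values() if len(pins) > 1)


def hpwl_reduction_percent(baseline: float, tuned: float) -> float:
    """Relative HPWL reduction of a tuned run against a baseline, in percent."""
    return 100.0 * (baseline - tuned) / baseline


# Toy netlist (hypothetical coordinates, for illustration only).
toy_nets = {
    "n1": [(0.0, 0.0), (3.0, 4.0)],
    "n2": [(1.0, 1.0), (2.0, 5.0), (4.0, 2.0)],
}
print(total_hpwl(toy_nets))
```

A hyperparameter search of the kind described would treat `total_hpwl` of the final placement as the objective to minimize.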
- Conference Article
8
- 10.2118/217671-ms
- Feb 27, 2024
Finding information across multiple databases, formats, and documents remains a manual job in the drilling industry. Large Language Models (LLMs) have proven effective in data-aggregation tasks, including answering questions. However, using LLMs for domain-specific factual responses poses a nontrivial challenge. The expert labor cost for training domain-specific LLMs prohibits niche industries from developing custom question-answering bots. This paper tests several commercial LLMs for information retrieval tasks for drilling data using zero-shot in-context learning. In addition, we studied the model’s calibration using a few-shot multiple-choice drilling questionnaire. To create an LLM benchmark for drilling, we collated the text data from publicly available databases: the Norwegian Petroleum Directorate (NPD), company annual reports, and a petroleum glossary. We used a zero-shot learning technique that relies on an LLM’s ability to generate responses for tasks outside its training. We implemented a controlled zero-shot learning "in-context" procedure that sends a user’s query augmented with text data to the LLM as inputs. This implementation encourages the LLM to take the answer from the data while leveraging its pre-trained contextual-learning capability. We evaluated several state-of-the-art generic LLMs available through an API, including G4, G3.5-TI, J2-ultra model, and L2 series. The paper documents the pre-trained LLMs’ ability to provide correct answers and identify petroleum industry jargon from the collated dataset. Our zero-shot in-context learning implementation helps vanilla LLMs provide relevant factual responses for the drilling domain. While each LLM’s performance varies, we have identified models suitable for a drilling chatbot application. In particular, G4 outperformed the other models on all tasks. This finding suggests that training expensive domain-specific LLMs is not necessary for question-answering tasks in the context of drilling data.
We demonstrate the utility of zero-shot in-context learning using pre-trained LLMs for question-answering tasks relevant to the drilling industry. Additionally, we prepared and publicly released the collated datasets from the NPD database and companies’ annual reports to enable results reproducibility and to foster acceleration of language model adoption and development for the subsurface and drilling industries. The petroleum industry may find our solution beneficial for enhancing personnel training and career development. It also offers a method for conducting data analytics and overcoming challenges in retrieving historical well data.
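The "in-context" procedure described above, in which the user's query is sent to the LLM together with supporting text, can be sketched as a prompt-assembly step. This is only an illustration under stated assumptions: the instruction wording, the function name `build_in_context_prompt`, and the sample passage are hypothetical and are not the authors' prompt or data, and no specific LLM API is called.

```python
from typing import List


def build_in_context_prompt(query: str, passages: List[str]) -> str:
    """Augment a user query with text passages for zero-shot in-context answering.

    The instruction text is an assumption; it simply encourages the model to
    take its answer from the supplied data rather than from its own memory.
    """
    context = "\n\n".join(
        f"[{i}] {passage.strip()}" for i, passage in enumerate(passages, start=1)
    )
    return (
        "Answer the question using only the context below. "
        "If the context does not contain the answer, say so.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {query.strip()}\n"
        "Answer:"
    )


# Hypothetical passage and query, for illustration only.
example_prompt = build_in_context_prompt(
    "What was the total depth of the well?",
    ["Placeholder wellbore-history text stating the well's total depth."],
)
print(example_prompt)
```

The resulting string would be passed unchanged to whichever model is being evaluated; no model parameters are updated, which is what makes the setting zero-shot.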
- Research Article
1
- 10.2118/0125-0092-jpt
- Jan 1, 2025
- Journal of Petroleum Technology
_This article, written by JPT Technology Editor Chris Carpenter, contains highlights of paper SPE 217671, “Enhancing Information Retrieval in the Drilling Domain: Zero-Shot Learning With Large Language Models for Question Answering,” by Felix J. Pacis, SPE, University of Stavanger, and Sergey Alyaev and Gilles Pelfrene, SPE, NORCE, et al. The paper has not been peer reviewed._
Finding information across multiple databases, formats, and documents remains a manual job in the drilling industry. Large language models (LLMs) have proven effective in data-aggregation tasks, including answering questions. However, using LLMs for domain-specific factual responses poses a nontrivial challenge. The expert-labor cost for training domain-specific LLMs prohibits niche industries from developing custom question-answering bots. The complete paper tests several commercial LLMs for information-retrieval tasks for drilling data using zero-shot in-context learning. In addition, the model’s calibration is tested with a few-shot multiple-choice drilling questionnaire.
Introduction
While LLMs have proven effective in various tasks ranging from sentiment analysis to text completion, using LLMs for question-answering tasks presents a challenge in providing factual responses. Pretrained LLMs only serve as a parameterized implicit knowledge base and cannot access recent data; thus, information is bounded by the time of training. Retrieval augmented generation (RAG) can address some of these issues by extending the utility of LLMs to specific data sources. Fig. 1 shows a simplified RAG-based LLM question/answer application. RAG involves two primary components: document retrieval (green boxes), which retrieves the most relevant context based on the query, and LLM response generation (blue boxes). During response generation, the LLM operates based on the prompt, query, and retrieved context without any change in the model parameters, a process the authors term “in-context learning.”
Methodology
Two experiments were conducted: the first is a few-shot multiple-choice experiment evaluated using the SLB drilling glossary; the second is a zero-shot in-context experiment evaluated on drilling reports and company reports.
Multiple-Choice Experiment. SLB Drilling Glossary. For the multiple-choice experiment, a publicly available drilling glossary served as a basis for evaluation. A total of 409 term/definition pairs were considered. Five term/definition pairs were chosen, serving as few-shot default values, while the remaining 404 pairs served as the multiple-choice questions. Four choices were given for each term/definition question pair, where one was the correct answer. The three incorrect choices were picked randomly from all possible terms minus the true answer.
Zero-Shot In-Context Experiment. Norwegian Petroleum Directorate (NPD) Database. The authors explored the wellbore history of all individual exploration wells drilled in the Norwegian shelf in the NPD database. In this experiment, 12 exploration wells were randomly chosen for evaluation. In addition to these drilling reports, information about the stratigraphy of three additional wells was added.
Annual Reports. Annual reports of two major operators in Norway for 2020 and 2021 also were considered. These consisted of short summaries that presented the main operational and economic results achieved by the company throughout the year. These reports were added to the evaluation to balance the higher technical content of the wellbore-history reports.
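The construction of the multiple-choice questionnaire described above (409 term/definition pairs, five held out as few-shot examples, 404 questions with one correct term and three random distractors) can be sketched as follows. This is a minimal sketch under assumptions: the article does not say whether the question shows the definition and asks for the term or the reverse, so the definition-to-term direction, the function name `build_questions`, the fixed random seed, and the synthetic glossary are all illustrative choices.

```python
import random
from typing import Dict, List, Tuple


def build_questions(
    glossary: Dict[str, str],
    n_few_shot: int = 5,
    n_choices: int = 4,
    seed: int = 0,
) -> Tuple[List[Tuple[str, str]], List[dict]]:
    """Split a glossary into few-shot examples and multiple-choice questions."""
    rng = random.Random(seed)
    terms = sorted(glossary)
    rng.shuffle(terms)

    # The first few shuffled terms become the few-shot examples.
    few_shot = [(term, glossary[term]) for term in terms[:n_few_shot]]

    questions = []
    for term in terms[n_few_shot:]:
        # Distractors come from all terms except the true answer.
        pool = [other for other in terms if other != term]
        choices = rng.sample(pool, n_choices - 1) + [term]
        rng.shuffle(choices)
        questions.append(
            {"definition": glossary[term], "choices": choices, "answer": term}
        )
    return few_shot, questions


# Synthetic stand-in for the glossary (409 placeholder pairs).
demo_glossary = {f"term{i}": f"definition {i}" for i in range(409)}
demo_few_shot, demo_questions = build_questions(demo_glossary)
print(len(demo_few_shot), len(demo_questions))
```

With four choices per question, a model guessing uniformly at random would be expected to score about 25%, which gives a baseline against which reported accuracies can be read.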
- Research Article
12
- 10.1007/s41666-025-00190-z
- Feb 20, 2025
- Journal of Healthcare Informatics Research
Information extraction (IE) from unstructured electronic health records is challenging due to the semantic complexity of textual data. Generative large language models (LLMs) offer promising solutions to address this challenge. However, identifying the best training methods to adapt LLMs for IE in residential aged care settings remains underexplored. This research addresses this challenge by evaluating the effects of zero-shot and few-shot learning, both with and without parameter-efficient fine-tuning (PEFT) and retrieval-augmented generation (RAG), using Llama 3.1-8B. The study performed named entity recognition (NER) on nursing notes from Australian residential aged care facilities (RACFs), focusing on agitation in dementia and malnutrition risk factors. Performance evaluation includes accuracy, macro-averaged precision, recall, and F1 score. We used non-parametric statistical methods to test whether the differences were statistically significant. Results show that zero-shot and few-shot learning, whether combined with PEFT or RAG, achieve comparable performance across the clinical domains when the same prompting template is used. Few-shot learning significantly outperforms zero-shot learning when neither PEFT nor RAG is applied. Notably, PEFT significantly improves model performance in both zero-shot and few-shot learning; however, RAG significantly improves performance only in few-shot learning. After PEFT, the performance of zero-shot learning reaches a comparable level with few-shot learning. However, few-shot learning with RAG significantly outperforms zero-shot learning with RAG. We also found a similar level of performance between few-shot learning with RAG and zero-shot learning with PEFT. These findings provide valuable insights for researchers, practitioners, and stakeholders to optimize the use of generative LLMs in clinical IE.
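The evaluation metrics listed above can be made concrete with a short Python sketch. It assumes the common convention in which macro-averaged precision, recall, and F1 are the unweighted means of the per-class values; the abstract does not state which convention the study used, and the labels in the example are hypothetical.

```python
from typing import List, Tuple


def macro_prf(y_true: List[str], y_pred: List[str]) -> Tuple[float, float, float]:
    """Macro-averaged precision, recall, and F1 over all observed labels."""
    labels = sorted(set(y_true) | set(y_pred))
    precisions, recalls, f1s = [], [], []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        # Undefined ratios (zero denominators) are scored as 0.0 here.
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        precisions.append(precision)
        recalls.append(recall)
        f1s.append(f1)
    n = len(labels)
    return sum(precisions) / n, sum(recalls) / n, sum(f1s) / n


def accuracy(y_true: List[str], y_pred: List[str]) -> float:
    """Share of items whose predicted label matches the true label."""
    return sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)


# Hypothetical token labels, for illustration only.
gold = ["AGITATION", "AGITATION", "MALNUTRITION", "O"]
pred = ["AGITATION", "MALNUTRITION", "MALNUTRITION", "O"]
print(accuracy(gold, pred), macro_prf(gold, pred))
```

Macro averaging weights every class equally, so rare entity types influence the score as much as frequent ones.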
- Conference Article
- 10.1145/3711875.3729128
- Jun 23, 2025
While large language models (LLMs) are endowed with broad knowledge, their task-specific performance is often suboptimal. Fine-tuning LLMs with task-specific data from diverse nodes is necessary, but this data is typically safeguarded and not shared publicly due to privacy concerns. A common solution involves downstream nodes downloading the LLM locally and fine-tuning it with their proprietary data. However, owners often regard pre-trained LLMs as valuable assets and are reluctant to share them. Additionally, the significant computational resources required by LLMs make local fine-tuning impractical for many nodes. To mitigate these problems, this paper proposes CrossLM, a data-free collaborative fine-tuning framework for large and small language models. CrossLM enables resource-constrained nodes to train smaller language models (SLMs) using their private task-specific data. These SLMs are subsequently leveraged to promote the task-specific natural language generation and understanding capabilities of the LLMs. Simultaneously, the SLMs of nodes also benefit from enhancement by the fine-tuned LLMs. In this way, CrossLM avoids sharing private data and proprietary LLMs, and also reduces the resource requirements of nodes. Through extensive experiments across a range of benchmark tasks and popular language models, we demonstrate that CrossLM significantly boosts the task-specific performance of both LLMs and SLMs while preserving the generalization capabilities of LLMs.
- Research Article
4
- 10.1007/s00117-025-01416-2
- Feb 21, 2025
- Radiologie (Heidelberg, Germany)
Given the increasing number of radiological examinations, large language models (LLMs) offer promising support in radiology. Optimized interaction is essential to ensure reliable results. This article provides an overview of interaction techniques such as prompt engineering, zero-shot learning, and retrieval-augmented generation (RAG) and gives practical tips for their application in radiology. Demonstration of interaction techniques based on practical examples with concrete recommendations for their application in routine radiological practice. Advanced interaction techniques allow task-specific adaptation of LLMs without the need for retraining. The creation of precise prompts and the use of zero-shot and few-shot learning can significantly improve response quality. RAG enables the integration of current and domain-specific information into LLM tools, increasing the accuracy and relevance of the generated content. The use of prompt engineering, zero-shot and few-shot learning, and RAG can optimize interaction with LLMs in radiology. Through these targeted strategies, radiologists can efficiently integrate general chatbots into routine practice to improve patient care.
- Research Article
20
- 10.1016/j.ipm.2024.103973
- Dec 3, 2024
- Information Processing and Management
Are large language models qualified reviewers in originality evaluation?
- Research Article
9
- 10.1016/j.artmed.2025.103268
- Dec 1, 2025
- Artificial Intelligence in Medicine
A survey for large language models in biomedicine.
- Conference Article
32
- 10.1145/3589334.3645627
- May 13, 2024
Recently, large language models (LLMs) have demonstrated superior capabilities in understanding and zero-shot learning on textual data, promising significant advances for many text-related domains. In the graph domain, various real-world scenarios also involve textual data, where tasks and node features can be described by text. These text-attributed graphs (TAGs) have broad applications in social media, recommendation systems, etc. Thus, this paper explores how to utilize LLMs to model TAGs. Previous methods for TAG modeling are based on million-scale LMs. When scaled up to billion-scale LLMs, they face huge challenges in computational costs. Additionally, they ignore the zero-shot inference capabilities of LLMs. Therefore, we propose GraphAdapter, which uses a graph neural network (GNN) as an efficient adapter in collaboration with LLMs to tackle TAGs. In terms of efficiency, the GNN adapter introduces only a few trainable parameters and can be trained with low computation costs. The entire framework is trained using auto-regression on node text (next token prediction). Once trained, GraphAdapter can be seamlessly fine-tuned with task-specific prompts for various downstream tasks. Through extensive experiments across multiple real-world TAGs, GraphAdapter based on Llama 2 gains an average improvement of approximately 5% in terms of node classification. Furthermore, GraphAdapter can also adapt to other language models, including RoBERTa and GPT-2. The promising results demonstrate that GNNs can serve as effective adapters for LLMs in TAG modeling.
- Research Article
4
- 10.1001/jamanetworkopen.2025.12032
- May 22, 2025
- JAMA Network Open
An estimated half of all long-term care facility (LTCF) residents are colonized with antimicrobial-resistant organisms, and early identification of these patients on admission to acute care hospitals is a core strategy for preventing intrahospital spread. However, because LTCF exposure is not reliably captured in structured electronic health record data, LTCF-exposed patients routinely go undetected. Large language models (LLMs) offer a promising, but untested, opportunity for extracting this information from patient admission histories. To evaluate the performance of an LLM against human review for identifying recent LTCF exposure from identifiable patient admission histories. This cross-sectional, multicenter study used the history and physical (H&P) notes from unique, randomly sampled adult admissions occurring between January 1, 2016, and December 31, 2021, at 13 hospitals in the University of Maryland Medical System (UMMS) and the Johns Hopkins (Hopkins) health care system to compare the performance of an LLM (GPT-4-Turbo) using zero-shot learning and prompting against humans in identifying patients with recent LTCF exposure. LLM analyses were conducted from August to September 2024. Recent (≤12 months) LTCF exposure documented in the H&P note, as adjudicated by (1) humans and (2) an LLM. LLM sensitivity and specificity with Clopper-Pearson 95% CIs. Secondary outcomes were note review time and cost. The LLM was also prompted to provide a rationale and supporting note-text for each classification. The study included 359 601 eligible adult admissions, of which 2087 randomly sampled H&P notes were manually reviewed at UMMS (1020 individuals; median [IQR] age, 58 [41-71] years; 493 [48%] male) and Hopkins (1067 individuals; median [IQR] age, 58 [48-67] years; 561 [53%] male) for LTCF residence.
Compared with human review, the LLM achieved a sensitivity of 97% (95% CI, 91%-100%) and a specificity of 98% (95% CI, 97%-99%) at UMMS, and 96% (95% CI, 86%-100%) and 93% (95% CI, 92%-95%) sensitivity and specificity, respectively, at Hopkins; specificity at Hopkins improved with prompt revision (96% [95% CI, 95%-97%]). Of 117 manually reviewed LLM rationales, all were factually correct and quoted note-text accurately, and some demonstrated inferential logic and external knowledge. The LLM identified 37 (1.8%) human errors. Human review time had a mean of 2.5 minutes and cost $0.63 to $0.83 per note vs a mean of 4 to 6 seconds and $0.03 per note for LLM review. In this 13-hospital study of 2087 adult admissions, an LLM accurately identified LTCF residence from H&P notes and was more than 25 times faster and 20 times less expensive than human review.
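The closing speed and cost comparison can be checked against the per-note figures reported above. The short Python sketch below only redoes that arithmetic from the quoted numbers; the variable and function names are illustrative.

```python
# Figures quoted in the abstract above.
HUMAN_MINUTES_PER_NOTE = 2.5
LLM_SECONDS_PER_NOTE = (4.0, 6.0)    # reported range
HUMAN_COST_PER_NOTE = (0.63, 0.83)   # reported range, USD
LLM_COST_PER_NOTE = 0.03             # USD
HUMAN_ERRORS_FOUND = 37
NOTES_REVIEWED = 2087


def ratio_range(numerators, denominators):
    """Smallest and largest ratio over all numerator/denominator pairs."""
    values = [n / d for n in numerators for d in denominators]
    return min(values), max(values)


human_seconds = HUMAN_MINUTES_PER_NOTE * 60
speedup = ratio_range([human_seconds], LLM_SECONDS_PER_NOTE)
cost_ratio = ratio_range(HUMAN_COST_PER_NOTE, [LLM_COST_PER_NOTE])
error_share = 100 * HUMAN_ERRORS_FOUND / NOTES_REVIEWED

print(f"speedup: {speedup[0]:.1f}x to {speedup[1]:.1f}x")
print(f"cost ratio: {cost_ratio[0]:.1f}x to {cost_ratio[1]:.1f}x")
print(f"human errors flagged: {error_share:.1f}% of notes")
```

On these numbers the LLM is between 25 and 37.5 times faster and roughly 21 to 28 times cheaper per note, which is in line with the summary statement, and 37 of 2087 notes corresponds to the quoted 1.8%.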
- Research Article
2
- 10.1609/aaai.v39i1.32046
- Apr 11, 2025
- Proceedings of the AAAI Conference on Artificial Intelligence
Automated Program Repair (APR) for introductory programming assignments (IPAs) is motivated by the large number of student enrollments in programming courses each year. Since providing feedback on programming assignments requires substantial time and effort from faculty, personalized automated feedback often involves suggesting repairs to students' programs. Symbolic semantic repair approaches, which rely on Formal Methods (FM) and check a program's execution against a test suite or reference solution, are effective but limited. These tools excel at identifying buggy parts but can only fix programs if the correct implementation and the faulty one share the same control flow graph. Conversely, Large Language Models (LLMs) are used for program repair but often make extensive rewrites instead of minimal adjustments. This tends to lead to more invasive fixes, making it harder for students to learn from their mistakes. In summary, LLMs excel at completing strings, while FM-based fault localization excels at identifying buggy parts of a program. In this paper, we propose a novel approach that combines the strengths of both FM-based fault localization and LLMs, via zero-shot learning, to enhance APR for IPAs. Our method uses MaxSAT-based fault localization to identify buggy parts of a program, then presents the LLM with a program sketch devoid of these buggy statements. This hybrid approach follows a Counterexample Guided Inductive Synthesis (CEGIS) loop to iteratively refine the program. We ask the LLM to synthesize the missing parts, which are then checked against a test suite. If the suggested program is incorrect, a counterexample from the test suite is fed back to the LLM for revised synthesis. Our experiments on 1,431 incorrect student programs show that our counterexample guided approach, using MaxSAT-based bug-free program sketches, significantly improves the repair capabilities of all six evaluated LLMs.
This method allows LLMs to repair more programs and produce smaller fixes, outperforming other configurations and state-of-the-art symbolic program repair tools.
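The counterexample-guided loop described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the authors' tool: the MaxSAT fault-localization step is assumed to have already produced a sketch with a `@HOLE@` marker, the LLM is replaced by a hypothetical stand-in function `fake_llm`, and candidate programs are run with `exec`, which is acceptable only for trusted toy inputs.

```python
from typing import Callable, List, Optional, Tuple

TestCase = Tuple[tuple, object]  # (arguments, expected result)


def first_counterexample(
    source: str, func_name: str, tests: List[TestCase]
) -> Optional[TestCase]:
    """Return the first failing test for a candidate program, or None if all pass."""
    namespace: dict = {}
    try:
        exec(source, namespace)  # toy setting only; never exec untrusted code
    except Exception:
        pass  # a candidate that does not load fails the first test below
    func = namespace.get(func_name)
    for args, expected in tests:
        try:
            if func(*args) != expected:
                return (args, expected)
        except Exception:
            return (args, expected)
    return None


def cegis_repair(
    sketch: str,
    func_name: str,
    tests: List[TestCase],
    synthesize: Callable[[str, List[TestCase]], str],
    max_iters: int = 5,
) -> Optional[str]:
    """Ask the synthesizer to fill the sketch, feeding back failing tests."""
    counterexamples: List[TestCase] = []
    for _ in range(max_iters):
        candidate = synthesize(sketch, counterexamples)
        failing = first_counterexample(candidate, func_name, tests)
        if failing is None:
            return candidate
        counterexamples.append(failing)
    return None


def fake_llm(sketch: str, counterexamples: List[TestCase]) -> str:
    """Hypothetical stand-in for an LLM: wrong at first, right after feedback."""
    fill = "a + b" if counterexamples else "a - b"
    return sketch.replace("@HOLE@", fill)


SKETCH = "def add(a, b):\n    return @HOLE@\n"
TESTS: List[TestCase] = [((1, 2), 3), ((0, 0), 0)]
repaired = cegis_repair(SKETCH, "add", TESTS, fake_llm)
print(repaired)
```

Because only the hole is regenerated, the rest of the student's program is left untouched, which mirrors the stated goal of producing smaller fixes.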
- Conference Article
6
- 10.1145/3627673.3679830
- Oct 21, 2024
Text-Attributed Graphs (TAGs) are graphs of connected textual documents. Graph models can efficiently learn TAGs, but their training heavily relies on human-annotated labels, which are scarce or even unavailable in many applications. Large language models (LLMs) have recently demonstrated remarkable capabilities in few-shot and zero-shot TAG learning, but they suffer from scalability, cost, and privacy issues. Therefore, in this work, we focus on synergizing LLMs and graph models with their complementary strengths by distilling the power of LLMs into a local graph model on TAG learning. To address the inherent gaps between LLMs (generative models for texts) and graph models (discriminative models for graphs), we propose first to let LLMs teach an interpreter with rich rationale and then let a student model mimic the interpreter's reasoning without LLMs' rationale. We convert LLM's textual rationales to multi-level graph rationales to train the interpreter model and align the student model with the interpreter model based on the features of TAGs. Extensive experiments validate the efficacy of our proposed framework.
- Research Article
18
- 10.1093/bib/bbae354
- Jul 25, 2024
- Briefings in Bioinformatics
Large language models (LLMs) are sophisticated AI-driven models trained on vast sources of natural language data. They are adept at generating responses that closely mimic human conversational patterns. One of the most notable examples is OpenAI's ChatGPT, which has been extensively used across diverse sectors. Despite their flexibility, a significant challenge arises as most users must transmit their data to the servers of companies operating these models. Utilizing ChatGPT or similar models online may inadvertently expose sensitive information to the risk of data breaches. Therefore, implementing LLMs that are open source and smaller in scale within a secure local network becomes a crucial step for organizations where ensuring data privacy and protection has the highest priority, such as regulatory agencies. As a feasibility evaluation, we implemented a series of open-source LLMs within a regulatory agency's local network and assessed their performance on specific tasks involving extracting relevant clinical pharmacology information from regulatory drug labels. Our research shows that some models work well in the context of few- or zero-shot learning, achieving performance comparable to, or even better than, that of neural network models that needed thousands of training samples. One of the models was selected to address a real-world issue of finding intrinsic factors that affect drugs' clinical exposure without any training or fine-tuning. In a dataset of over 700,000 sentences, the model showed a 78.5% accuracy rate. Our work pointed to the possibility of implementing open-source LLMs within a secure local network and using these models to perform various natural language processing tasks when large numbers of training examples are unavailable.
- Research Article
- 10.1016/j.jbi.2026.105034
- Mar 27, 2026
- Journal of Biomedical Informatics
A Study of Large Language Models for Patient Information Extraction: Model Architecture, Fine-Tuning Strategy, and Multi-task Instruction Tuning
- Research Article
43
- 10.1055/a-2264-5631
- Feb 26, 2024
- RoFo : Fortschritte auf dem Gebiete der Rontgenstrahlen und der Nuklearmedizin
Large language models (LLMs) such as ChatGPT have shown significant potential in radiology. Their effectiveness often depends on prompt engineering, which optimizes the interaction with the chatbot for accurate results. Here, we highlight the critical role of prompt engineering in tailoring the LLMs' responses to specific medical tasks. Using a clinical case, we elucidate different prompting strategies to adapt the LLM ChatGPT using GPT4 to new tasks without additional training of the base model. These approaches range from precision prompts to advanced in-context methods such as few-shot and zero-shot learning. Additionally, the significance of embeddings, which serve as a data representation technique, is discussed. Prompt engineering substantially improved and focused the chatbot's output. Moreover, embedding of specialized knowledge allows for more transparent insight into the model's decision-making and thus enhances trust. Despite certain challenges, prompt engineering plays a pivotal role in harnessing the potential of LLMs for specialized tasks in the medical domain, particularly radiology. As LLMs continue to evolve, techniques like few-shot learning, zero-shot learning, and embedding-based retrieval mechanisms will become indispensable in delivering tailored outputs.
· Large language models might impact radiological practice and decision-making.
· However, implementation and performance are dependent on the assigned task.
· Optimization of prompting strategies can substantially improve model performance.
· Strategies for prompt engineering range from precision prompts to zero-shot learning.
Russe MF, Reisert M, Bamberg F et al. Improving the use of LLMs in radiology through prompt engineering: from precision prompts to zero-shot learning. Fortschr Röntgenstr 2024; 196: 1166-1170.
- Research Article
3
- 10.26803/ijlter.23.12.9
- Dec 30, 2024
- International Journal of Learning, Teaching and Educational Research
A lot of hype has accompanied the increasing number of generative artificial intelligence-powered large language models (LLMs). Similarly, much has been written about what currently available LLMs can and cannot do, including their benefits and risks, especially in higher education. However, few use cases have investigated the performance and generative capabilities of LLMs in low-resource languages. With this in mind, one of the purposes of the current study was to explore the extent to which seven currently available, free-to-use versions of LLMs (ChatGPT, Claude, Copilot, Gemini, GroqChat, Perplexity, and YouChat) perform in five low-resource languages (isiZulu, Sesotho, Yoruba, Māori, and Mi’kmaq) in their generative multilingual capabilities. Employing a common input prompt, in which the only change was to insert the name of a given low-resource language and English in each case, this study collected its datasets by inputting this common prompt into the seven LLMs. Three of the findings of this study are noteworthy. First, the seven LLMs displayed a significant lack of generative multilingual capabilities in the five low-resource languages. Second, they hallucinated and produced nonsensical, meaningless, and irrelevant responses in their low-resource language outputs. Third, their English responses were far better in quality, relevance, depth, detail, and nuance than their low-resource language only and English responses for the five low-resource languages. The paper closes with the implications and conclusions of the study in terms of LLMs’ generative capabilities in low-resource languages.