Zero-Shot Learning With Large Language Models Enhances Drilling-Information Retrieval
_This article, written by JPT Technology Editor Chris Carpenter, contains highlights of paper SPE 217671, "Enhancing Information Retrieval in the Drilling Domain: Zero-Shot Learning With Large Language Models for Question Answering," by Felix J. Pacis, SPE, University of Stavanger, and Sergey Alyaev and Gilles Pelfrene, SPE, NORCE, et al. The paper has not been peer reviewed._

Finding information across multiple databases, formats, and documents remains a manual job in the drilling industry. Large language models (LLMs) have proven effective in data-aggregation tasks, including answering questions. However, using LLMs for domain-specific factual responses poses a nontrivial challenge, and the expert-labor cost of training domain-specific LLMs prohibits niche industries from developing custom question-answering bots. The complete paper tests several commercial LLMs on information-retrieval tasks for drilling data using zero-shot in-context learning. In addition, the models' calibration is tested with a few-shot multiple-choice drilling questionnaire.

Introduction

While LLMs have proven effective in tasks ranging from sentiment analysis to text completion, using them for question answering presents the challenge of providing factual responses. Pretrained LLMs serve only as a parameterized implicit knowledge base and cannot access recent data; their information is bounded by the time of training. Retrieval-augmented generation (RAG) can address some of these issues by extending the utility of LLMs to specific data sources. Fig. 1 shows a simplified RAG-based LLM question/answer application. RAG involves two primary components: document retrieval (green boxes), which retrieves the most relevant context based on the query, and LLM response generation (blue boxes). During response generation, the LLM operates on the prompt, query, and retrieved context without any change in the model parameters, a process the authors term "in-context learning."

Methodology

Two experiments were conducted: a few-shot multiple-choice experiment evaluated on the SLB drilling glossary, and a zero-shot in-context experiment evaluated on drilling reports and company annual reports.

Multiple-Choice Experiment. SLB Drilling Glossary. For the multiple-choice experiment, a publicly available drilling glossary served as the basis for evaluation. A total of 409 term/definition pairs were considered. Five term/definition pairs were chosen as few-shot default values, while the remaining 404 pairs served as the multiple-choice questions. Four choices were given for each term/definition question, of which one was the correct answer. The three incorrect choices were picked randomly from all possible terms minus the true answer (a randomized construction of this kind is sketched in the example below).

Zero-Shot In-Context Experiment. Norwegian Petroleum Directorate (NPD) Database. The authors explored the wellbore histories of all individual exploration wells drilled on the Norwegian shelf in the NPD database. In this experiment, 12 exploration wells were randomly chosen for evaluation. In addition to these drilling reports, information about the stratigraphy of three additional wells was added.

Annual Reports. Annual reports of two major operators in Norway for 2020 and 2021 were also considered. These consisted of short summaries presenting the main operational and economic results achieved by the company throughout the year. These reports were added to the evaluation to balance the higher technical content of the wellbore-history reports.
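To make the distractor sampling concrete, here is a minimal Python sketch of the multiple-choice construction described above. The toy glossary, field names, and question wording are illustrative assumptions, not the authors' implementation.

```python
import random

def build_multiple_choice(glossary: dict[str, str], n_choices: int = 4, seed: int = 0):
    """Turn term/definition pairs into multiple-choice questions.

    For each definition, the correct term is paired with distractors drawn
    at random from the remaining glossary terms.
    """
    rng = random.Random(seed)
    terms = list(glossary)
    questions = []
    for term, definition in glossary.items():
        distractors = rng.sample([t for t in terms if t != term], n_choices - 1)
        choices = distractors + [term]
        rng.shuffle(choices)
        questions.append({
            "question": f"Which term matches this definition? {definition}",
            "choices": choices,
            "answer": term,
        })
    return questions

# Toy glossary standing in for the 409-entry drilling glossary.
glossary = {
    "kick": "An unwanted influx of formation fluid into the wellbore.",
    "mud weight": "The density of the drilling fluid.",
    "casing": "Steel pipe cemented in the well to stabilize the borehole.",
    "tripping": "Pulling the drillstring out of or running it into the hole.",
}
for q in build_multiple_choice(glossary)[:1]:
    print(q["question"], q["choices"], q["answer"], sep="\n")
```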
- Conference Article
- 10.2118/217671-ms
- Feb 27, 2024
Finding information across multiple databases, formats, and documents remains a manual job in the drilling industry. Large Language Models (LLMs) have proven effective in data-aggregation tasks, including answering questions. However, using LLMs for domain-specific factual responses poses a nontrivial challenge. The expert labor cost of training domain-specific LLMs prohibits niche industries from developing custom question-answering bots. This paper tests several commercial LLMs on information-retrieval tasks for drilling data using zero-shot in-context learning. In addition, we studied the models' calibration using a few-shot multiple-choice drilling questionnaire. To create an LLM benchmark for drilling, we collated text data from publicly available sources: the Norwegian Petroleum Directorate (NPD) database, company annual reports, and a petroleum glossary. We used a zero-shot learning technique that relies on an LLM's ability to generate responses for tasks outside its training. We implemented a controlled zero-shot "in-context" learning procedure that sends a user's query, augmented with text data, to the LLM as input. This implementation encourages the LLM to take the answer from the data while leveraging its pre-trained contextual-learning capability. We evaluated several state-of-the-art generic LLMs available through an API, including G4, G3.5-TI, the J2-ultra model, and the L2 series. The paper documents the pre-trained LLMs' ability to provide correct answers and identify petroleum-industry jargon from the collated dataset. Our zero-shot in-context learning implementation helps vanilla LLMs provide relevant factual responses for the drilling domain. While each LLM's performance varies, we have identified models suitable for a drilling chatbot application; in particular, G4 outperformed the other models on all tasks. This finding suggests that training expensive domain-specific LLMs is not necessary for question-answering tasks in the context of drilling data. We demonstrate the utility of zero-shot in-context learning with pre-trained LLMs for question-answering tasks relevant to the drilling industry. Additionally, we prepared and publicly released the collated datasets from the NPD database and companies' annual reports to enable reproducibility of the results and to accelerate language-model adoption and development for the subsurface and drilling industries. The petroleum industry may find our solution beneficial for enhancing personnel training and career development. It also offers a method for conducting data analytics and overcoming challenges in retrieving historical well data.
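As a rough illustration of the zero-shot in-context procedure described here (query plus retrieved text sent to the model, with no parameter updates), below is a minimal sketch using the OpenAI Python client; the prompt wording and model identifier are assumptions, not the paper's exact setup.

```python
from openai import OpenAI  # any chat-completion API could be substituted

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def answer_from_context(query: str, context_chunks: list[str], model: str = "gpt-4") -> str:
    """Zero-shot in-context answering: the model sees only the prompt, the
    retrieved context, and the query; no model weights are updated."""
    context = "\n\n".join(context_chunks)
    prompt = (
        "Answer the question using only the context below. "
        "If the answer is not in the context, say you do not know.\n\n"
        f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"
    )
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return response.choices[0].message.content

# In this setting, context_chunks would hold passages retrieved from the NPD
# wellbore histories or an annual report relevant to the user's question.
```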
- Research Article
- 10.1007/s41666-025-00190-z
- Feb 20, 2025
- Journal of Healthcare Informatics Research
Information extraction (IE) from unstructured electronic health records is challenging due to the semantic complexity of textual data. Generative large language models (LLMs) offer promising solutions to address this challenge. However, identifying the best training methods to adapt LLMs for IE in residential aged care settings remains underexplored. This research addresses this challenge by evaluating the effects of zero-shot and few-shot learning, both with and without parameter-efficient fine-tuning (PEFT) and retrieval-augmented generation (RAG), using Llama 3.1-8B. The study performed named entity recognition (NER) on nursing notes from Australian residential aged care facilities (RACFs), focusing on agitation in dementia and malnutrition risk factors. Performance evaluation includes accuracy, macro-averaged precision, recall, and F1 score. We used non-parametric statistical methods to assess whether the differences were statistically significant. Results show that zero-shot and few-shot learning, whether combined with PEFT or RAG, achieve comparable performance across the clinical domains when the same prompting template is used. Few-shot learning significantly outperforms zero-shot learning when neither PEFT nor RAG is applied. Notably, PEFT significantly improves model performance in both zero-shot and few-shot learning; however, RAG significantly improves performance only in few-shot learning. After PEFT, zero-shot learning reaches a level of performance comparable with few-shot learning. However, few-shot learning with RAG significantly outperforms zero-shot learning with RAG. We also found a similar level of performance between few-shot learning with RAG and zero-shot learning with PEFT. These findings provide valuable insights for researchers, practitioners, and stakeholders seeking to optimize the use of generative LLMs in clinical IE.
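For readers unfamiliar with prompt-based NER, a minimal sketch of how zero-shot and few-shot prompts might be assembled for this kind of task follows; the entity labels, note text, and wording are invented for illustration and are not the study's protocol.

```python
def build_ner_prompt(note: str, labels: list[str],
                     exemplars: list[tuple[str, dict]] | None = None) -> str:
    """Build a zero-shot (no exemplars) or few-shot (with exemplars) NER prompt.

    The model is asked to return a JSON object mapping each label to the
    text spans found in the nursing note.
    """
    lines = [
        "Extract the following entity types from the nursing note: " + ", ".join(labels) + ".",
        "Return a JSON object with one key per entity type.",
    ]
    for example_note, example_entities in exemplars or []:
        lines += [f"Note: {example_note}", f"Entities: {example_entities}"]
    lines += [f"Note: {note}", "Entities:"]
    return "\n".join(lines)

labels = ["agitation_symptom", "malnutrition_risk_factor"]  # illustrative labels
zero_shot = build_ner_prompt("Resident refused lunch and paced the corridor.", labels)
few_shot = build_ner_prompt(
    "Resident refused lunch and paced the corridor.",
    labels,
    exemplars=[("Resident was shouting at staff.",
                {"agitation_symptom": ["shouting at staff"]})],
)
```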
- Research Article
- 10.1007/s00117-025-01416-2
- Feb 21, 2025
- Radiologie (Heidelberg, Germany)
Given the increasing number of radiological examinations, large language models (LLMs) offer promising support in radiology. Optimized interaction is essential to ensure reliable results. This article provides an overview of interaction techniques such as prompt engineering, zero-shot learning, and retrieval-augmented generation (RAG) and gives practical tips for their application in radiology. Demonstration of interaction techniques based on practical examples with concrete recommendations for their application in routine radiological practice. Advanced interaction techniques allow task-specific adaptation of LLMs without the need for retraining. The creation of precise prompts and the use of zero-shot and few-shot learning can significantly improve response quality. RAG enables the integration of current and domain-specific information into LLM tools, increasing the accuracy and relevance of the generated content. The use of prompt engineering, zero-shot and few-shot learning, and RAG can optimize interaction with LLMs in radiology. Through these targeted strategies, radiologists can efficiently integrate general chatbots into routine practice to improve patient care.
- Research Article
- 10.1001/jamanetworkopen.2025.12032
- May 22, 2025
- JAMA Network Open
An estimated half of all long-term care facility (LTCF) residents are colonized with antimicrobial-resistant organisms, and early identification of these patients on admission to acute care hospitals is a core strategy for preventing intrahospital spread. However, because LTCF exposure is not reliably captured in structured electronic health record data, LTCF-exposed patients routinely go undetected. Large language models (LLMs) offer a promising, but untested, opportunity for extracting this information from patient admission histories. To evaluate the performance of an LLM against human review for identifying recent LTCF exposure from identifiable patient admission histories. This cross-sectional, multicenter study used the history and physical (H&P) notes from unique, randomly sampled adult admissions occurring between January 1, 2016, and December 31, 2021, at 13 hospitals in the University of Maryland Medical System (UMMS) and the Johns Hopkins (Hopkins) health care system to compare the performance of an LLM (GPT-4-Turbo) using zero-shot learning and prompting against humans in identifying patients with recent LTCF exposure. LLM analyses were conducted from August to September 2024. Recent (≤12 months) LTCF exposure documented in the H&P note, as adjudicated by (1) humans and (2) an LLM. LLM sensitivity and specificity with Clopper-Pearson 95% CIs. Secondary outcomes were note review time and cost. The LLM was also prompted to provide a rationale and supporting note-text for each classification. The study included 359,601 eligible adult admissions, of which 2087 randomly sampled H&P notes were manually reviewed at UMMS (1020 individuals; median [IQR] age, 58 [41-71] years; 493 [48%] male) and Hopkins (1067 individuals; median [IQR] age, 58 [48-67] years; 561 [53%] male) for LTCF residence. Compared with human review, the LLM achieved a sensitivity of 97% (95% CI, 91%-100%) and a specificity of 98% (95% CI, 97%-99%) at UMMS, and 96% (95% CI, 86%-100%) and 93% (95% CI, 92%-95%) sensitivity and specificity, respectively, at Hopkins; specificity at Hopkins improved with prompt revision (96% [95% CI, 95%-97%]). Of 117 manually reviewed LLM rationales, all were factually correct and quoted note-text accurately, and some demonstrated inferential logic and external knowledge. The LLM identified 37 (1.8%) human errors. Human review took a mean of 2.5 minutes and cost $0.63 to $0.83 per note vs a mean of 4 to 6 seconds and $0.03 per note for LLM review. In this 13-hospital study of 2087 adult admissions, an LLM accurately identified LTCF residence from H&P notes and was more than 25 times faster and 20 times less expensive than human review.
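Below is a minimal sketch of how such a classification prompt could be structured, asking for a yes/no decision, a rationale, and a verbatim supporting quote; the wording, JSON fields, and `llm_call` helper are assumptions rather than the study's actual prompt.

```python
import json

LTCF_PROMPT = """You are reviewing a hospital history and physical (H&P) note.
Question: In the 12 months before this admission, did the patient reside in a
long-term care facility (e.g., nursing home, skilled nursing facility)?
Reply with a JSON object:
  {"ltcf_exposure": "yes" | "no",
   "rationale": "<one sentence explaining the decision>",
   "supporting_text": "<verbatim quote from the note, or null>"}

Note:
{note_text}
"""

def classify_note(note_text: str, llm_call) -> dict:
    """llm_call is any function that sends a prompt string to a chat model
    (e.g., GPT-4-Turbo) and returns its text response."""
    # str.replace is used (rather than str.format) because the JSON example
    # in the template contains literal braces.
    raw = llm_call(LTCF_PROMPT.replace("{note_text}", note_text))
    return json.loads(raw)
```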
- Research Article
- 10.1609/aaai.v39i1.32046
- Apr 11, 2025
- Proceedings of the AAAI Conference on Artificial Intelligence
Automated Program Repair (APR) for introductory programming assignments (IPAs) is motivated by the large number of student enrollments in programming courses each year. Since providing feedback on programming assignments requires substantial time and effort from faculty, personalized automated feedback often involves suggesting repairs to students' programs. Symbolic semantic repair approaches, which rely on Formal Methods (FM) to check a program's execution against a test suite or reference solution, are effective but limited. These tools excel at identifying buggy parts but can only fix programs if the correct implementation and the faulty one share the same control flow graph. Conversely, Large Language Models (LLMs) are used for program repair but often make extensive rewrites instead of minimal adjustments. This tends to lead to more invasive fixes, making it harder for students to learn from their mistakes. In summary, LLMs excel at completing strings, while FM-based fault localization excels at identifying buggy parts of a program. In this paper, we propose a novel approach that combines the strengths of both FM-based fault localization and LLMs, via zero-shot learning, to enhance APR for IPAs. Our method uses MaxSAT-based fault localization to identify buggy parts of a program, then presents the LLM with a program sketch devoid of these buggy statements. This hybrid approach follows a Counterexample Guided Inductive Synthesis (CEGIS) loop to iteratively refine the program. We ask the LLM to synthesize the missing parts, which are then checked against a test suite. If the suggested program is incorrect, a counterexample from the test suite is fed back to the LLM for revised synthesis. Our experiments on 1,431 incorrect student programs show that our counterexample-guided approach, using MaxSAT-based bug-free program sketches, significantly improves the repair capabilities of all six evaluated LLMs. This method allows LLMs to repair more programs and produce smaller fixes, outperforming other configurations and state-of-the-art symbolic program repair tools.
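The repair loop outlined in this abstract can be summarized in a short Python sketch; `localize_bugs`, `llm_complete`, and the test-case attributes below are hypothetical placeholders standing in for the paper's MaxSAT-based fault localizer, LLM call, and test harness.

```python
def cegis_repair(program: str, test_suite, llm_complete, localize_bugs, max_iters: int = 5):
    """Counterexample-guided repair loop (simplified sketch).

    localize_bugs(program, test_suite): stands in for MaxSAT-based fault
        localization and returns the set of buggy lines.
    llm_complete(sketch, feedback): asks an LLM to fill the holes in a sketch.
    Test cases are assumed to expose .passes(), .input, and .expected.
    """
    buggy = localize_bugs(program, test_suite)
    sketch = "\n".join(
        "/* HOLE: fill in the missing statement */" if line in buggy else line
        for line in program.splitlines()
    )
    feedback = ""
    for _ in range(max_iters):
        candidate = llm_complete(sketch, feedback)
        failing = [t for t in test_suite if not t.passes(candidate)]
        if not failing:
            return candidate                     # all tests pass: repair found
        cex = failing[0]                         # feed one counterexample back
        feedback = f"For input {cex.input!r} the program should output {cex.expected!r}."
    return None                                  # no repair within the iteration budget
```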
- Research Article
- 10.1016/j.ipm.2024.103973
- Dec 3, 2024
- Information Processing and Management
Are large language models qualified reviewers in originality evaluation?
- Research Article
- 10.1055/a-2264-5631
- Feb 26, 2024
- RoFo : Fortschritte auf dem Gebiete der Rontgenstrahlen und der Nuklearmedizin
Large language models (LLMs) such as ChatGPT have shown significant potential in radiology. Their effectiveness often depends on prompt engineering, which optimizes the interaction with the chatbot for accurate results. Here, we highlight the critical role of prompt engineering in tailoring the LLMs' responses to specific medical tasks. Using a clinical case, we elucidate different prompting strategies to adapt the LLM ChatGPT using GPT-4 to new tasks without additional training of the base model. These approaches range from precision prompts to advanced in-context methods such as few-shot and zero-shot learning. Additionally, the significance of embeddings, which serve as a data-representation technique, is discussed. Prompt engineering substantially improved and focused the chatbot's output. Moreover, embedding of specialized knowledge allows more transparent insight into the model's decision-making and thus enhances trust. Despite certain challenges, prompt engineering plays a pivotal role in harnessing the potential of LLMs for specialized tasks in the medical domain, particularly radiology. As LLMs continue to evolve, techniques like few-shot learning, zero-shot learning, and embedding-based retrieval mechanisms will become indispensable in delivering tailored outputs.
· Large language models might impact radiological practice and decision-making.
· However, implementation and performance depend on the assigned task.
· Optimization of prompting strategies can substantially improve model performance.
· Strategies for prompt engineering range from precision prompts to zero-shot learning.
· Russe MF, Reisert M, Bamberg F et al. Improving the use of LLMs in radiology through prompt engineering: from precision prompts to zero-shot learning. Fortschr Röntgenstr 2024; 196: 1166-1170.
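As a companion to the embedding discussion above, here is a minimal sketch of embedding-based retrieval by cosine similarity; the `embed` function is a stand-in for whichever embedding model is used, and the snippet is illustrative rather than a recommended clinical workflow.

```python
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two 1-D vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def retrieve(query: str, documents: list[str], embed, top_k: int = 3) -> list[str]:
    """Return the top_k documents most similar to the query.

    embed(text) is a placeholder for any sentence-embedding model that
    returns a 1-D numpy vector.
    """
    q = embed(query)
    scored = sorted(documents, key=lambda d: cosine(q, embed(d)), reverse=True)
    return scored[:top_k]

# The retrieved passages (e.g., guideline excerpts or report templates) are then
# placed into the prompt so the chatbot answers from current, domain-specific text.
```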
- Research Article
- 10.1093/bib/bbae354
- Jul 25, 2024
- Briefings in bioinformatics
Large language models (LLMs) are sophisticated AI-driven models trained on vast sources of natural language data. They are adept at generating responses that closely mimic human conversational patterns. One of the most notable examples is OpenAI's ChatGPT, which has been extensively used across diverse sectors. Despite their flexibility, a significant challenge arises as most users must transmit their data to the servers of companies operating these models. Utilizing ChatGPT or similar models online may inadvertently expose sensitive information to the risk of data breaches. Therefore, implementing LLMs that are open source and smaller in scale within a secure local network becomes a crucial step for organizations where ensuring data privacy and protection has the highest priority, such as regulatory agencies. As a feasibility evaluation, we implemented a series of open-source LLMs within a regulatory agency's local network and assessed their performance on specific tasks involving extracting relevant clinical pharmacology information from regulatory drug labels. Our research shows that some models work well in the context of few- or zero-shot learning, achieving performance comparable to, or even better than, that of neural network models that needed thousands of training samples. One of the models was selected to address a real-world issue of finding intrinsic factors that affect drugs' clinical exposure without any training or fine-tuning. In a dataset of over 700,000 sentences, the model showed a 78.5% accuracy rate. Our work pointed to the possibility of implementing open-source LLMs within a secure local network and using these models to perform various natural language processing tasks when large numbers of training examples are unavailable.
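A minimal sketch of running such an extraction task on a locally hosted open-source model with the Hugging Face transformers pipeline follows; the model identifier, example sentence, and prompt wording are placeholders, not the agency's actual setup.

```python
from transformers import pipeline  # model weights stay inside the local network

# The model name below is a placeholder for whichever open-source checkpoint
# is hosted locally (e.g., a Llama- or Mistral-family model).
generator = pipeline("text-generation", model="local-model-name-or-path")

sentence = "Reduce the dose by 50% in patients with moderate hepatic impairment."
prompt = (
    "Does the following drug-label sentence describe an intrinsic factor "
    "(e.g., age, renal or hepatic impairment, genetics) that affects clinical "
    "exposure? Answer Yes or No.\n\n"
    f"Sentence: {sentence}\nAnswer:"
)
result = generator(prompt, max_new_tokens=5, do_sample=False)
print(result[0]["generated_text"])
```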
- Preprint Article
- 10.20944/preprints202505.2095.v1
- May 28, 2025
While research on reasoning processes is still in progress, Large Language Models (LLMs) have recently demonstrated remarkable natural language processing capacity. Emphasizing multi-step problem-solving, organized decision-making, and human-feedback alignment, this paper critically reviews eight foundational works supporting the evolution of LLM reasoning. It investigates how generative pre-training (GPT-1, GPT-2) supports unsupervised and zero-shot learning after building parallelizable and scalable self-attention mechanisms with the Transformer architecture. Thanks to the invention of Chain-of-Thought (CoT) prompting, which demonstrated that sequential thinking increases logical coherence, LLMs can now explore numerous paths of reasoning; Tree of Thoughts (ToT) later grew out of this. Reinforcement Learning from Human Feedback (RLHF) has been essential in improving LLM alignment beyond prompting strategies, and Prototypical Reward Models (Proto-RM) improve the efficacy of learning from human preferences. Retrieval-Augmented Thought Trees (RATT) also address the problem of factual consistency by including outside knowledge sources, while the Thought Space Explorer (TSE) increases cognitive exploration and lets LLMs find fresh ideas. By combining these approaches, this study reveals new tendencies, points out ongoing difficulties, and offers a comparative analysis of organized thinking in LLMs, laying the groundwork for further advancements in artificial-intelligence-driven reasoning models. The report summarizes the key methodologies presented in the eight foundational papers, focusing on their evolution and impact on LLM reasoning.
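Since the review centers on Chain-of-Thought prompting, a minimal zero-shot CoT prompt is sketched below; the question and wording are illustrative only.

```python
# Zero-shot Chain-of-Thought: appending a "think step by step" cue invites the
# model to write out intermediate reasoning before giving its final answer.
question = ("A tank holds 240 liters and drains at 15 liters per minute. "
            "How long until it is empty?")

direct_prompt = f"{question}\nAnswer:"                      # no reasoning cue
cot_prompt = f"{question}\nLet's think step by step."       # CoT-style cue
```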
- Research Article
- 10.1101/2025.02.27.640661
- Mar 3, 2025
- bioRxiv : the preprint server for biology
The fast accumulation of vast pharmacogenomics data of cancer cell lines provides unprecedented opportunities for drug sensitivity prediction (DSP), a crucial prerequisite for the advancement of precision oncology. Recently, Generative Large Language Models (LLMs) have demonstrated performance and generalization prowess across diverse tasks in the field of natural language processing (NLP). However, the structured format of the pharmacogenomics data poses a challenge for the utility of LLMs in DSP. Therefore, the objective of this study is multi-fold: to adapt prompt engineering for structured pharmacogenomics data toward optimizing LLM's DSP performance, to evaluate LLM's generalization in real-world DSP scenarios, and to compare LLM's DSP performance against that of state-of-the-science baselines. We systematically investigated the capability of the Generative Pre-trained Transformer (GPT) as a DSP model on four publicly available benchmark pharmacogenomics datasets, which are stratified by five cancer tissue types of cell lines and encompass both oncology and non-oncology drugs. Essentially, the predictive landscape of GPT is assessed for effectiveness on the DSP task via four learning paradigms: zero-shot learning, few-shot learning, fine-tuning, and clustering pretrained embeddings. To facilitate GPT in seamlessly processing the structured pharmacogenomics data, domain-specific novel prompt engineering is employed by implementing three prompt templates (i.e., Instruction, Instruction-Prefix, Cloze) and integrating pharmacogenomics-related features into the prompt. We validated GPT's performance in diverse real-world DSP scenarios: cross-tissue generalization, blind tests, and analyses of drug-pathway associations and top sensitive/resistant cell lines. Furthermore, we conducted a comparative evaluation of GPT against multiple Transformer-based pretrained models and existing DSP baselines. Extensive experiments on the pharmacogenomics datasets across the five tissue cohorts demonstrate that fine-tuning GPT yields the best DSP performance (28% F1 increase, p-value=0.0003), followed by clustering pretrained GPT embeddings (26% F1 increase, p-value=0.0005), outperforming GPT in-context learning (i.e., few-shot). However, GPT in the zero-shot setting had a large F1 gap, resulting in the worst performance. Within the scope of prompt engineering, performance enhancement was achieved by directly instructing GPT about the DSP task and resorting to a concise context format (i.e., instruction-prefix), leading to an F1 performance gain of 22% (p-value=0.02); incorporation of drug/cell-line prompt context derived from genomics and/or molecular features further boosted the F1 score by 2%. Compared to state-of-the-science DSP baselines, GPT showed significantly superior mean F1 performance (16% gain, p-value<0.05) on the GDSC dataset. In the cross-tissue analysis, GPT showcased generalizability comparable to the within-tissue performances on the GDSC and PRISM datasets, with statistically significant F1 improvements on the CCLE (8%, p-value=0.001) and DrugComb (19%, p-value=0.009) datasets. Evaluation on the challenging blind tests suggests GPT's competitiveness on the CCLE and DrugComb datasets compared to random splitting. Furthermore, analyses of the drug-pathway associations and log probabilities provided valuable insights that align with previous DSP findings.
The diverse experiment setups and in-depth analysis underscore the potential of generative LLMs, such as GPT, as a viable in silico approach to guide precision oncology. https://github.com/bioIKEA/SensitiveCancerGPT.
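To make the three prompt styles mentioned above concrete, here is an illustrative rendering of Instruction, Instruction-Prefix, and Cloze templates for a single drug/cell-line record; the field names, toy values, and exact wording are assumptions and may differ from the study's templates.

```python
# Toy record standing in for one structured pharmacogenomics entry.
record = {"drug": "Erlotinib", "cell_line": "A549", "tissue": "lung",
          "features": "EGFR expression level: high"}

# Instruction template: spell out the task, then the record fields.
instruction = (
    "Predict whether the cell line is sensitive or resistant to the drug.\n"
    f"Drug: {record['drug']}. Cell line: {record['cell_line']} ({record['tissue']}). "
    f"Molecular context: {record['features']}.\nAnswer:"
)

# Instruction-Prefix template: short task prefix followed by a compact context.
instruction_prefix = (
    "Task: drug sensitivity prediction (sensitive/resistant).\n"
    f"{record['drug']} | {record['cell_line']} | {record['tissue']} | {record['features']}\n"
    "Answer:"
)

# Cloze template: the model fills in the blank.
cloze = (
    f"The {record['tissue']} cell line {record['cell_line']} "
    f"({record['features']}) is ___ to {record['drug']}."
)
```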
- Research Article
- 10.1016/j.compbiomed.2025.110181
- Jun 1, 2025
- Computers in biology and medicine
Zero-shot learning for clinical phenotyping: Comparing LLMs and rule-based methods.
- Research Article
- 10.52783/jisem.v10i42s.7922
- May 3, 2025
- Journal of Information Systems Engineering and Management
For understanding novel tasks, Zero-Shot Learning (ZSL) in combination with Large Language Models (LLMs) exhibits immense potential. By solely depending on task descriptions or guidelines provided in natural language, LLMs can deduce solutions without requiring explicit training data. For instance, an LLM could be assigned the task of summarizing a newly introduced scientific principle or responding to inquiries on an unfamiliar subject. The model's capability to understand tasks from linguistic indicators and apply pre-acquired knowledge is what makes ZSL particularly effective. Despite these advancements, challenges persist in implementing ZSL with LLMs for new task comprehension. Performance inconsistencies arise when novel tasks significantly differ from training data. Additionally, errors or irrelevant outputs may occur due to misinterpretations. Addressing biases in training data, ensuring output consistency, and enhancing interpretability remain crucial areas for further research.
- Research Article
- 10.1609/aaai.v38i19.30178
- Mar 24, 2024
- Proceedings of the AAAI Conference on Artificial Intelligence
Toxic content detection is crucial for online services to remove inappropriate content that violates community standards. To automate the detection process, prior works have proposed a variety of machine learning (ML) approaches to train Language Models (LMs) for toxic content detection. However, both their accuracy and transferability across datasets are limited. Recently, Large Language Models (LLMs) have shown promise in toxic content detection due to their superior zero-shot and few-shot in-context learning ability as well as broad transferability on ML tasks. However, efficiently designing prompts for LLMs remains challenging. Moreover, the high run-time cost of LLMs may hinder their deployment in production. To address these challenges, in this work, we propose BD-LLM, a novel and efficient approach to bootstrapping and distilling LLMs for toxic content detection. Specifically, we design a novel prompting method named Decision-Tree-of-Thought (DToT) to bootstrap LLMs' detection performance and extract high-quality rationales. DToT can automatically select more fine-grained context to re-prompt LLMs when their responses lack confidence. Additionally, we use the rationales extracted via DToT to fine-tune student LMs. Our experimental results on various datasets demonstrate that DToT can improve the accuracy of LLMs by up to 4.6%. Furthermore, student LMs fine-tuned with rationales extracted via DToT outperform baselines on all datasets with up to 16.9% accuracy improvement, while being more than 60x smaller than conventional LLMs. Finally, we observe that student LMs fine-tuned with rationales exhibit better cross-dataset transferability.
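A minimal sketch of the re-prompting idea behind DToT follows: ask with coarse context first and drill down to finer-grained context only when the answer lacks confidence. The `llm` callable, prompt wording, and confidence handling are simplified assumptions, not the paper's full algorithm.

```python
def dtot_classify(text: str, llm, context_levels: list[str], threshold: float = 0.8):
    """Decision-Tree-of-Thought-style loop (simplified sketch): re-prompt with
    finer-grained context whenever the model's answer lacks confidence.

    llm(prompt) is assumed to return (label, confidence, rationale); the real
    method derives confidence and rationales from the model's own responses.
    """
    label, rationale = None, None
    for context in context_levels:          # ordered from coarse to fine-grained
        prompt = (
            f"{context}\n\nIs the following content toxic? Answer yes or no, "
            f"give a confidence between 0 and 1, and explain your reasoning.\n\n{text}"
        )
        label, confidence, rationale = llm(prompt)
        if confidence >= threshold:
            break                           # confident enough: stop re-prompting
    return label, rationale

# The (text, label, rationale) triples collected this way can then be used to
# fine-tune a much smaller student LM, as the paper describes.
```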
- Research Article
- 10.1016/j.artmed.2025.103268
- Dec 1, 2025
- Artificial intelligence in medicine
A survey for large language models in biomedicine.
- Research Article
- Jan 1, 2025
- AMIA Joint Summits on Translational Science proceedings. AMIA Joint Summits on Translational Science
Automated text summarization (ATS) is crucial for collecting specialized, domain-specific information. Zero-shot learning (ZSL) allows large language models (LLMs) to respond to prompts on information not included in their training, playing a vital role in this process. This study evaluates LLMs' effectiveness in generating accurate summaries under ZSL conditions and explores using retrieval augmented generation (RAG) and prompt engineering to enhance factual accuracy and understanding. We combined LLMs with summarization modeling, prompt engineering, and RAG, evaluating the summaries using the METEOR metric and keyword frequencies through word clouds. Results indicate that LLMs are generally well-suited for ATS tasks, demonstrating an ability to handle specialized information under ZSL conditions with RAG. However, web scraping limitations hinder a single generalized retrieval mechanism. While LLMs show promise for ATS under ZSL conditions with RAG, challenges like goal misgeneralization and web scraping limitations need addressing. Future research should focus on solutions to these issues.
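For reference, a minimal sketch of scoring a generated summary with METEOR via NLTK is shown below; the reference and candidate strings are placeholders, not data from the study, and newer NLTK versions expect pre-tokenized input as shown.

```python
import nltk
from nltk.translate.meteor_score import meteor_score

nltk.download("wordnet", quiet=True)  # METEOR uses WordNet for synonym matching

reference = "The well was plugged and abandoned after reaching total depth."
candidate = "After reaching total depth, the well was plugged and abandoned."

# Newer NLTK versions expect pre-tokenized inputs (lists of tokens).
score = meteor_score([reference.split()], candidate.split())
print(f"METEOR: {score:.3f}")
```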