Benchmarking large language models on safety risks in scientific laboratories


Similar Papers
  • Discussion
  • Cited by 6
  • 10.1016/j.ebiom.2023.104672
Response to M. Trengove & coll regarding "Attention is not all you need: the complicated case of ethically using large language models in healthcare and medicine".
  • Jul 1, 2023
  • eBioMedicine
  • Stefan Harrer

  • Research Article
  • 10.1016/j.nuclphysbps.2011.09.021
The Cascades Proposal for the Deep Underground Science and Engineering Laboratory
  • Dec 1, 2011
  • Nuclear Physics B - Proceedings Supplements
  • W.C Haxton + 1 more

  • Research Article
  • 10.3390/bioengineering12070706
The PIEE Cycle: A Structured Framework for Red Teaming Large Language Models in Clinical Decision-Making
  • Jun 27, 2025
  • Bioengineering
  • Maissa Trabilsy + 9 more

The increasing integration of large language models (LLMs) into healthcare presents significant opportunities, but also critical risks related to patient safety, accuracy, and ethical alignment. Despite these concerns, no standardized framework exists for systematically evaluating and stress testing LLM behavior in clinical decision-making. The PIEE cycle—Planning and Preparation, Information Gathering and Prompt Generation, Execution, and Evaluation—is a structured red-teaming framework developed specifically to address artificial intelligence (AI) safety risks in healthcare decision-making. PIEE enables clinicians and informatics teams to simulate adversarial prompts, including jailbreaking, social engineering, and distractor attacks, to stress-test language models in real-world clinical scenarios. Model performance is evaluated using specific metrics such as true positive and false positive rates for detecting harmful content, hallucination rates measured through adapted TruthfulQA scoring, safety and reliability assessments, bias detection via adapted BBQ benchmarks, and ethical evaluation using structured Likert-based scoring rubrics. The framework is illustrated using examples from plastic surgery, but is adaptable across specialties, and is intended for use by all medical providers, regardless of their backgrounds or familiarity with artificial intelligence. While the framework is currently conceptual and validation is ongoing, PIEE provides a practical foundation for assessing the clinical reliability and ethical robustness of LLMs in medicine.
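The detector metrics named in the abstract (true and false positive rates for harmful-content flagging) reduce to simple counting over labeled prompts. A minimal sketch in Python; the function name and boolean encoding are illustrative, not taken from the PIEE paper:

```python
def detection_rates(predictions, labels):
    """True/false positive rates for a harmful-content detector.

    predictions[i]: the model flagged item i as harmful.
    labels[i]: item i actually is harmful (ground truth).
    """
    tp = sum(p and l for p, l in zip(predictions, labels))
    fp = sum(p and not l for p, l in zip(predictions, labels))
    positives = sum(labels)
    negatives = len(labels) - positives
    tpr = tp / positives if positives else 0.0   # sensitivity on harmful items
    fpr = fp / negatives if negatives else 0.0   # false-alarm rate on benign items
    return tpr, fpr
```

A perfect detector scores (1.0, 0.0); one that flags everything scores (1.0, 1.0).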

  • Research Article
  • Cited by 2
  • 10.3390/electronics13204044
Towards an End-to-End Personal Fine-Tuning Framework for AI Value Alignment
  • Oct 14, 2024
  • Electronics
  • Eleanor Watson + 4 more

This study introduces a novel architecture for value, preference, and boundary alignment in large language models (LLMs) and generative AI systems, accompanied by an experimental implementation. It addresses the limitations in AI model trustworthiness stemming from insufficient comprehension of personal context, preferences, and cultural diversity, which can lead to biases and safety risks. Using an inductive, qualitative research approach, we propose a framework for personalizing AI models to improve model alignment through additional context and boundaries set by users. Our framework incorporates user-friendly tools for identification, annotation, and simulation across diverse contexts, utilizing prompt-driven semantic segmentation and automatic labeling. It aims to streamline scenario generation and personalization processes while providing accessible annotation tools. The study examines various components of this framework, including user interfaces, underlying tools, and system mechanics. We present a pilot study that demonstrates the framework’s ability to reduce the complexity of value elicitation and personalization in LLMs. Our experimental setup involves a prototype implementation of key framework modules, including a value elicitation interface and a fine-tuning mechanism for language models. The primary goal is to create a token-based system that allows users to easily impart their values and preferences to AI systems, enhancing model personalization and alignment. This research contributes to the democratization of AI model fine-tuning and dataset generation, advancing efforts in AI value alignment. By focusing on practical implementation and user interaction, our study bridges the gap between theoretical alignment approaches and real-world applications in AI systems.

  • Research Article
  • Cited by 1
  • 10.2478/amns.2023.1.00204
Analysis of quantitative management of online intelligent monitoring of tailing ponds based on the perspective of safety prevention and control
  • Jun 2, 2023
  • Applied Mathematics and Nonlinear Sciences
  • Wenjun Ma + 4 more

Online monitoring technology for tailings ponds in China developed relatively late, and tailings ponds typically operate in harsh environments where traditional manual monitoring has clear limitations. Drawing on the actual conditions of the Zhenhua Mining tailings pond, this paper constructs a risk-monitoring index system and an online monitoring and early-warning model based on an LM-BP (Levenberg–Marquardt back-propagation) neural network to quantitatively assess tailings pond safety risks and to analyze and judge safety risk trends. We extracted common indicators of regional tailings ponds and combined them with meteorological data to establish a regional safety risk assessment model that integrates the vulnerability of disaster-bearing bodies, environmental sensitivity, and other influencing factors, realizing regional risk-coupling analysis and dynamically building a risk cloud map. From the perspective of safety risk prevention and control, the integrity and accuracy of monitoring data are analyzed, the causes of early warnings are traced back, alarm disposal mechanisms are established, and closed-loop management of early warnings is realized, providing scientific decision support for tailings pond safety supervisors.

  • Conference Article
  • 10.4043/36196-ms
Automated Classification of Offshore Production Units Maintenance Notes Using Machine Learning and Large Language Models
  • Oct 21, 2025
  • O C Correa + 6 more

Regular maintenance is vital for offshore rig safety and cost-efficiency, requiring careful planning to prevent downtime. Thousands of procedures occur yearly, posing challenges in resource allocation and scheduling. A key issue is identifying tasks that require production halts versus those that can proceed safely, as misclassification may cause downtime or safety risks. This paper explores using machine learning and large language models (LLMs) to automate maintenance task classification based on historical records.
Methods, Procedures, Process: The methodology involved a structured data pipeline to unify and prepare maintenance records for machine learning classification. Maintenance notes and tactical portfolios were consolidated into a single dataset through filtering, renaming, and type conversion. A Random Forest classifier was developed using thousands of offshore maintenance records. LLMs were employed to extract relevant terms and contextual features from maintenance notes to improve classification performance. These features were integrated into the classifier to distinguish tasks requiring production stops from those that did not. Model performance was evaluated using accuracy, precision, and F-score, and compared to traditional feature engineering methods.
Results, Observations, Conclusions: The classification model achieved 92% accuracy, with precision and F-score also reaching 92%, demonstrating robustness in real-world scenarios. Integrating LLMs significantly improved reliability compared to conventional text-based feature extraction. By automating classification, the system enables more informed decision-making, optimizing scheduling and minimizing unnecessary halts, thereby reducing safety risks. These advancements enhance operational efficiency and cost-effectiveness. The study also highlights the potential of AI-driven approaches in addressing offshore maintenance challenges, particularly logistical coordination and resource allocation. The system processes maintenance notes in two ways: through integration with proprietary systems accessing relevant databases, and through manual uploading of data extracted from CMMS. This dual data ingestion ensures comprehensive coverage, further supporting the model's efficacy in improving decision-making and efficiency in offshore maintenance.
Novel/Additive Information: This research introduces an innovative application of LLMs for offshore maintenance classification, outperforming traditional methods. It integrates AI to improve maintenance planning and resource allocation, contributing to petroleum industry efficiency. This work builds on the Petrobras Conexões para Inovação project, where machine learning was tested to classify maintenance notes from three offshore platforms.
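The "filtering, renaming, and type conversion" pipeline step described above can be sketched in a few lines. All field names below are hypothetical stand-ins; the paper's actual CMMS schema is not public:

```python
def prepare_records(raw_records):
    """Unify raw maintenance notes: drop empty notes, rename fields, coerce types.

    Field names ("desc", "dur_h", "stop") are illustrative assumptions,
    not the schema used in the paper.
    """
    renames = {"desc": "note_text", "dur_h": "duration_hours", "stop": "requires_stop"}
    prepared = []
    for rec in raw_records:
        if not rec.get("desc", "").strip():      # filter: skip records with no note text
            continue
        clean = {renames.get(k, k): v for k, v in rec.items()}   # rename fields
        clean["duration_hours"] = float(clean.get("duration_hours", 0))   # type conversion
        clean["requires_stop"] = bool(int(clean.get("requires_stop", 0)))
        prepared.append(clean)
    return prepared
```

The cleaned records would then feed a text classifier such as the Random Forest described above.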

  • Research Article
  • Cited by 2
  • 10.1108/ecam-08-2024-1143
Usability of large language models for building construction safety risk assessment
  • Jun 24, 2025
  • Engineering, Construction and Architectural Management
  • Mustafa Oral + 3 more

Purpose: Risk assessment is an approach that involves identifying potential workplace risks and determining the necessary precautions to reduce their impact on workers. The advent of artificial intelligence (AI) technology in recent years has greatly benefited safety experts in their assessments of risks. Large language models (LLMs), such as ChatGPT, may provide significant advantages in occupational safety professionals' risk assessment processes. LLMs enable them to quickly access information, generate reports, analyze data, and provide recommendations thanks to their natural language processing capability. This study aims to evaluate the usability of LLMs as a decision-support tool for risk assessments in building construction.
Design/methodology/approach: First, risks and precautions were defined for 12 work items in building construction. Subsequently, ten experts and ChatGPT were asked to evaluate the risks by level of importance on a five-point Likert scale. The similarity of the responses was calculated using the Modified Manhattan Distance. Next, the precautionary choices made by the experts and ChatGPT were compared.
Findings: The LLM provided answers similar to the experts' in terms of risk scores and precaution selection. Moreover, the similarity value of ChatGPT's responses surpassed that of the expert responses.
Originality/value: This study enhances the existing body of knowledge and provides valuable insights to industry stakeholders by showcasing the effectiveness of LLMs in evaluating occupational health and safety hazards. Moreover, to the best of our knowledge, this study represents one of the initial attempts to evaluate occupational safety and health risks with ChatGPT.
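The paper's Modified Manhattan Distance is not spelled out in the abstract; as a rough illustration, a plain Manhattan distance over five-point Likert response vectors, normalized to [0, 1], looks like this (the authors' modification may differ):

```python
def likert_distance(a, b):
    """Normalized Manhattan distance between two five-point Likert vectors.

    0.0 = identical ratings on every item; 1.0 = maximal disagreement
    (a difference of 4 points, e.g. 1 vs 5) on every item.
    This is a plain normalization, not necessarily the paper's variant.
    """
    if len(a) != len(b):
        raise ValueError("response vectors must be the same length")
    max_gap = 4 * len(a)      # each item can differ by at most 5 - 1 = 4
    return sum(abs(x - y) for x, y in zip(a, b)) / max_gap
```

Comparing each expert's vector to ChatGPT's under such a metric yields the kind of similarity scores the study reports.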

  • Supplementary Content
  • 10.1108/ir-02-2025-0074
Large language and vision-language models for robot: safety challenges, mitigation strategies and future directions
  • Jul 29, 2025
  • Industrial Robot: the international journal of robotics research and application
  • Xiangyu Hu + 1 more

Purpose: This study aims to explore the integration of large language models (LLMs) and vision-language models (VLMs) in robotics, highlighting their potential benefits and the safety challenges they introduce, including robustness issues, adversarial vulnerabilities, privacy concerns, and ethical implications.
Design/methodology/approach: This survey conducts a comprehensive analysis of the safety risks associated with LLM- and VLM-powered robotic systems. The authors review existing literature, analyze key challenges, evaluate current mitigation strategies, and propose future research directions.
Findings: The study identifies that ensuring the safety of LLM-/VLM-driven robots requires a multi-faceted approach. While current mitigation strategies address certain risks, gaps remain in real-time monitoring, adversarial robustness, and ethical safeguards.
Originality/value: This study offers a structured and comprehensive overview of the safety challenges in LLM-/VLM-driven robotics. It contributes to ongoing discussions by integrating technical, ethical, and regulatory perspectives to guide future advancements in safe and responsible artificial intelligence-driven robotics.

  • Research Article
  • Cited by 4
  • 10.1177/153567600501000303
Human Behavior in a Matrix of Hazards: Risk, Rules, and Ratio in Biomedical Laboratory Safety
  • Sep 1, 2005
  • Applied Biosafety
  • Michael B Blayney + 1 more

Laboratory safety presents special challenges in occupational safety and health management around the world. In scientific laboratories, all kinds of hazardous materials (biological, chemical, and radiological) are present, either individually or in some combination. Additionally, physical hazards in laboratories are ubiquitous and add to the overall risk faced by scientists, students, and the environment in all but the most benign settings. Managing the risks found in laboratories encompasses many aspects including safety rules, attitudes, opinions, and hazard and risk perception. A comparative look at the safety and health concerns of biomedical research scientists in The Netherlands and the United States led the authors to take a broad, international view of laboratory safety management. One empirical finding, a regulatory context, and one assumption underlie this viewpoint. First, the safety and health concerns of biomedical research scientists in a Dutch research institution and an American research institution turn out to be similar in many respects. Second, the relevant laws and regulations in the European Union (EU) and the United States (US) form the basis of "rules" intended to address biomedical laboratory safety concerns. Rules, however, tend to cover only a portion of the concerns expressed by biomedical research scientists. Our assumption is that attitudes and opinions are also important determinants in the establishment of a balance, a ratio, between the hazards and the rules intended to manage those risks. Based on this international view, the authors propose the development of a consensus-based international training curriculum and introductory program for biomedical research scientists covering relevant issues in health, safety, and environmental protection.

  • Research Article
  • 10.38124/ijisrt/26jan1453
Assessment of Memorization, Prompt Inference, and Retrieval Risks in Healthcare Large Language Models
  • Feb 5, 2026
  • International Journal of Innovative Science and Research Technology
  • Olufunke Adebola Akande + 3 more

This study examines the risks associated with the deployment of large language models (LLMs) in healthcare, focusing on memorization, prompt inference errors, and retrieval hazards. LLMs, such as GPT-4, MedPaLM, and fine-tuned clinical models like ClinicalBERT, are increasingly used in clinical decision support, diagnostic assistance, and administrative automation. While these models offer significant potential in improving healthcare delivery, they also present privacy and safety risks. The study investigates how these models memorize sensitive data, generate incorrect or unsafe responses due to prompt errors, and retrieve irrelevant or confidential information through external knowledge bases. The findings reveal that GPT-4, a general-purpose model, exhibits higher memorization and inference risks compared to domain-specific models like MedPaLM and ClinicalBERT, which showed improved performance in healthcare tasks and reduced memorization tendencies. The study also emphasizes the importance of prompt engineering, the potential hazards of retrieval-augmented generation (RAG) systems, and the necessity of privacy-preserving techniques. Based on these findings, the paper proposes a set of practical recommendations for safe LLM integration in healthcare, including data governance practices, prompt validation protocols, and retrieval safeguards. Finally, the study outlines a framework for risk mitigation and suggests directions for future research, including longitudinal studies on model drift, cross-institutional validation of risk profiles, and human-in-the-loop interventions for real-world deployment. The findings provide essential insights for clinicians, AI researchers, and policymakers working to safely deploy AI in healthcare.

  • Research Article
  • Cited by 4
  • 10.1016/j.aap.2025.108041
Collision risk prediction and takeover requirements assessment based on radar-video integrated sensors data: A system framework based on LLM.
  • Aug 1, 2025
  • Accident; analysis and prevention
  • Qingchao Liu + 5 more

  • Research Article
  • Cited by 14
  • 10.1162/tacl_a_00639
Red Teaming Language Model Detectors with Language Models
  • Feb 23, 2024
  • Transactions of the Association for Computational Linguistics
  • Zhouxing Shi + 5 more

The prevalence and strong capability of large language models (LLMs) present significant safety and ethical risks if exploited by malicious users. To prevent the potentially deceptive usage of LLMs, recent work has proposed algorithms to detect LLM-generated text and protect LLMs. In this paper, we investigate the robustness and reliability of these LLM detectors under adversarial attacks. We study two types of attack strategies: 1) replacing certain words in an LLM’s output with their synonyms given the context; 2) automatically searching for an instructional prompt to alter the writing style of the generation. In both strategies, we leverage an auxiliary LLM to generate the word replacements or the instructional prompt. Different from previous works, we consider a challenging setting where the auxiliary LLM can also be protected by a detector. Experiments reveal that our attacks effectively compromise the performance of all detectors in the study with plausible generations, underscoring the urgent need to improve the robustness of LLM-generated text detection systems. Code is available at https://github.com/shizhouxing/LLM-Detector-Robustness.
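Attack strategy 1 above (context-aware synonym replacement) can be sketched as a greedy search that keeps any substitution lowering the detector's score. The two callables stand in for the auxiliary LLM and the detector; both are assumptions for illustration, not the paper's actual components:

```python
def synonym_attack(tokens, propose_synonyms, detector_score):
    """Greedy word-substitution sketch of the paper's first attack strategy.

    propose_synonyms(word, context) -> candidate replacements
        (stands in for the auxiliary LLM's suggestions).
    detector_score(text) -> score the detector assigns to "LLM-generated";
        lower is better for the attacker.
    """
    best = list(tokens)
    score = detector_score(" ".join(best))
    for i in range(len(best)):
        for cand in propose_synonyms(best[i], best):
            trial = best[:i] + [cand] + best[i + 1:]
            s = detector_score(" ".join(trial))
            if s < score:                 # keep the substitution if it evades more
                best, score = trial, s
                break
    return " ".join(best), score
```

With a toy detector that counts "machine-sounding" words, the loop steadily drives the score down.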

  • Research Article
  • Cited by 3
  • 10.2196/56126
Proficiency, Clarity, and Objectivity of Large Language Models Versus Specialists' Knowledge on COVID-19's Impacts in Pregnancy: Cross-Sectional Pilot Study.
  • Feb 5, 2025
  • JMIR formative research
  • Nicola Luigi Bragazzi + 7 more

The COVID-19 pandemic has significantly strained health care systems globally, leading to an overwhelming influx of patients and exacerbating resource limitations. Concurrently, an "infodemic" of misinformation, particularly prevalent in women's health, has emerged. This challenge has been pivotal for health care providers, especially gynecologists and obstetricians, in managing pregnant women's health. The pandemic heightened risks for pregnant women from COVID-19, necessitating balanced advice from specialists on vaccine safety versus known risks. In addition, the advent of generative artificial intelligence (AI), such as large language models (LLMs), offers promising support in health care. However, they necessitate rigorous testing. This study aimed to assess LLMs' proficiency, clarity, and objectivity regarding COVID-19's impacts on pregnancy. This study evaluates 4 major AI prototypes (ChatGPT-3.5, ChatGPT-4, Microsoft Copilot, and Google Bard) using zero-shot prompts in a questionnaire validated among 159 Israeli gynecologists and obstetricians. The questionnaire assesses proficiency in providing accurate information on COVID-19 in relation to pregnancy. Text-mining, sentiment analysis, and readability (Flesch-Kincaid grade level and Flesch Reading Ease Score) were also conducted. In terms of LLMs' knowledge, ChatGPT-4 and Microsoft Copilot each scored 97% (32/33), Google Bard 94% (31/33), and ChatGPT-3.5 82% (27/33). ChatGPT-4 incorrectly stated an increased risk of miscarriage due to COVID-19. Google Bard and Microsoft Copilot had minor inaccuracies concerning COVID-19 transmission and complications. In the sentiment analysis, Microsoft Copilot achieved the least negative score (-4), followed by ChatGPT-4 (-6) and Google Bard (-7), while ChatGPT-3.5 obtained the most negative score (-12). 
Finally, concerning the readability analysis, Flesch-Kincaid Grade Level and Flesch Reading Ease Score showed that Microsoft Copilot was the most accessible at 9.9 and 49, followed by ChatGPT-4 at 12.4 and 37.1, while ChatGPT-3.5 (12.9 and 35.6) and Google Bard (12.9 and 35.8) generated particularly complex responses. The study highlights varying knowledge levels of LLMs in relation to COVID-19 and pregnancy. ChatGPT-3.5 showed the least knowledge and alignment with scientific evidence. Readability and complexity analyses suggest that each AI's approach was tailored to specific audiences, with ChatGPT versions being more suitable for specialized readers and Microsoft Copilot for the general public. Sentiment analysis revealed notable variations in the way LLMs communicated critical information, underscoring the essential role of neutral and objective health care communication in ensuring that pregnant women, particularly vulnerable during the COVID-19 pandemic, receive accurate and reassuring guidance. Overall, ChatGPT-4, Microsoft Copilot, and Google Bard generally provided accurate, updated information on COVID-19 and vaccines in maternal and fetal health, aligning with health guidelines. The study demonstrated the potential role of AI in supplementing health care knowledge, with a need for continuous updating and verification of AI knowledge bases. The choice of AI tool should consider the target audience and required information detail level.
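The two readability indices reported above follow standard published formulas; given word, sentence, and syllable counts they can be computed directly (the text-to-counts step, the hard part in practice, is omitted here):

```python
def flesch_scores(words, sentences, syllables):
    """Flesch Reading Ease and Flesch-Kincaid Grade Level from raw counts.

    These are the standard published formulas; higher Reading Ease means
    easier text, while Grade Level approximates the US school grade needed.
    """
    wps = words / sentences        # average sentence length in words
    spw = syllables / words        # average syllables per word
    reading_ease = 206.835 - 1.015 * wps - 84.6 * spw
    grade_level = 0.39 * wps + 11.8 * spw - 15.59
    return reading_ease, grade_level
```

For example, Microsoft Copilot's reported pair (grade 9.9, ease 49) corresponds to shorter sentences and simpler words than ChatGPT-3.5's (12.9, 35.6).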

  • Book Chapter
  • 10.3233/nhsdp250086
AI to Support Frontline Mental Health Workers in the Ukraine War
  • Jan 8, 2026
  • Isaac R Galatzer-Levy + 3 more

The ongoing war in Ukraine presents unprecedented challenges to the mental health and readiness of its soldiers. Frontline mental health workers face immense pressure, operating with limited resources and often in isolation, while navigating complex clinical presentations amidst active combat. This chapter details the development and validation process of a novel Large Language Model (LLM) agent designed to serve as a digital companion for these frontline professionals. This AI tool aims to bridge the gap in immediate peer consultation and expert guidance, offering decision support grounded in established Combat and Operational Stress Control (COSC) principles and tailored to the unique cultural and operational context of the Ukrainian military. The development process involves collecting rich narrative case studies and decision-making challenges directly from experienced Ukrainian frontline mental health workers. These narratives form the basis for prompt engineering and few-shot learning, iteratively refining the LLM agent’s ability to understand context, assess symptom severity, functionality, and safety risks, and provide relevant disposition options (Return to Duty, In-Theater Support, Evacuation). The validation framework emphasizes accuracy against expert annotations, adherence to established COSC guidelines, and sensitivity to Ukrainian cultural nuances, including the historical context of mental health services and the importance of military identity rooted in figures like the Zaporozhian Cossacks. The chapter addresses key challenges, including overcoming stigma, the need for standardized assessment tools, and the ethical considerations of deploying AI in a high-stakes environment. 
By simulating collaborative decision-making, the AI companion seeks to enhance the capacity of frontline workers, promote consistent application of best practices like Combat Path Debriefing, and ultimately bolster the resilience and combat effectiveness of Ukrainian soldiers facing prolonged and intense operational stress. This project represents a crucial step in leveraging AI to augment human expertise in crisis situations, offering a scalable solution to support mental health providers on the most demanding frontlines.

  • Research Article
  • Cited by 19
  • 10.1177/10946705251340487
Generative AI Meets Service Robots
  • May 29, 2025
  • Journal of Service Research
  • Jochen Wirtz + 1 more

We explore the transformative impact of integrating generative artificial intelligence (GenAI) in the form of large language models (LLMs), large behavioral models (LBMs), and agentic AI into physical service robots and how these will transform physical service encounters. This conceptual article first shows that GenAI-powered service robots (also referred to as GenAI robots) will be able to autonomously deliver more complex, customized, and personalized customer service. Second, GenAI’s increasing capacity for no-code programming is expected to democratize robot training, improvement, and fine-tuning by frontline employees, thus improving robot performance. Third, the implications of GenAI robots are outlined for frontline employees (i.e., their work and job scopes, and a new role as citizen developer), customers (i.e., improved customer experiences and service outcomes), and the service firm (i.e., a pathway to cost-effective service excellence, continuous improvement and agility, alleviation of labor shortage, and the introduction of new ethical, fairness, privacy, health, and safety risks into physical service encounters). This article is the first to explore the theoretical and practical implications of GenAI robots in physical service encounters and opens a new stream of service research.
