Benchmarking large language models on safety risks in scientific laboratories
- Discussion
6
- 10.1016/j.ebiom.2023.104672
- Jul 1, 2023
- eBioMedicine
Response to M. Trengove & coll regarding "Attention is not all you need: the complicated case of ethically using large language models in healthcare and medicine".
- Research Article
- 10.1016/j.nuclphysbps.2011.09.021
- Dec 1, 2011
- Nuclear Physics B - Proceedings Supplements
The Cascades Proposal for the Deep Underground Science and Engineering Laboratory
- Research Article
- 10.3390/bioengineering12070706
- Jun 27, 2025
- Bioengineering
The increasing integration of large language models (LLMs) into healthcare presents significant opportunities, but also critical risks related to patient safety, accuracy, and ethical alignment. Despite these concerns, no standardized framework exists for systematically evaluating and stress testing LLM behavior in clinical decision-making. The PIEE cycle—Planning and Preparation, Information Gathering and Prompt Generation, Execution, and Evaluation—is a structured red-teaming framework developed specifically to address artificial intelligence (AI) safety risks in healthcare decision-making. PIEE enables clinicians and informatics teams to simulate adversarial prompts, including jailbreaking, social engineering, and distractor attacks, to stress-test language models in real-world clinical scenarios. Model performance is evaluated using specific metrics such as true positive and false positive rates for detecting harmful content, hallucination rates measured through adapted TruthfulQA scoring, safety and reliability assessments, bias detection via adapted BBQ benchmarks, and ethical evaluation using structured Likert-based scoring rubrics. The framework is illustrated using examples from plastic surgery, but is adaptable across specialties, and is intended for use by all medical providers, regardless of their backgrounds or familiarity with artificial intelligence. While the framework is currently conceptual and validation is ongoing, PIEE provides a practical foundation for assessing the clinical reliability and ethical robustness of LLMs in medicine.
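As a minimal illustration of the detection metrics PIEE reports (true and false positive rates for flagging harmful content), the sketch below computes both from binary labels. The toy data and function names are ours, not part of the framework.

```python
# Illustrative sketch: computing detection metrics of the kind the PIEE
# framework describes (true/false positive rates for harmful-content
# detection). Labels and predictions here are invented toy data.

def detection_rates(y_true, y_pred):
    """Return (true_positive_rate, false_positive_rate) for binary labels,
    where 1 = harmful content correctly flagged."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    tpr = tp / (tp + fn) if (tp + fn) else 0.0
    fpr = fp / (fp + tn) if (fp + tn) else 0.0
    return tpr, fpr

# Toy evaluation over 8 adversarial prompts:
# 1 = prompt elicited harmful output, 0 = safe refusal.
ground_truth = [1, 1, 0, 0, 1, 0, 1, 0]
model_flags  = [1, 0, 0, 1, 1, 0, 1, 0]
tpr, fpr = detection_rates(ground_truth, model_flags)
print(f"TPR={tpr:.2f}, FPR={fpr:.2f}")  # TPR=0.75, FPR=0.25
```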
- Research Article
2
- 10.3390/electronics13204044
- Oct 14, 2024
- Electronics
This study introduces a novel architecture for value, preference, and boundary alignment in large language models (LLMs) and generative AI systems, accompanied by an experimental implementation. It addresses the limitations in AI model trustworthiness stemming from insufficient comprehension of personal context, preferences, and cultural diversity, which can lead to biases and safety risks. Using an inductive, qualitative research approach, we propose a framework for personalizing AI models to improve model alignment through additional context and boundaries set by users. Our framework incorporates user-friendly tools for identification, annotation, and simulation across diverse contexts, utilizing prompt-driven semantic segmentation and automatic labeling. It aims to streamline scenario generation and personalization processes while providing accessible annotation tools. The study examines various components of this framework, including user interfaces, underlying tools, and system mechanics. We present a pilot study that demonstrates the framework’s ability to reduce the complexity of value elicitation and personalization in LLMs. Our experimental setup involves a prototype implementation of key framework modules, including a value elicitation interface and a fine-tuning mechanism for language models. The primary goal is to create a token-based system that allows users to easily impart their values and preferences to AI systems, enhancing model personalization and alignment. This research contributes to the democratization of AI model fine-tuning and dataset generation, advancing efforts in AI value alignment. By focusing on practical implementation and user interaction, our study bridges the gap between theoretical alignment approaches and real-world applications in AI systems.
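A hedged sketch of what a token-based preference layer of the kind described above might look like: elicited user values are serialized into control tokens prepended to prompts so a fine-tuned model can condition on them. The token format and value schema here are illustrative assumptions, not the paper's implementation.

```python
# Hedged sketch of a token-based preference layer: user values become
# inline control tokens prepended to prompts. Token format and value
# schema are assumptions for illustration only.

def build_value_tokens(preferences: dict) -> str:
    """Serialize elicited user values into inline control tokens."""
    return " ".join(f"<value:{key}={val}>" for key, val in sorted(preferences.items()))

def personalize_prompt(user_prompt: str, preferences: dict) -> str:
    """Prepend value tokens so a fine-tuned model can condition on them."""
    return f"{build_value_tokens(preferences)}\n{user_prompt}"

prefs = {"formality": "high", "risk_tolerance": "low", "locale": "de-DE"}
print(personalize_prompt("Suggest an investment strategy.", prefs))
```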
- Research Article
1
- 10.2478/amns.2023.1.00204
- Jun 2, 2023
- Applied Mathematics and Nonlinear Sciences
Online monitoring technology for tailings ponds in China started late, and tailings ponds are located in harsh working environments that limit traditional manual monitoring. Addressing these limitations, and drawing on the actual situation of the Zhenhua Mining tailings pond, this paper constructs a risk-monitoring index system and an online monitoring and early-warning model based on a Levenberg-Marquardt back-propagation (LM-BP) neural network to quantitatively assess tailings pond safety risks and to analyze and judge safety risk trends. We extracted common indicators of regional tailings ponds and combined them with meteorological data to establish a regional safety risk assessment model, integrated influencing factors such as the vulnerability of disaster-bearing bodies and environmental sensitivity, realized regional risk-coupling analysis, and dynamically built a risk cloud map. From the perspective of safety risk prevention and control, the integrity and accuracy of monitoring data are analyzed, the causes of early warnings are traced back, alarm-disposal mechanisms are established, and closed-loop management of early warnings is realized, providing scientific decision support for tailings pond safety supervisors.
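To make the modeling step concrete, here is a minimal back-propagation network trained on invented monitoring indicators. A genuine LM-BP model uses Levenberg-Marquardt optimization rather than the plain gradient descent shown, and the feature set and risk weighting below are assumptions for demonstration only.

```python
# Hedged sketch of a neural-network risk model for tailings pond
# monitoring. Plain back-propagation with gradient descent stands in for
# Levenberg-Marquardt training; indicators and risk labels are invented.
import numpy as np

rng = np.random.default_rng(0)
# Toy features per reading: [water level, seepage rate, rainfall, dam displacement]
X = rng.random((200, 4))
# Toy risk score: an assumed weighting of indicators, for demonstration only.
y = (0.4 * X[:, 0] + 0.3 * X[:, 1] + 0.2 * X[:, 2] + 0.1 * X[:, 3]).reshape(-1, 1)

W1 = rng.normal(0, 0.5, (4, 8)); b1 = np.zeros(8)
W2 = rng.normal(0, 0.5, (8, 1)); b2 = np.zeros(1)

for epoch in range(2000):
    h = np.tanh(X @ W1 + b1)          # hidden layer
    pred = h @ W2 + b2                # predicted risk score
    err = pred - y
    # Back-propagate mean-squared-error gradients.
    gW2 = h.T @ err / len(X); gb2 = err.mean(0)
    dh = (err @ W2.T) * (1 - h ** 2)
    gW1 = X.T @ dh / len(X); gb1 = dh.mean(0)
    for p, g in ((W1, gW1), (b1, gb1), (W2, gW2), (b2, gb2)):
        p -= 0.5 * g
print("final MSE:", float((err ** 2).mean()))
```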
- Conference Article
- 10.4043/36196-ms
- Oct 21, 2025
Regular maintenance is vital for offshore rig safety and cost-efficiency, requiring careful planning to prevent downtime. Thousands of procedures occur yearly, posing challenges in resource allocation and scheduling. A key issue is identifying tasks that require production halts versus those that can proceed safely, as misclassification may cause downtime or safety risks. This paper explores using machine learning and large language models (LLMs) to automate maintenance task classification based on historical records.

Methods, Procedures, Process: The methodology involved a structured data pipeline to unify and prepare maintenance records for machine learning classification. Maintenance notes and tactical portfolios were consolidated into a single dataset through filtering, renaming, and type conversion. A Random Forest classifier was developed using thousands of offshore maintenance records. LLMs were employed to extract relevant terms and contextual features from maintenance notes to improve classification performance. These features were integrated into the classifier to distinguish tasks requiring production stops from those that did not. Model performance was evaluated using accuracy, precision, and F-score, and compared to traditional feature engineering methods.

Results, Observations, Conclusions: The classification model achieved 92% accuracy, with precision and F-score also reaching 92%, demonstrating robustness in real-world scenarios. Integrating LLMs significantly improved reliability compared to conventional text-based feature extraction. By automating classification, the system enables more informed decision-making, optimizing scheduling and minimizing unnecessary halts, thereby reducing safety risks. These advancements enhance operational efficiency and cost-effectiveness. The study also highlights the potential of AI-driven approaches in addressing offshore maintenance challenges, particularly logistical coordination and resource allocation. The system processes maintenance notes in two ways: through integration with proprietary systems accessing relevant databases, and through manual uploading of data extracted from CMMS. This dual data ingestion ensures comprehensive coverage, further supporting the model's efficacy in improving decision-making and efficiency in offshore maintenance.

Novel/Additive Information: This research introduces an innovative application of LLMs for offshore maintenance classification, outperforming traditional methods. It integrates AI to improve maintenance planning and resource allocation, contributing to petroleum industry efficiency. This work builds on the Petrobras Conexões para Inovação project, where machine learning was tested to classify maintenance notes from three offshore platforms.
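A minimal sketch of the kind of pipeline described: a Random Forest classifier over features derived from maintenance notes, evaluated with accuracy, precision, and F-score. The keyword flags stand in for LLM-extracted contextual features, and the data and decision rule are invented, so the printed scores do not reproduce the paper's results.

```python
# Hedged sketch of a maintenance-task classification pipeline. Simple
# keyword flags stand in for LLM-extracted features; data are synthetic.
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, precision_score, f1_score
from sklearn.model_selection import train_test_split
import numpy as np

rng = np.random.default_rng(1)
n = 500
# Toy features: [mentions_shutdown, involves_hot_work, criticality, asset_age]
X = np.column_stack([
    rng.integers(0, 2, n),       # keyword flag an LLM might extract
    rng.integers(0, 2, n),
    rng.integers(1, 6, n),       # criticality 1-5
    rng.uniform(0, 30, n),       # asset age in years
])
# Toy rule: halt production when a shutdown is mentioned or criticality is high.
y = ((X[:, 0] == 1) | (X[:, 2] >= 4)).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)
pred = clf.predict(X_te)
print(f"accuracy={accuracy_score(y_te, pred):.2f} "
      f"precision={precision_score(y_te, pred):.2f} "
      f"F1={f1_score(y_te, pred):.2f}")
```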
- Research Article
2
- 10.1108/ecam-08-2024-1143
- Jun 24, 2025
- Engineering, Construction and Architectural Management
Purpose: Risk assessment is an approach that involves identifying potential workplace risks and determining the necessary precautions to reduce their impact on workers. The advent of artificial intelligence (AI) technology in recent years has greatly benefited safety experts in their assessments of risks. Large language models (LLMs), such as ChatGPT, may provide significant advantages in occupational safety professionals’ risk assessment processes. LLMs enable them to quickly access information, generate reports, analyze data and provide recommendations thanks to their natural language processing capability. This study aims to evaluate the usability of LLMs as a decision-support tool for risk assessments in building construction.

Design/methodology/approach: First, risks and precautions were defined for 12 work items in building construction. Subsequently, ten experts and ChatGPT were requested to evaluate the risks based on their level of importance using a five-point Likert scale. The similarity of the responses was calculated using the Modified Manhattan Distance. Next, the precautionary choices made by the experts and ChatGPT were compared.

Findings: It was found that the LLM provided similar answers to the experts in terms of risk scores and precaution selection. Nevertheless, the similarity value of ChatGPT responses surpasses the similarity value of expert responses.

Originality/value: This study enhances the existing body of knowledge and provides valuable insights to industry stakeholders by showcasing the effectiveness of LLMs in evaluating occupational health and safety hazards. Moreover, to the best of our knowledge, this study represents one of the initial attempts to evaluate occupational safety and health risks with ChatGPT.
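Since the exact "Modified Manhattan Distance" is not reproduced here, the sketch below shows one plausible reading: plain Manhattan distance between Likert-rating vectors, normalized by the maximum possible disagreement to give a similarity in [0, 1].

```python
# Hedged sketch of comparing Likert-scale risk ratings, as in the study's
# expert-vs-ChatGPT comparison. This normalization is one plausible
# reading of a "modified" Manhattan distance, not the paper's definition.

def similarity(ratings_a, ratings_b, scale_max=5, scale_min=1):
    """Return similarity in [0, 1]: 1 = identical ratings."""
    assert len(ratings_a) == len(ratings_b)
    dist = sum(abs(a - b) for a, b in zip(ratings_a, ratings_b))
    max_dist = (scale_max - scale_min) * len(ratings_a)
    return 1 - dist / max_dist

expert =  [5, 4, 3, 5, 2, 4]   # expert importance ratings for six risks
chatgpt = [5, 3, 3, 4, 2, 5]   # ChatGPT ratings for the same risks
print(f"similarity = {similarity(expert, chatgpt):.3f}")  # 0.875
```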
- Supplementary Content
- 10.1108/ir-02-2025-0074
- Jul 29, 2025
- Industrial Robot: the international journal of robotics research and application
Purpose: This study aims to explore the integration of large language models (LLMs) and vision-language models (VLMs) in robotics, highlighting their potential benefits and the safety challenges they introduce, including robustness issues, adversarial vulnerabilities, privacy concerns and ethical implications.

Design/methodology/approach: This survey conducts a comprehensive analysis of the safety risks associated with LLM- and VLM-powered robotic systems. The authors review existing literature, analyze key challenges, evaluate current mitigation strategies and propose future research directions.

Findings: The study identifies that ensuring the safety of LLM-/VLM-driven robots requires a multi-faceted approach. While current mitigation strategies address certain risks, gaps remain in real-time monitoring, adversarial robustness and ethical safeguards.

Originality/value: This study offers a structured and comprehensive overview of the safety challenges in LLM-/VLM-driven robotics. It contributes to ongoing discussions by integrating technical, ethical and regulatory perspectives to guide future advancements in safe and responsible artificial intelligence-driven robotics.
- Research Article
4
- 10.1177/153567600501000303
- Sep 1, 2005
- Applied Biosafety
Laboratory safety presents special challenges in occupational safety and health management around the world. In scientific laboratories, all kinds of hazardous materials (biological, chemical, and radiological) are present—either individually or in some combination. Additionally, physical hazards in laboratories are ubiquitous and add to the overall risk faced by scientists, students, and the environment in all but the most benign settings. Managing the risks found in laboratories encompasses many aspects including safety rules, attitudes, opinions, and hazard and risk perception. A comparative look at the safety and health concerns of biomedical research scientists in The Netherlands and the United States led the authors to take a broad, international view of laboratory safety management. One empirical finding, a regulatory context, and one assumption underlie this viewpoint. First, the safety and health concerns of biomedical research scientists in a Dutch research institution and an American research institution turn out to be similar in many respects. Second, the relevant laws and regulations in the European Union (EU) and the United States (US) form the basis of “rules” intended to address biomedical laboratory safety concerns. Rules, however, tend to cover only a portion of the concerns expressed by biomedical research scientists. Our assumption is that attitudes and opinions are also important determinants in the establishment of a balance—a ratio—between the hazards and the rules intended to manage those risks. Based on this international view, the authors propose the development of a consensus-based international training curriculum and introductory program for biomedical research scientists covering relevant issues in health, safety, and environmental protection.
- Research Article
- 10.38124/ijisrt/26jan1453
- Feb 5, 2026
- International Journal of Innovative Science and Research Technology
This study examines the risks associated with the deployment of large language models (LLMs) in healthcare, focusing on memorization, prompt inference errors, and retrieval hazards. LLMs, such as GPT-4, MedPaLM, and fine-tuned clinical models like ClinicalBERT, are increasingly used in clinical decision support, diagnostic assistance, and administrative automation. While these models offer significant potential in improving healthcare delivery, they also present privacy and safety risks. The study investigates how these models memorize sensitive data, generate incorrect or unsafe responses due to prompt errors, and retrieve irrelevant or confidential information through external knowledge bases. The findings reveal that GPT-4, a general-purpose model, exhibits higher memorization and inference risks compared to domain-specific models like MedPaLM and ClinicalBERT, which showed improved performance in healthcare tasks and reduced memorization tendencies. The study also emphasizes the importance of prompt engineering, the potential hazards of retrieval-augmented generation (RAG) systems, and the necessity of privacy-preserving techniques. Based on these findings, the paper proposes a set of practical recommendations for safe LLM integration in healthcare, including data governance practices, prompt validation protocols, and retrieval safeguards. Finally, the study outlines a framework for risk mitigation and suggests directions for future research, including longitudinal studies on model drift, cross-institutional validation of risk profiles, and human-in-the-loop interventions for real-world deployment. The findings provide essential insights for clinicians, AI researchers, and policymakers working to safely deploy AI in healthcare.
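As one concrete example of the retrieval safeguards recommended above, the sketch below screens retrieved passages for PII patterns before they enter a RAG prompt. The regexes are simplistic placeholders; a deployed system would use a dedicated PII-detection component.

```python
# Hedged sketch of a retrieval safeguard: drop retrieved passages that
# match PII patterns before they reach a RAG prompt. Patterns are
# simplistic placeholders, not a production PII detector.
import re

PII_PATTERNS = [
    re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),          # US SSN-like identifier
    re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"),    # email address
    re.compile(r"\bMRN[:\s]*\d+\b", re.I),         # medical record number
]

def safe_passages(passages):
    """Keep only retrieved passages with no PII pattern match."""
    return [p for p in passages if not any(rx.search(p) for rx in PII_PATTERNS)]

retrieved = [
    "Metformin is first-line therapy for type 2 diabetes.",
    "Patient John Doe, MRN: 483920, reported dizziness.",
]
print(safe_passages(retrieved))  # only the first passage survives
```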
- Research Article
4
- 10.1016/j.aap.2025.108041
- Aug 1, 2025
- Accident Analysis & Prevention
Collision risk prediction and takeover requirements assessment based on radar-video integrated sensors data: A system framework based on LLM.
- Research Article
14
- 10.1162/tacl_a_00639
- Feb 23, 2024
- Transactions of the Association for Computational Linguistics
The prevalence and strong capability of large language models (LLMs) present significant safety and ethical risks if exploited by malicious users. To prevent the potentially deceptive usage of LLMs, recent work has proposed algorithms to detect LLM-generated text and protect LLMs. In this paper, we investigate the robustness and reliability of these LLM detectors under adversarial attacks. We study two types of attack strategies: 1) replacing certain words in an LLM’s output with their synonyms given the context; 2) automatically searching for an instructional prompt to alter the writing style of the generation. In both strategies, we leverage an auxiliary LLM to generate the word replacements or the instructional prompt. Different from previous works, we consider a challenging setting where the auxiliary LLM can also be protected by a detector. Experiments reveal that our attacks effectively compromise the performance of all detectors in the study with plausible generations, underscoring the urgent need to improve the robustness of LLM-generated text detection systems. Code is available at https://github.com/shizhouxing/LLM-Detector-Robustness.
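A hedged sketch of the first attack strategy: greedily replacing words in a generation with synonyms until a detector's score falls below a threshold. Both propose_synonym (standing in for the auxiliary LLM) and detector_score (standing in for the detector under attack) are toy placeholders.

```python
# Hedged sketch of a synonym-replacement attack on an LLM-text detector.
# Both helper functions are toy placeholders for the auxiliary LLM and
# the detector described in the paper.

def propose_synonym(word, context):
    # Placeholder: in the paper, an auxiliary LLM suggests context-aware
    # replacements; here a tiny fixed thesaurus stands in.
    toy_thesaurus = {"large": "sizable", "generate": "produce", "text": "prose"}
    return toy_thesaurus.get(word)

def detector_score(text):
    # Placeholder detector: pretend flagged vocabulary drives the score.
    flagged = {"large", "generate", "text"}
    words = text.lower().split()
    return sum(w in flagged for w in words) / max(len(words), 1)

def attack(text, threshold=0.1):
    words = text.split()
    for i, w in enumerate(words):
        if detector_score(" ".join(words)) <= threshold:
            break  # detector evaded; stop editing
        repl = propose_synonym(w.lower(), words)
        if repl is not None:
            words[i] = repl  # greedy word-level substitution
    return " ".join(words)

sample = "large models generate text fluently"
print(attack(sample))  # "sizable models produce prose fluently"
```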
- Research Article
3
- 10.2196/56126
- Feb 5, 2025
- JMIR Formative Research
The COVID-19 pandemic has significantly strained health care systems globally, leading to an overwhelming influx of patients and exacerbating resource limitations. Concurrently, an "infodemic" of misinformation, particularly prevalent in women's health, has emerged. This challenge has been pivotal for health care providers, especially gynecologists and obstetricians, in managing pregnant women's health. The pandemic heightened risks for pregnant women from COVID-19, necessitating balanced advice from specialists on vaccine safety versus known risks. In addition, the advent of generative artificial intelligence (AI), such as large language models (LLMs), offers promising support in health care. However, they necessitate rigorous testing. This study aimed to assess LLMs' proficiency, clarity, and objectivity regarding COVID-19's impacts on pregnancy. This study evaluates 4 major AI prototypes (ChatGPT-3.5, ChatGPT-4, Microsoft Copilot, and Google Bard) using zero-shot prompts in a questionnaire validated among 159 Israeli gynecologists and obstetricians. The questionnaire assesses proficiency in providing accurate information on COVID-19 in relation to pregnancy. Text-mining, sentiment analysis, and readability (Flesch-Kincaid grade level and Flesch Reading Ease Score) were also conducted. In terms of LLMs' knowledge, ChatGPT-4 and Microsoft Copilot each scored 97% (32/33), Google Bard 94% (31/33), and ChatGPT-3.5 82% (27/33). ChatGPT-4 incorrectly stated an increased risk of miscarriage due to COVID-19. Google Bard and Microsoft Copilot had minor inaccuracies concerning COVID-19 transmission and complications. In the sentiment analysis, Microsoft Copilot achieved the least negative score (-4), followed by ChatGPT-4 (-6) and Google Bard (-7), while ChatGPT-3.5 obtained the most negative score (-12). Finally, concerning the readability analysis, Flesch-Kincaid Grade Level and Flesch Reading Ease Score showed that Microsoft Copilot was the most accessible at 9.9 and 49, followed by ChatGPT-4 at 12.4 and 37.1, while ChatGPT-3.5 (12.9 and 35.6) and Google Bard (12.9 and 35.8) generated particularly complex responses. The study highlights varying knowledge levels of LLMs in relation to COVID-19 and pregnancy. ChatGPT-3.5 showed the least knowledge and alignment with scientific evidence. Readability and complexity analyses suggest that each AI's approach was tailored to specific audiences, with ChatGPT versions being more suitable for specialized readers and Microsoft Copilot for the general public. Sentiment analysis revealed notable variations in the way LLMs communicated critical information, underscoring the essential role of neutral and objective health care communication in ensuring that pregnant women, particularly vulnerable during the COVID-19 pandemic, receive accurate and reassuring guidance. Overall, ChatGPT-4, Microsoft Copilot, and Google Bard generally provided accurate, updated information on COVID-19 and vaccines in maternal and fetal health, aligning with health guidelines. The study demonstrated the potential role of AI in supplementing health care knowledge, with a need for continuous updating and verification of AI knowledge bases. The choice of AI tool should consider the target audience and required information detail level.
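The readability figures above come from standard published formulas, which the sketch below implements. The syllable counter is a rough vowel-group heuristic, so its scores are approximate rather than reference values.

```python
# Sketch of the readability metrics used in the study: Flesch Reading
# Ease and Flesch-Kincaid Grade Level, per the standard formulas. The
# syllable counter is a crude heuristic, so outputs are approximate.
import re

def count_syllables(word):
    groups = re.findall(r"[aeiouy]+", word.lower())
    return max(1, len(groups))

def readability(text):
    sentences = max(1, len(re.findall(r"[.!?]+", text)))
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(count_syllables(w) for w in words)
    wps = len(words) / sentences              # words per sentence
    spw = syllables / max(1, len(words))      # syllables per word
    fres = 206.835 - 1.015 * wps - 84.6 * spw
    fkgl = 0.39 * wps + 11.8 * spw - 15.59
    return fres, fkgl

answer = ("COVID-19 vaccination is recommended during pregnancy. "
          "It reduces the risk of severe illness for mother and baby.")
fres, fkgl = readability(answer)
print(f"Reading Ease={fres:.1f}, Grade Level={fkgl:.1f}")
```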
- Book Chapter
- 10.3233/nhsdp250086
- Jan 8, 2026
The ongoing war in Ukraine presents unprecedented challenges to the mental health and readiness of its soldiers. Frontline mental health workers face immense pressure, operating with limited resources and often in isolation, while navigating complex clinical presentations amidst active combat. This chapter details the development and validation process of a novel Large Language Model (LLM) agent designed to serve as a digital companion for these frontline professionals. This AI tool aims to bridge the gap in immediate peer consultation and expert guidance, offering decision support grounded in established Combat and Operational Stress Control (COSC) principles and tailored to the unique cultural and operational context of the Ukrainian military. The development process involves collecting rich narrative case studies and decision-making challenges directly from experienced Ukrainian frontline mental health workers. These narratives form the basis for prompt engineering and few-shot learning, iteratively refining the LLM agent’s ability to understand context, assess symptom severity, functionality, and safety risks, and provide relevant disposition options (Return to Duty, In-Theater Support, Evacuation). The validation framework emphasizes accuracy against expert annotations, adherence to established COSC guidelines, and sensitivity to Ukrainian cultural nuances, including the historical context of mental health services and the importance of military identity rooted in figures like the Zaporozhian Cossacks. The chapter addresses key challenges, including overcoming stigma, the need for standardized assessment tools, and the ethical considerations of deploying AI in a high-stakes environment. By simulating collaborative decision-making, the AI companion seeks to enhance the capacity of frontline workers, promote consistent application of best practices like Combat Path Debriefing, and ultimately bolster the resilience and combat effectiveness of Ukrainian soldiers facing prolonged and intense operational stress. This project represents a crucial step in leveraging AI to augment human expertise in crisis situations, offering a scalable solution to support mental health providers on the most demanding frontlines.
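A hedged sketch of the few-shot prompt construction the chapter describes: annotated narrative cases with expert dispositions are prepended to a new case so the LLM agent can pattern-match. The case texts and labels here are invented stand-ins for the annotated Ukrainian frontline narratives.

```python
# Hedged sketch of few-shot prompt assembly for COSC disposition support.
# Case narratives and labels are invented placeholders.

DISPOSITIONS = ("Return to Duty", "In-Theater Support", "Evacuation")

FEW_SHOT_CASES = [
    ("Soldier reports insomnia and irritability after a week of shelling; "
     "remains functional and wants to stay with the unit.",
     "In-Theater Support"),
    ("Soldier shows disorientation, cannot complete basic tasks, and "
     "expresses suicidal ideation.",
     "Evacuation"),
]

def build_prompt(new_case: str) -> str:
    lines = ["You are a COSC decision-support assistant. "
             f"Choose one disposition: {', '.join(DISPOSITIONS)}.\n"]
    for narrative, label in FEW_SHOT_CASES:
        lines.append(f"Case: {narrative}\nDisposition: {label}\n")
    lines.append(f"Case: {new_case}\nDisposition:")
    return "\n".join(lines)

print(build_prompt("Soldier is exhausted but oriented, asking for rest."))
```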
- Research Article
19
- 10.1177/10946705251340487
- May 29, 2025
- Journal of Service Research
We explore the transformative impact of integrating generative artificial intelligence (GenAI) in the form of large language models (LLMs), large behavioral models (LBMs), and agentic AI into physical service robots and how these will transform physical service encounters. This conceptual article first shows that GenAI-powered service robots (also referred to as GenAI robots) will be able to autonomously deliver more complex, customized, and personalized customer service. Second, GenAI’s increasing capacity for no-code programming is expected to democratize robot training, improvement, and fine-tuning by frontline employees, thus improving robot performance. Third, the implications of GenAI robots are outlined for frontline employees (i.e., their work and job scopes, and a new role as citizen developer), customers (i.e., improved customer experiences and service outcomes), and the service firm (i.e., a pathway to cost-effective service excellence, continuous improvement and agility, alleviation of labor shortage, and the introduction of new ethical, fairness, privacy, health, and safety risks into physical service encounters). This article is the first to explore the theoretical and practical implications of GenAI robots in physical service encounters and opens a new stream of service research.