Reporting Guidelines for Large Language Models in Human–Robot Interaction
The comparatively recent advent of Large Language Models (LLMs) has resulted in a wide array of new capabilities and components relevant to Human–Robot Interaction (HRI) researchers. LLMs are being applied to vision, manipulation, planning, reasoning, learning, and HRI problems, frequently as “Scarecrows,” in which LLMs serve as black-box modules integrated into robot architectures for the purpose of quickly enabling full-pipeline solutions. However, despite this explosion of applications, general questions remain about the best ways to incorporate LLMs into robot architectures, appropriate safety and guardrail considerations, and, critically, how to properly report on HRI research that involves LLMs. In this article, we explore the question of reporting guidelines for HRI researchers who utilize Scarecrows in robot architectures. We identify five key stakeholder groups in the HRI research process, discuss what information each group needs from HRI researchers, and identify appropriate mechanisms for conveying that information from HRI researchers to stakeholders, either directly or indirectly. We contribute a set of suggested guidelines regarding what information should be included when researchers disseminate information about HRI research that uses LLMs.
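The full guideline list appears in the article itself, but the kind of disclosure it argues for can be made concrete with a small sketch. Below is a hypothetical, machine-readable reporting record for an LLM used inside a robot architecture; every field name here is an illustrative assumption, not the article's actual guideline set.

```python
# Illustrative sketch only: a hypothetical record of the kind of information
# that reporting guidelines for LLM-based HRI research might require.
# Field names are assumptions for illustration, not the article's list.
from dataclasses import dataclass, field


@dataclass
class LLMUsageReport:
    """Hypothetical disclosure record for an LLM used in a robot architecture."""
    model_name: str             # e.g., "gpt-4" (illustrative)
    model_version: str          # exact version/date, since hosted models drift
    architectural_role: str     # which pipeline component the LLM stands in for
    prompts: list[str] = field(default_factory=list)      # verbatim prompts used
    sampling_params: dict = field(default_factory=dict)   # temperature, top_p, ...
    guardrails: str = ""        # safety filters or human oversight applied
    known_failure_modes: str = ""   # observed errors relevant to stakeholders


report = LLMUsageReport(
    model_name="example-llm",
    model_version="2024-01-01",
    architectural_role="dialogue response generation",
    sampling_params={"temperature": 0.7},
)
```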
- Research Article
- 10.1287/ijds.2023.0007
- Apr 1, 2023
- INFORMS Journal on Data Science
How Can IJDS Authors, Reviewers, and Editors Use (and Misuse) Generative AI?
- Research Article
- 10.28945/5693
- Jan 1, 2026
- Journal of Information Technology Education: Research
Aim/Purpose: The study investigates the factors influencing the acceptance and use of large language models (LLMs), such as ChatGPT, in learning design by instructional designers and university teaching academics from various countries.
Background: LLMs have exploded onto the scene, transforming the landscape of learning design. Instructional designers and university teaching academics are overburdened with content creation for their teaching programmes, and LLMs can help by enabling more interactive content that drives student engagement and, in turn, contributes to student success. Since LLMs are a relatively new phenomenon, little is known about the factors influencing their acceptance in learning design; this research is therefore needed, as learning design principles are the bedrock of student engagement and success.
Methodology: A cross-sectional correlational quantitative study was employed. Data were collected from 203 instructional designers and university teaching academics using an online questionnaire posted on social media, including LinkedIn. Purposive and snowball sampling targeted instructional designers and university teaching academics at colleges and universities worldwide; participants were asked to share the survey link with fellow instructional designers and university teaching academics in their communities. The factor structure of the data was determined using exploratory factor analysis. Notably, the resulting factor structure did not entirely reflect the original configuration of the Unified Theory of Acceptance and Use of Technology (UTAUT3), as certain predictors coalesced, indicating the unique nature of LLMs in learning design. Confirmatory factor analysis verified the fit of the data to the measurement model, and first-order and second-order structural modelling identified the structural relationships among the variables.
Contribution: The study identifies significant factors in the acceptance of LLMs by instructional designers and academic teaching staff in learning design, opening opportunities for best practices in the field through interventions that optimise LLM usage. It applies the technology acceptance model to the emerging LLM technology and extends it by adding trust as a predictor variable.
Findings: The structural analysis indicated that ingrained LLM practices, LLM peer-driven expectations, innovative propensity towards LLM adoption, reliability and provider trust in LLMs, and ease of use and support influenced perceived LLM benefits and usage, whereas community standards and infrastructure had no influence. Second-order structural equation modelling indicated that perceived LLM benefits and usage and ingrained LLM habits contributed most to learning design.
Recommendations for Practitioners: Teaching academics and instructional designers should use LLMs in designing content, assessments, and interactive learning activities, and attend LLM training workshops on prompting and best practices for integrating LLMs into learning and teaching. Regular use of LLMs will then build trust and innovation in LLM usage, enhancing learning design and improving student learning outcomes.
Recommendations for Researchers: Researchers should use mixed-methods approaches to gain a deeper understanding of the factors influencing LLM acceptance. Since habit and perceived LLM benefits and usage contributed the most variance to learning design, researchers should investigate strategies that optimise these factors, such as effective interventions that help form positive LLM habits. The findings also provide a starting point for future research: researchers should investigate interventions that strengthen the influence of personal innovativeness and trust, which contributed the least variance to learning design, thereby unlocking the potential of LLMs in learning design through innovative, responsible, and ethical use.
Impact on Society: The use of LLMs in learning design has a high potential to transform education, specifically the learning design landscape. LLMs will free up time for teaching academics and instructional designers to spend on higher-order thinking demands; consequently, students will be exposed to more engaging and interactive content, resulting in improved learning outcomes.
Future Research: Future research should include context-derived external variables in technology acceptance models, such as levels of prompting competency, to provide a deeper understanding of LLM acceptance. It should also examine the application and impact of LLMs on student engagement, student success, and the attainment of 21st-century skills.
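The two-stage analysis this abstract describes (exploratory factor analysis to recover the factor structure, then structural equation modelling over the recovered factors) can be sketched in a few lines. The snippet below is a minimal illustration using the factor_analyzer and semopy packages; the file name, item columns (q1..q6), and the two-factor specification are hypothetical, not the study's instrument.

```python
# A minimal sketch of an EFA-then-SEM pipeline, under assumed column names.
# Not the authors' code or model specification.
import pandas as pd
from factor_analyzer import FactorAnalyzer   # pip install factor_analyzer
import semopy                                 # pip install semopy

df = pd.read_csv("survey_responses.csv")      # hypothetical file of Likert items

# Step 1: exploratory factor analysis to see how items coalesce.
efa = FactorAnalyzer(n_factors=2, rotation="promax")
efa.fit(df[["q1", "q2", "q3", "q4", "q5", "q6"]])
print(efa.loadings_)

# Step 2: structural model relating the recovered factors (illustrative spec).
spec = """
Habit =~ q1 + q2 + q3
Benefit =~ q4 + q5 + q6
Benefit ~ Habit
"""
sem = semopy.Model(spec)
sem.fit(df)
print(semopy.calc_stats(sem))  # fit indices for the illustrative model
```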
- Research Article
- 10.1177/154193120905301855
- Oct 1, 2009
- Proceedings of the Human Factors and Ergonomics Society Annual Meeting
This paper examines the intricacies of applied robotics research, both in the laboratory and in the field. It describes some of the differences between lab and field studies that researchers must work diligently to reduce, including differences in technological capabilities, team composition, and system reliability. Rather than report the results of a single study, the purpose of this paper is to bring the Human Robot Interaction (HRI) research community closer as a whole. It also serves to emphasize the real-life implications of applied laboratory efforts, as opposed to fixating on the statistics alone. Specific ‘lessons learned’ with respect to successful, and not-so-successful, strategies for conducting lab-based HRI research are also included. Finally, a testing facility for continued congruence between lab and field HRI research is proposed.
- Conference Article
- 10.1109/hri.2019.8673123
- Mar 1, 2019
Previous research in moral psychology and human-robot interaction has shown that technology shapes human morality, and research in human-robot interaction has shown that humans naturally perceive robots as moral agents. Accordingly, we propose that language-capable autonomous robots are uniquely positioned among technologies to significantly impact human morality. We therefore argue that it is imperative that language-capable robots behave according to human moral norms and communicate in such a way that their intention to adhere to those norms is clear. Unfortunately, the design of current natural-language-oriented robot architectures enables certain architectural components to circumvent or preempt those architectures' moral reasoning capabilities. In this paper, we show how this may occur, using clarification request generation in current dialog systems as a motivating example. Furthermore, we present experimental evidence that the types of behavior exhibited by current approaches to clarification request generation can cause robots to (1) miscommunicate their moral intentions and (2) weaken humans' perceptions of moral norms within the current context. This work strengthens previous preliminary findings, and does so within an experimental paradigm that provides increased external and ecological validity over earlier approaches.
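To illustrate the architectural point, here is a toy sketch (all names hypothetical) of the mitigation this analysis suggests: candidate clarification requests are passed through a moral filter before being uttered, rather than being dispatched directly by the dialog component.

```python
# Toy sketch of gating dialog output through moral reasoning.
# All names are hypothetical; the "moral reasoner" is a keyword stub.

def moral_filter(utterance: str, norm_keywords=("harm", "steal")) -> bool:
    """Stand-in for a moral reasoner: reject utterances that would
    implicitly signal willingness to violate a norm."""
    return not any(word in utterance.lower() for word in norm_keywords)


def generate_clarification(ambiguous_command: str) -> str:
    # A naive dialog system might ask this without moral review:
    return f"Do you want me to {ambiguous_command} now?"


def respond(command: str) -> str:
    candidate = generate_clarification(command)
    if moral_filter(candidate):
        return candidate
    return "I won't do that: it conflicts with a norm I follow."


print(respond("fetch the red mug"))   # clarification passes the filter
print(respond("steal the red mug"))   # clarification is preempted
```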
- Research Article
- 10.1016/j.chbah.2023.100039
- Dec 25, 2023
- Computers in Human Behavior: Artificial Humans
Privacy and utility perceptions of social robots in healthcare
- Research Article
- 10.1177/154193120504900351
- Sep 1, 2005
- Proceedings of the Human Factors and Ergonomics Society Annual Meeting
HRI is an excellent candidate for simulator-based research because of the relative simplicity of the systems being modeled, the behavioral fidelity possible with current physics engines, and the capability of modern graphics cards to approximate camera video. In this paper we briefly introduce the USARsim simulation and discuss efforts to validate its behavior for use in Human Robot Interaction (HRI) research.
- Supplementary Content
- 10.1108/ir-02-2025-0074
- Jul 29, 2025
- Industrial Robot: the international journal of robotics research and application
Purpose: This study aims to explore the integration of large language models (LLMs) and vision-language models (VLMs) in robotics, highlighting their potential benefits and the safety challenges they introduce, including robustness issues, adversarial vulnerabilities, privacy concerns and ethical implications.
Design/methodology/approach: This survey conducts a comprehensive analysis of the safety risks associated with LLM- and VLM-powered robotic systems. The authors review existing literature, analyze key challenges, evaluate current mitigation strategies and propose future research directions.
Findings: The study identifies that ensuring the safety of LLM-/VLM-driven robots requires a multi-faceted approach. While current mitigation strategies address certain risks, gaps remain in real-time monitoring, adversarial robustness and ethical safeguards.
Originality/value: This study offers a structured and comprehensive overview of the safety challenges in LLM-/VLM-driven robotics. It contributes to ongoing discussions by integrating technical, ethical and regulatory perspectives to guide future advancements in safe and responsible artificial intelligence-driven robotics.
- Research Article
- 10.1038/s41698-025-00916-7
- May 23, 2025
- npj Precision Oncology
Large language models (LLMs) and large visual-language models (LVLMs) have exhibited near-human levels of knowledge, image comprehension, and reasoning ability, and their performance has been evaluated in some healthcare domains. However, a systematic evaluation of their capabilities in cervical cytology screening has yet to be conducted. Here, we constructed CCBench, a benchmark dataset dedicated to the evaluation of LLMs and LVLMs in cervical cytology screening, and developed a GPT-based semi-automatic evaluation pipeline to assess the performance of six LLMs (GPT-4, Bard, Claude-2.0, LLaMa-2, Qwen-Max, and ERNIE-Bot-4.0) and five LVLMs (GPT-4V, Gemini, LLaVA, Qwen-VL, and ViLT) on this dataset. CCBench comprises 773 question-answer (QA) pairs and 420 visual question-answer (VQA) triplets, making it the first dataset in cervical cytology to include both QA and VQA data. We found that LLMs and LVLMs demonstrate promising accuracy and specialization in cervical cytology screening. GPT-4 achieved the best performance on the QA dataset, with an accuracy of 70.5% on closed-ended questions and an average expert evaluation score of 6.9/10 on open-ended questions. On the VQA dataset, Gemini achieved the highest accuracy on closed-ended questions at 67.8%, while GPT-4V attained the highest expert evaluation score of 6.1/10 on open-ended questions. In addition, LLMs and LVLMs showed varying abilities in answering questions across different topics and difficulty levels. However, their performance remains inferior to the expertise of cytopathology professionals, and the risk of generating misinformation could lead to potential harm. Substantial improvements are therefore required before these models can be reliably deployed in clinical practice.
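The evaluation protocol described here combines exact-match scoring for closed-ended items with model-assisted grading for open-ended ones. The snippet below is a minimal sketch of that pattern, assuming the openai Python client; the grading prompt, model name, and score scale are illustrative, not the CCBench pipeline itself.

```python
# Sketch of a two-mode benchmark evaluation: exact-match accuracy plus
# LLM-assisted grading. Prompt, model, and data format are assumptions.
from openai import OpenAI  # pip install openai; requires OPENAI_API_KEY

client = OpenAI()

def closed_accuracy(preds: list[str], golds: list[str]) -> float:
    """Exact-match accuracy for closed-ended questions."""
    correct = sum(p.strip().lower() == g.strip().lower()
                  for p, g in zip(preds, golds))
    return correct / len(golds)

def gpt_score(question: str, reference: str, answer: str) -> str:
    """Ask a grader model for a 0-10 score against a reference answer."""
    resp = client.chat.completions.create(
        model="gpt-4",
        messages=[{
            "role": "user",
            "content": (f"Question: {question}\nReference: {reference}\n"
                        f"Answer: {answer}\nScore the answer 0-10. "
                        "Reply with the number only."),
        }],
    )
    return resp.choices[0].message.content
```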
- Research Article
- 10.1001/jamanetworkopen.2023.46721
- Dec 7, 2023
- JAMA network open
Recent advancements in large language models (LLMs) have shown potential in a wide array of applications, including health care. While LLMs have shown heterogeneous results across specialized medical board examinations, their performance on neurology board examinations remains unexplored. The objective of this cross-sectional study, conducted between May 17 and May 31, 2023, was to assess the performance of LLMs on neurology board-style examinations. The evaluation utilized a question bank approved by the American Board of Psychiatry and Neurology and validated with a small question cohort by the European Board for Neurology. All questions were categorized into lower-order (recall, understanding) and higher-order (apply, analyze, synthesize) questions based on the Bloom taxonomy for learning and assessment. Performance of ChatGPT versions 3.5 (LLM 1) and 4 (LLM 2) was assessed in relation to overall scores, question type, and topics, along with the confidence level and reproducibility of answers; the main outcome measure was the overall percentage score of the 2 LLMs. LLM 2 significantly outperformed LLM 1, correctly answering 1662 of 1956 questions (85.0%) vs 1306 (66.8%) for LLM 1. Notably, LLM 2's performance exceeded the mean human score of 73.8%, effectively achieving near-passing and passing grades on the neurology board examination. LLM 2 outperformed human users on behavioral, cognitive, and psychological questions and demonstrated superior performance to LLM 1 in 6 categories. Both LLMs performed better on lower-order than higher-order questions, with LLM 2 excelling at both. Both models consistently used confident language, even when providing incorrect answers. Reproducible answers from both LLMs were associated with a higher percentage of correct answers than inconsistent answers. Despite the absence of neurology-specific training, LLM 2 demonstrated commendable performance, whereas LLM 1 performed slightly below the human average. While higher-order cognitive tasks were more challenging for both models, LLM 2's results were equivalent to passing grades on specialized neurology examinations. These findings suggest that, with further refinement, LLMs could have significant applications in clinical neurology and health care.
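The reproducibility analysis reported here (repeated queries, then comparing accuracy between consistent and inconsistent answers) follows a simple pattern, sketched below with a hypothetical ask_model callable standing in for the LLM.

```python
# Sketch of a reproducibility analysis: query each question several times,
# call the answer "reproducible" if every run agrees, then compare accuracy
# between reproducible and inconsistent items. `ask_model` is hypothetical.
from collections import Counter

def analyze_reproducibility(questions, golds, ask_model, runs=3):
    stats = {"reproducible": [0, 0], "inconsistent": [0, 0]}  # [correct, total]
    for q, gold in zip(questions, golds):
        answers = [ask_model(q) for _ in range(runs)]
        majority, count = Counter(answers).most_common(1)[0]
        group = "reproducible" if count == runs else "inconsistent"
        stats[group][0] += int(majority == gold)
        stats[group][1] += 1
    # Accuracy per group (None if a group received no items).
    return {g: (c / t if t else None) for g, (c, t) in stats.items()}
```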
- Research Article
- 10.1145/3606261
- Jan 30, 2024
- ACM Transactions on Human-Robot Interaction
The proliferation of Large Language Models (LLMs) presents both a critical design challenge and a remarkable opportunity for the field of Human–Robot Interaction (HRI). While the direct deployment of LLMs on interactive robots may be unsuitable for reasons of ethics, safety, and control, LLMs might nevertheless provide a promising baseline technique for many elements of HRI. Specifically, in this article, we argue for the use of LLMs as Scarecrows: “brainless,” straw-man black-box modules integrated into robot architectures for the purpose of quickly enabling full-pipeline solutions, much like the use of “Wizard of Oz” (WoZ) and other human-in-the-loop approaches. We explicitly acknowledge that these Scarecrows, rather than providing a satisfying or scientifically complete solution, incorporate a form of the wisdom of the crowd and, in at least some cases, will ultimately need to be replaced or supplemented by a robust and theoretically motivated solution. We provide examples of how Scarecrows could be used in language-capable robot architectures as useful placeholders and suggest initial reporting guidelines for authors, mirroring existing guidelines for the use and reporting of WoZ techniques.
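As a concrete and entirely hypothetical illustration of the Scarecrow pattern the article proposes, the sketch below drops an LLM behind the same interface as a principled referring-expression generator, flagging it explicitly as a placeholder so its use can be disclosed in line with the suggested reporting guidelines.

```python
# Hypothetical sketch: an LLM wrapped behind the interface of a principled
# module so the full pipeline runs while the "real" component is built.
from abc import ABC, abstractmethod

class ReferringExpressionGenerator(ABC):
    @abstractmethod
    def describe(self, target: dict, scene: list[dict]) -> str: ...

class ScarecrowREG(ReferringExpressionGenerator):
    """Black-box placeholder: delegates to an LLM and is flagged as such."""
    is_scarecrow = True  # surfaced in logs and papers for transparent reporting

    def __init__(self, llm_call):
        self.llm_call = llm_call  # any callable mapping prompt -> text

    def describe(self, target, scene):
        prompt = (f"Objects in view: {scene}. "
                  f"Briefly describe {target} so a human can pick it out.")
        return self.llm_call(prompt)
```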
- Research Article
- 10.3389/frobt.2023.1212034
- Sep 14, 2023
- Frontiers in Robotics and AI
This paper focuses on the topic of "everyday life" as it is addressed in Human-Robot Interaction (HRI) research. It starts from the argument that while human daily life with social robots has been increasingly discussed and studied in HRI, the concept of everyday life lacks clarity and systematic analysis, playing only a secondary role in supporting the study of key HRI topics. To help conceptualise everyday life as a research theme in HRI in its own right, we provide an overview of Social Science and Humanities (SSH) perspectives on everyday life and lived experiences, particularly in sociology, and identify the key elements that may serve to further develop and empirically study such a concept in HRI. We propose new angles of analysis that may help better explore unique aspects of human engagement with social robots. We look at the everyday not just as a reality as we know it (i.e., the realm of the "ordinary") but also as the future that we need to envision and strive to materialise (i.e., the transformation that will take place through the "extraordinary" that comes with social robots). Finally, we argue that HRI research would benefit from engaging not only with a systematic conceptualisation of contemporary everyday life with social robots but also with its critique. In this way, HRI studies could play an important role in challenging current ways of understanding what makes different aspects of the human world "natural" and ultimately help bring about social change towards what we consider a "good life."
- Research Article
- 10.1080/13658816.2025.2577252
- Nov 1, 2025
- International Journal of Geographical Information Science
The widespread use of online geoinformation platforms, such as Google Earth Engine (GEE), has produced numerous scripts. Extracting domain knowledge from these crowdsourced scripts supports understanding of geoprocessing workflows. Small Language Models (SLMs) are effective for semantic embedding but struggle with complex code; Large Language Models (LLMs) can summarize scripts, yet lack consistent geoscience terminology to express knowledge. In this paper, we propose Geo-CLASS, a knowledge extraction framework for geospatial analysis scripts that coordinates large and small language models. Specifically, we designed domain-specific schemas and a schema-aware prompt strategy to guide LLMs to generate and associate entity descriptions, and employed SLMs to standardize the outputs by mapping these descriptions to a constructed geoscience knowledge base. Experiments on 237 GEE scripts, selected from 295,943 scripts in total, demonstrated that our framework outperformed LLM baselines, including Llama-3, GPT-3.5, and GPT-4o, improving accuracy in recognizing entities and relations by up to 31.9% and 12.0%, respectively. Ablation studies and performance analysis further confirmed the effectiveness of key components and the robustness of the framework. Geo-CLASS has the potential to enable the construction of geoprocessing modeling knowledge graphs, facilitate domain-specific reasoning, and advance script generation via Retrieval-Augmented Generation (RAG).
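The large/small model coordination described here can be sketched compactly: an LLM (omitted below) produces free-text entity descriptions, and a small embedding model standardizes each description by mapping it to the nearest term in a controlled geoscience vocabulary. The snippet uses the sentence-transformers package; the knowledge-base terms are hypothetical, not the Geo-CLASS schema.

```python
# Sketch of SLM-based standardization: map free-text LLM descriptions onto
# a controlled vocabulary by embedding similarity. KB terms are hypothetical.
from sentence_transformers import SentenceTransformer, util  # pip install sentence-transformers

slm = SentenceTransformer("all-MiniLM-L6-v2")

kb_terms = ["NDVI computation", "cloud masking", "image compositing"]
kb_emb = slm.encode(kb_terms, convert_to_tensor=True)

def standardize(description: str) -> str:
    """Map a free-text LLM description onto the controlled vocabulary."""
    emb = slm.encode(description, convert_to_tensor=True)
    scores = util.cos_sim(emb, kb_emb)[0]
    return kb_terms[int(scores.argmax())]

print(standardize("computes a vegetation index from red and NIR bands"))
```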
- Research Article
- 10.1016/j.joms.2024.11.007
- Mar 1, 2025
- Journal of Oral and Maxillofacial Surgery
Evaluating Artificial Intelligence Chatbots in Oral and Maxillofacial Surgery Board Exams: Performance and Potential
- Research Article
- 10.2196/59641
- Aug 29, 2024
- JMIR infodemiology
Manually analyzing public health-related content from social media provides valuable insights into the beliefs, attitudes, and behaviors of individuals, shedding light on trends and patterns that can inform public understanding, policy decisions, targeted interventions, and communication strategies. Unfortunately, the time and effort needed from well-trained human subject matter experts makes extensive manual social media listening unfeasible. Generative large language models (LLMs) can potentially summarize and interpret large amounts of text, but it is unclear to what extent LLMs can glean subtle health-related meanings in large sets of social media posts and reasonably report health-related themes. We aimed to assess the feasibility of using LLMs for topic model selection or inductive thematic analysis of large contents of social media posts by attempting to answer the following question: Can LLMs conduct topic model selection and inductive thematic analysis as effectively as humans did in a prior manual study, or at least reasonably, as judged by subject matter experts? We asked the same research question and used the same set of social media content for both the LLM selection of relevant topics and the LLM analysis of themes as was conducted manually in a published study about vaccine rhetoric. We used the results from that study as background for this LLM experiment by comparing the results from the prior manual human analyses with the analyses from 3 LLMs: GPT4-32K, Claude-instant-100K, and Claude-2-100K. We also assessed if multiple LLMs had equivalent ability and assessed the consistency of repeated analysis from each LLM. The LLMs generally gave high rankings to the topics chosen previously by humans as most relevant. We reject a null hypothesis (P<.001, overall comparison) and conclude that these LLMs are more likely to include the human-rated top 5 content areas in their top rankings than would occur by chance. Regarding theme identification, LLMs identified several themes similar to those identified by humans, with very low hallucination rates. Variability occurred between LLMs and between test runs of an individual LLM. Despite not consistently matching the human-generated themes, subject matter experts found themes generated by the LLMs were still reasonable and relevant. LLMs can effectively and efficiently process large social media-based health-related data sets. LLMs can extract themes from such data that human subject matter experts deem reasonable. However, we were unable to show that the LLMs we tested can replicate the depth of analysis from human subject matter experts by consistently extracting the same themes from the same data. There is vast potential, once better validated, for automated LLM-based real-time social listening for common and rare health conditions, informing public health understanding of the public's interests and concerns and determining the public's ideas to address them.
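The core comparison in this study, whether an LLM's top-ranked topics include the human-rated top five content areas, reduces to a simple overlap measure, sketched below with hypothetical topic labels.

```python
# Sketch of the human-vs-LLM topic ranking comparison. Labels are hypothetical.
def top_k_overlap(human_top5: list[str], llm_ranking: list[str], k: int = 5) -> float:
    """Fraction of the human top-5 topics found in the LLM's top-k ranking."""
    llm_top = set(llm_ranking[:k])
    return len(set(human_top5) & llm_top) / len(human_top5)

human_top5 = ["vaccine safety", "mandates", "side effects", "efficacy", "trust"]
llm_ranking = ["mandates", "vaccine safety", "misinformation", "efficacy",
               "side effects", "trust"]
print(top_k_overlap(human_top5, llm_ranking))  # 0.8 for this toy example
```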
- Research Article
- 10.55041/ijsrem27928
- Jan 4, 2024
- International Journal of Scientific Research in Engineering and Management
Robotic systems often require engineers to write code to specify the desired behaviour of the robots. This process is slow, costly, and inefficient, as it involves multiple iterations and manual tuning. ChatGPT is a tool that leverages a large language model (LLM) to enable natural language interaction, code generation, and learning from feedback for robotic applications. ChatGPT allows users who may not have technical expertise to provide high-level instructions and feedback to the LLM while observing the robot's performance. ChatGPT can produce code for various robot scenarios, using the LLM's knowledge to control different robotic factors. ChatGPT can also be integrated with other platforms, such as Snapchat and Duolingo, to enhance the user experience and management. ChatGPT is a novel tool that facilitates a new paradigm in robotics, in which users can communicate with and teach robots using natural language.
Keywords: ChatGPT, Large Language Model, Natural Language Processing, Human Robot Interaction
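The interaction loop this abstract describes (instruction, generated code, observed execution, user feedback, revision) can be sketched as follows; llm and execute_on_robot are hypothetical stand-ins for an LLM call and a robot or simulator interface.

```python
# Sketch of a natural-language robot programming loop with user feedback.
# `llm` and `execute_on_robot` are hypothetical callables.
def refine_robot_behavior(llm, execute_on_robot, instruction: str, max_rounds: int = 3):
    history = [f"Write robot control code to: {instruction}"]
    code = llm("\n".join(history))
    for _ in range(max_rounds):
        execute_on_robot(code)                      # user observes the result
        feedback = input("Feedback (or 'done'): ")
        if feedback.strip().lower() == "done":
            break
        history += [f"Previous code:\n{code}", f"User feedback: {feedback}",
                    "Revise the code accordingly."]
        code = llm("\n".join(history))
    return code
```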