Enabling Uniform Computer Interaction Experience for Blind Users through Large Language Models.
Blind individuals, who by necessity depend on screen readers to interact with computers, face considerable challenges in navigating the diverse and complex graphical user interfaces of different computer applications. The heterogeneity of application interfaces often requires blind users to remember different keyboard combinations and navigation methods to use each application effectively. To alleviate this significant interaction burden imposed by heterogeneous application interfaces, we present Savant, a novel assistive technology powered by large language models (LLMs) that allows blind screen reader users to interact uniformly with any application interface through natural language. A novel aspect of Savant is that it can automate a series of tedious screen reader actions on the control elements of an application when prompted by a natural language command from the user. These commands are flexible in that the user is not strictly required to specify the exact names of the control elements. A user study with 11 blind participants demonstrated significant improvements in interaction efficiency and usability with Savant compared to current practices.
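As a minimal sketch of the command-to-action idea described above, the snippet below shows how a natural language command might be turned into an LLM prompt over known control elements and how a JSON action plan could be parsed back; the prompt wording, control names, and action schema are illustrative assumptions, not Savant's actual design.

```python
# Hypothetical sketch of mapping a natural language command to screen reader
# actions over known control elements; not Savant's actual implementation.
import json

def build_prompt(command: str, controls: list[str]) -> str:
    """Ask the LLM for a JSON plan of (control, action) steps."""
    return (
        "You drive a screen reader. Available controls: " + ", ".join(controls) + ".\n"
        f"User command: {command!r}\n"
        'Reply with a JSON list of steps, each {"control": <name>, "action": <verb>}.'
    )

def parse_plan(llm_reply: str) -> list[dict]:
    """Keep only well-formed steps from the model's JSON reply."""
    steps = json.loads(llm_reply)
    return [s for s in steps if {"control", "action"} <= s.keys()]

# Demo with a canned reply standing in for a real LLM call:
print(build_prompt("search for accessibility settings", ["Search box", "Settings menu"]))
reply = '[{"control": "Search box", "action": "focus"}, {"control": "Search box", "action": "type"}]'
for step in parse_plan(reply):
    print(f"{step['action']} -> {step['control']}")
```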
- Research Article
11
- 10.1287/ijds.2023.0007
- Apr 1, 2023
- INFORMS Journal on Data Science
How Can IJDS Authors, Reviewers, and Editors Use (and Misuse) Generative AI?
- Research Article
8
- 10.1145/3465336.3475116
- Aug 25, 2021
- HT: Proceedings of the ACM Conference on Hypertext and Social Media
Blind users interact with smartphone applications using a screen reader, an assistive technology that enables them to navigate and listen to application content using touch gestures. Since blind users rely on screen reader audio, interacting with online videos can be challenging due to the screen reader audio interfering with the video sounds. Existing solutions to this interference problem are predominantly designed for desktop scenarios, where special keyboard or mouse actions provide 'silent' and direct access to various video controls such as play, pause, and the progress bar. As these solutions do not transfer to smartphones, suitable alternatives are needed. We therefore explore the potential of smartphone motion gestures as an effective and convenient method for blind screen reader users to interact with online videos. Specifically, we designed and developed YouTilt, an Android application that enables screen reader users to use an assortment of motion gestures to access and manipulate various video controls. We then conducted a user study with 10 blind participants to investigate whether blind users can leverage YouTilt to properly execute motion gestures for video-interaction tasks while simultaneously listening to video sounds. Analysis of the study data showed a significant improvement in usability, by as much as 43.3% on average, with YouTilt compared to the default screen reader, and an overall positive attitude toward and acceptance of motion gesture-based video interaction.
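A minimal illustration of the motion-gesture idea described above: classifying a tilt reading into a video control. The axis names, thresholds, and gesture-to-control mapping are assumptions for the sketch, not YouTilt's actual gesture set.

```python
# Illustrative tilt-to-control classifier; thresholds and mappings are assumed,
# not taken from YouTilt.
def classify_tilt(pitch_deg: float, roll_deg: float) -> str | None:
    """Map a deliberate tilt to a video control, leaving screen reader audio free."""
    if pitch_deg > 30:
        return "play_pause"       # tilt away from the user
    if pitch_deg < -30:
        return "toggle_captions"  # tilt toward the user
    if roll_deg > 30:
        return "seek_forward"     # tilt right
    if roll_deg < -30:
        return "seek_backward"    # tilt left
    return None                   # small movements are ignored

print(classify_tilt(35.0, 2.0))   # -> play_pause
```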
- Conference Article
14
- 10.1145/3491102.3501958
- Apr 29, 2022
Most people who are blind interact with social media content with the assistance of a screen reader, software that converts text to speech. However, the language used in social media is well known to contain many informal out-of-vocabulary words (e.g., abbreviations, wordplays, slang), many of which do not have corresponding standard pronunciations. The narration behavior of screen readers for such out-of-vocabulary words, and the corresponding impact on the social media experience of blind screen reader users, are still uncharted research territory. Therefore, we seek to plug this knowledge gap by examining how current popular screen readers narrate different types of out-of-vocabulary words found on Twitter, and how the presence of such words in tweets influences the interaction behavior and comprehension of blind screen reader users. Our investigation showed that screen readers rarely autocorrect out-of-vocabulary words and, moreover, do not always exhibit ideal behavior for certain prolific types of out-of-vocabulary words such as acronyms and initialisms. We also observed that blind users often rely on tedious and taxing workarounds to comprehend the actual meanings of out-of-vocabulary words. Informed by these observations, we discuss methods that can potentially reduce this interaction burden for blind users on social media.
- Research Article
48
- 10.1093/jamia/ocae074
- Apr 24, 2024
- Journal of the American Medical Informatics Association : JAMIA
Large language models for biomedicine: foundations, opportunities, challenges, and best practices.
- Conference Article
12
- 10.1145/3441852.3471205
- Oct 17, 2021
Screen reader plugins are small pieces of code that blind users can download and install to enhance the capabilities of their screen readers. In this paper, we aim to understand the user experience of screen reader plugins, as well as their developers, distribution model, and maintenance. To this end, we conducted a study with 14 blind screen reader users. Our study revealed that screen reader users rely on plugins for various reasons, e.g., to improve the usability of both screen readers and application software, to make partially accessible applications accessible, and to enable custom shortcuts and commands. Furthermore, installing plugins is easy; uninstalling them is unlikely; and finding them online is ad hoc, challenging, and poses security threats. In addition, developing screen reader plugins is technically demanding; only a handful of people develop plugins, and they are well recognized in the community. Finally, there is no central repository of plugins for most screen readers, and most plugins do not receive updates from their developers and become obsolete. The lack of financial incentives plays a role in the slow growth of the plugin ecosystem. Based on our findings, we recommend creating a central repository for all plugins, engaging third-party developers, and raising general awareness about the benefits and dangers of plugins. We believe our findings will inspire researchers to embrace the plugin-based distribution model as an effective way to combat application-level accessibility issues.
- Conference Article
1
- 10.4271/2025-01-7146
- Feb 21, 2025
- SAE Technical Paper Series
Benchmarking Large Language Models for Motorway Driving Scenario Understanding
- Research Article
6
- 10.3390/electronics13112002
- May 21, 2024
- Electronics
Large language models (LLMs) have demonstrated remarkable capabilities in natural language processing tasks, including conversation, in-context learning, reasoning, and code generation. This paper explores the potential application of LLMs in radiological information systems (RIS) and assesses the impact of integrating LLMs on RIS development and human–computer interaction. We present ChatUI-RIS, a prototype chat-based user interface that leverages LLM capabilities to enhance RIS functionality and user experience. Through an exploratory study involving 26 medical students, we investigate the efficacy of natural language dialogue for learning and operating RIS. Our findings suggest that LLM integration via a chat interface can significantly improve operational efficiency, reduce learning time, and facilitate rapid expansion of RIS capabilities. By interacting with ChatUI-RIS using natural language instructions, medical students can access and retrieve radiology information in a conversational manner. The LLM-powered chat interface not only streamlines user interactions, but also enables more intuitive and efficient navigation of complex RIS functionalities. Furthermore, the natural language processing capabilities of LLMs can be harnessed to automatically generate code snippets and database queries, accelerating RIS development and customization. Preliminary observations indicate that integrating LLMs in RIS has the potential to revolutionize user interface design, enhance system capabilities, and ultimately improve the overall user experience for radiologists and medical professionals.
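The snippet below sketches the chat-to-query idea mentioned in the abstract above (an LLM turning a natural language question into a database query); the schema, table name, and prompt format are hypothetical and not part of ChatUI-RIS.

```python
# Hypothetical sketch of prompting an LLM to draft a read-only query for a
# radiology information system; the schema below is invented for illustration.
def radiology_query_prompt(question: str) -> str:
    schema = "reports(patient_id, modality, body_part, report_text, study_date)"
    return (
        f"Schema: {schema}\n"
        f"Question: {question}\n"
        "Write a single read-only SQL query that answers the question."
    )

print(radiology_query_prompt("How many chest CT reports were filed in May 2024?"))
```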
- Research Article
- 10.1007/s10015-025-01031-3
- Jun 13, 2025
- Artificial Life and Robotics
Large Language Models (LLMs) are a type of machine learning model trained on vast amounts of natural language that have demonstrated novel capabilities in tasks such as text prediction and generation. These tasks make LLMs remarkably well suited to understanding the semantics of natural language, which in turn enables applications such as planning real-world tasks, writing code for computers, and translating between human languages. Even though LLMs could provide more flexibility in interpreting user requests and have been shown to possess some commonsense knowledge, their capabilities for translating natural language instructions into code that controls robot actions are only starting to be explored. More specifically, in this paper we are interested in the control of robots tasked with preparing cocktails. Within this context, it is assumed that the LLM has access to a repository of well-formatted recipes. This means that each recipe is written according to the following layout: a list of ingredients, followed by a description of how to prepare and mix the various items. Moreover, a set of low-level modules responsible for robot manipulation and vision-related tasks is also provided to the LLM in the shape of an application programming interface (API). Consequently, the main focus of the LLM is on generating a sequence of calls to the API, along with the right parameters, to produce the cocktail requested by users in natural language. Here, we show that it is feasible for LLMs to perform this type of translation over a small number of custom modules, and that certain techniques provide a measurable benefit to the accuracy and consistency of this task without fine-tuning. In particular, we found that an ensemble-voting strategy, where multiple trials are repeated and the most common answer is selected, increases accuracy to a certain extent. In addition, there is moderate support for the use of natural language parsing to adjust the prompt of the LLM prior to translation. Lastly, building on previous knowledge, we also provide a set of guidelines to help design prompts that improve the accuracy of the resulting sequence of actions. In general, these results suggest that while LLMs can be used as translators of robot instructions, they are best applied in conjunction with these other strategies. These findings could influence future robotics development, as they provide directions for implementing LLMs more effectively and broadening the accessibility of robotic control to users without an extensive software background.
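The ensemble-voting strategy mentioned above (repeat the translation several times and keep the most common answer) can be sketched as follows; the sampler is a stand-in for a real LLM call, and the API-call strings are invented for illustration.

```python
# Sketch of ensemble voting over repeated LLM translations; the canned sampler
# stands in for a real model call.
from collections import Counter

def ensemble_vote(sample_fn, n_trials: int = 5) -> str:
    """Run the generator n_trials times and return the most frequent output."""
    outputs = [sample_fn() for _ in range(n_trials)]
    winner, _count = Counter(outputs).most_common(1)[0]
    return winner

# Demo with canned outputs standing in for stochastic LLM samples:
canned = iter(["pour(gin); add(tonic)", "pour(gin); add(tonic)", "pour(vodka)"])
print(ensemble_vote(lambda: next(canned), n_trials=3))  # -> pour(gin); add(tonic)
```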
- Research Article
2
- 10.1145/3728947
- Jun 22, 2025
- Proceedings of the ACM on Software Engineering
The capabilities of Large Language Models (LLMs) in code generation have been extensively studied, particularly for implementing target functionalities from natural-language descriptions. As an alternative to natural language, input-output (I/O) examples provide an accessible, unambiguous, and flexible way to describe functionalities. However, their inherent diversity, opaqueness, and incompleteness impose greater challenges for understanding and implementing the target requirements. Therefore, generating code from I/O examples (i.e., example-based code generation) provides a new perspective, allowing us to additionally evaluate LLMs' capability to infer target functionalities from limited information and to process requirements in a new form. However, example-based code generation with LLMs remains largely unexplored. To fill this gap, this paper presents the first comprehensive study on example-based code generation using LLMs. To address the incorrectness caused by the incompleteness of I/O examples, we adopt an iterative evaluation framework and formalize the objective of example-based code generation as two sequential sub-objectives: generating code that conforms to the given examples and generating code that successfully implements the target functionalities from (iteratively) given examples. We assess six state-of-the-art LLMs using a new benchmark of 172 diverse target functionalities (derived from HumanEval and CodeHunt). The results demonstrate that when requirements are described using iterative I/O examples rather than natural language, the LLMs' score decreases by over 60%, indicating that example-based code generation remains challenging for the evaluated LLMs. Notably, the vast majority (over 95%) of successfully implemented functionalities are achieved in the first round of the iterations, suggesting that the LLMs struggle to effectively utilize the iteratively supplemented requirements. Furthermore, we find that combining I/O examples with even imprecise and fragmentary natural language descriptions greatly improves LLM performance, and that the selection of initial I/O examples can also influence the score, suggesting opportunities for prompt optimization. These findings highlight the importance of early prompts during interactions and offer critical insights and implications for enhancing LLM-based code generation.
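A hedged sketch of the iterative evaluation framework described above: generate a candidate from the shown I/O examples, test it against hidden examples, and feed one failing example back into the next round. The function names and toy generator are assumptions, not the paper's actual harness.

```python
# Sketch of iterative example-based code generation; generate_fn stands in for
# an LLM that returns a candidate function given (input, output) examples.
def iterative_eval(generate_fn, examples, hidden_tests, max_rounds: int = 3):
    shown = list(examples)
    for _ in range(max_rounds):
        candidate = generate_fn(shown)
        failures = [(x, y) for x, y in hidden_tests if candidate(x) != y]
        if not failures:
            return candidate, shown          # conforms to all hidden tests
        shown.append(failures[0])            # iteratively supplement the requirement
    return None, shown

# Toy demo: a "generator" that ignores the examples and always proposes abs().
fn, used_examples = iterative_eval(lambda ex: abs, [(1, 1)], [(-2, 2), (3, 3)])
print(fn is abs, len(used_examples))         # -> True 1
```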
- Research Article
- 10.2196/77561
- Jan 15, 2026
- JMIR Medical Informatics
Parkinson disease (PD) presents diagnostic challenges due to its heterogeneous motor and nonmotor manifestations. Traditional machine learning (ML) approaches have been evaluated on structured clinical variables. However, the diagnostic utility of large language models (LLMs) using natural language representations of structured clinical data remains underexplored. This study aimed to evaluate the diagnostic classification performance of multiple LLMs using natural language prompts derived from structured clinical data and to compare their performance with traditional ML baselines. We reformatted structured clinical variables from the Parkinson's Progression Markers Initiative (PPMI) dataset into natural language prompts and used them as inputs for several LLMs. Variables with high multicollinearity were removed, and the top 10 features were selected using Shapley additive explanations (SHAP)-based feature ranking. LLM performance was examined across few-shot prompting, dual-output prompting that additionally generated post hoc explanatory text as an exploratory component, and supervised fine-tuning. Logistic regression (LR) and support vector machine (SVM) classifiers served as ML baselines. Model performance was evaluated using F1-scores on both the test set and a temporally independent validation set (temporal validation set) of limited size, and repeated output generation was carried out to assess stability. On the test set of 122 participants, LR and SVM trained on the 10 SHAP-selected clinical variables each achieved a macro-averaged F1-score of 0.960 (accuracy 0.975). LLMs receiving natural language prompts derived from the same variables reached comparable performance, with the best few-shot configurations achieving macro-averaged F1-scores of 0.987 (accuracy 0.992). In the temporal validation set of 31 participants, LR maintained a macro-averaged F1-score of 0.903, whereas SVM showed substantial performance degradation. In contrast, multiple LLMs sustained high diagnostic performance, reaching macro-averaged F1-scores up to 0.968 and high recall for PD. Repeated output generation across LLM conditions produced generally stable predictions, with rare variability observed across runs. Under dual-output prompting, diagnostic performance showed a reduction relative to few-shot prompting while remaining generally stable. Supervised fine-tuning of lightweight models improved stability and enabled GPT-4o-mini to achieve a macro-averaged F1-score of 0.987 on the test set, with uniformly correct predictions observed in the small temporal validation set, which should be interpreted cautiously given the limited sample size and exploratory nature of the evaluation. This study provides an exploratory benchmark of how modern LLMs process structured clinical variables in natural language form. While several models achieved diagnostic performance comparable to LR across both the test and temporal validation datasets, their outputs were sensitive to prompting formats, model choice, and class distributions. Occasional variability across repeated output generations reflected the stochastic nature of LLMs, and lightweight models required supervised fine-tuning for stable generalization. These findings highlight the capabilities and limitations of current LLMs in handling tabular clinical information and underscore the need for cautious application and further investigation.
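A minimal sketch of the reformatting step described above, turning structured clinical variables into a natural language prompt for few-shot classification; the variable names and values below are hypothetical, not the SHAP-selected PPMI features.

```python
# Illustrative conversion of a structured record into a natural language prompt;
# feature names and values are invented, not the study's selected variables.
def row_to_prompt(row: dict, label: str | None = None) -> str:
    facts = "; ".join(f"{name.replace('_', ' ')} = {value}" for name, value in row.items())
    text = f"Patient profile: {facts}. Diagnosis (PD or control):"
    return f"{text} {label}" if label else text

# One labeled example plus one query forms a tiny few-shot prompt:
shot = row_to_prompt({"smell_test_score": 18, "tremor_score": 2.5, "age": 64}, label="PD")
query = row_to_prompt({"smell_test_score": 34, "tremor_score": 0.2, "age": 58})
print(shot + "\n" + query)
```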
- Research Article
2
- 10.1002/spe.70027
- Oct 16, 2025
- Software: Practice and Experience
Context/Problem Statement: No-Code Development Platforms (NCDPs) empower non-technical end users to build applications tailored to their specific demands without writing code. While NCDPs lower technical barriers, users still require some technical knowledge, for example, to structure process steps or define event-action rules. Large Language Models (LLMs) offer a promising way to further reduce technical requirements by supporting natural language interaction and dynamic code generation. By integrating LLMs, NCDPs can become more accessible to non-technical users, enabling application development without requiring any technical expertise. Despite growing interest in LLM-powered NCDPs, a systematic investigation of the factors influencing LLM suitability and performance remains absent. Understanding these factors is critical to effectively leveraging LLM capabilities and maximizing their impact. Objective: In this paper, we aim to investigate key factors influencing the effectiveness of LLMs in supporting end-user application development within NCDPs. Methods: We conducted comprehensive experiments evaluating four key factors, i.e., model selection, prompt language, training data background, and an error-informed few-shot setup, on the quality of generated applications. Specifically, we selected a range of LLMs based on architecture, scale, design focus, and training data, and evaluated them across four real-world smart home automation scenarios implemented on a representative open-source LLM-powered NCDP. Results: Model selection emerged as the most critical factor influencing performance. General-purpose LLMs with strong natural language understanding generally outperformed others. Prompt language effects varied by model and task complexity: original prompts worked best for advanced multilingual LLMs, whereas translation steps improved performance for lighter or less capable models. LLMs performed best when their linguistic background aligned with the prompt language. In addition, incorporating an error-informed few-shot approach enhanced LLM performance, particularly for coding-oriented and medium-performing models, though its benefits were secondary to model choice and required additional engineering effort. Conclusion: Our findings provide practical insights into how LLMs can be effectively integrated into NCDPs, informing both platform design and the selection of suitable LLMs for end-user application development.
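The error-informed few-shot setup mentioned above can be sketched as below: a previously failing generation and its error message are folded back into the next prompt. The rule syntax and error text are hypothetical, not the platform's actual DSL.

```python
# Sketch of an error-informed prompt: show the model its earlier failed output
# together with the resulting error so the next attempt can correct it.
def error_informed_prompt(task: str, failed_rule: str, error_msg: str) -> str:
    return (
        f"Task: {task}\n"
        f"Earlier attempt:\n{failed_rule}\n"
        f"It failed with: {error_msg}\n"
        "Produce a corrected event-action rule."
    )

print(error_informed_prompt(
    "Turn on the hallway light when motion is detected after sunset",
    'when motion.hallway: set light.hallway = "on"',
    "Rule rejected: missing time-of-day condition",
))
```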
- Research Article
- 10.55041/ijsrem36608
- Aug 10, 2024
- International Journal of Scientific Research in Engineering and Management
This research paper delves into the inherent vulnerabilities and potential threats posed by large language models (LLMs), focusing on their implications across diverse applications such as natural language processing and data privacy. The study aims to identify and analyze these risks comprehensively, emphasizing the importance of mitigation strategies to prevent exploitation and misuse in LLM deployments. In recent years, LLMs have revolutionized fields like automated content generation, sentiment analysis, and conversational agents, yet their immense capabilities also raise significant security concerns. Vulnerabilities such as bias amplification, adversarial attacks, and unintended data leakage can undermine trust and compromise user privacy. Through a systematic examination of these challenges, this paper proposes safeguarding measures crucial for responsibly harnessing the potential of LLMs while minimizing associated risks. It underscores the necessity of rigorous security protocols, including robust encryption methods, enhanced authentication mechanisms, and continuous monitoring frameworks. Furthermore, the research discusses regulatory implications and ethical considerations surrounding LLM usage, advocating for transparency, accountability, and stakeholder engagement in policy-making and deployment practices. By synthesizing insights from current literature and real-world case studies, this study provides a comprehensive framework for stakeholders—developers, policymakers, and users—to navigate the complex landscape of LLM security effectively. Ultimately, this research aims to inform future advancements in LLM technology, ensuring its safe and beneficial integration into various domains while mitigating potential risks to individuals and society as a whole. Keywords— Adversarial attacks on LLMs, Bias in LLMs, Data privacy in LLMs, Ethical considerations LLMs, Exploitation of LLMs, Large Language Models (LLMs), Misuse of LLMs, Mitigation strategies for LLMs, Natural Language Processing (NLP), Regulatory frameworks LLMs, Responsible deployment of LLMs, Risks of LLMs, Security implications of LLMs, Threats to LLMs, Vulnerabilities in LLMs.
- Research Article
27
- 10.1016/j.apjo.2024.100085
- Jul 1, 2024
- Asia-Pacific Journal of Ophthalmology
Understanding natural language: Potential application of large language models to ophthalmology
- Research Article
176
- 10.3390/app14052074
- Mar 1, 2024
- Applied Sciences
Natural language processing (NLP) has undergone a significant transformation in the last decade, especially in the field of language modeling. Large language models (LLMs) have achieved state-of-the-art (SOTA) performance on natural language understanding (NLU) and natural language generation (NLG) tasks by learning language representations in self-supervised ways. This paper provides a comprehensive survey capturing the progression of advances in language models. We examine different aspects of language models, which started with a few million parameters but have reached the size of a trillion in a very short time. We also look at how these LLMs transitioned from task-specific to task-independent to task-and-language-independent architectures. The paper extensively discusses the different pretraining objectives, benchmarks, and transfer learning methods used in LLMs, and examines the finetuning and in-context learning techniques used in downstream tasks. Moreover, it explores how LLMs can perform well across many domains and datasets if sufficiently trained on a large and diverse dataset. Next, it discusses how, over time, the availability of cheap computational power and large datasets has improved LLMs' capabilities and raised new challenges. As part of our study, we also inspect LLMs from the perspective of scalability to see how their performance is affected by a model's depth, width, and data size. Lastly, we provide an empirical comparison of existing trends and techniques and a comprehensive analysis of where the field of LLMs currently stands.
- Conference Article
133
- 10.1145/3510003.3510203
- May 21, 2022
Large pre-trained language models such as GPT-3 [10], Codex [11], and Google's language model [7] are now capable of generating code from natural language specifications of programmer intent. We view these developments with a mixture of optimism and caution. On the optimistic side, such large language models have the potential to improve productivity by providing an automated AI pair programmer for every programmer in the world. On the cautionary side, since these large language models do not understand program semantics, they offer no guarantees about quality of the suggested code. In this paper, we present an approach to augment these large language models with post-processing steps based on program analysis and synthesis techniques, that understand the syntax and semantics of programs. Further, we show that such techniques can make use of user feedback and improve with usage. We present our experiences from building and evaluating such a tool Jigsaw, targeted at synthesizing code for using Python Pandas API using multi-modal inputs. Our experience suggests that as these large language models evolve for synthesizing code from intent, Jigsaw has an important role to play in improving the accuracy of the systems.
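A minimal sketch of the post-processing idea described above (not Jigsaw's actual pipeline): an LLM-suggested Pandas snippet is kept only if it reproduces the expected output on a user-provided input/output example.

```python
# Keep an LLM code suggestion only if it reproduces the expected output on a
# small user-provided example; a stand-in for semantics-aware post-processing.
import pandas as pd

def passes_example(code: str, df_in: pd.DataFrame, df_expected: pd.DataFrame) -> bool:
    """Execute the suggested snippet, which must compute `result` from `df`."""
    scope = {"df": df_in.copy(), "pd": pd}
    try:
        exec(code, scope)                        # run the candidate suggestion
        return scope["result"].equals(df_expected)
    except Exception:
        return False

df = pd.DataFrame({"a": [2, 1]})
expected = pd.DataFrame({"a": [1, 2]})
suggestion = "result = df.sort_values('a').reset_index(drop=True)"
print(passes_example(suggestion, df, expected))  # -> True
```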