MAIDR Meets AI: Exploring Multimodal LLM-Based Data Visualization Interpretation by and with Blind and Low-Vision Users

Abstract

This paper investigates how blind and low-vision (BLV) users interact with multimodal large language models (LLMs) to interpret data visualizations. Building upon our previous work on the multimodal access and interactive data representation (MAIDR) framework, our mixed-visual-ability team co-designed maidrAI, an LLM extension providing multiple AI responses to users’ visual queries. To explore generative AI-based data representation, we conducted user studies with 8 BLV participants, tasking them with interpreting box plots using our system. We examined how participants personalize LLMs through prompt engineering, their preferences for data visualization descriptions, and strategies for verifying LLM responses. Our findings highlight three dimensions affecting BLV users’ decision-making process: modal preference, LLM customization, and multimodal data representation. This research contributes to designing more accessible data visualization tools for BLV users and advances the understanding of inclusive generative AI applications.
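To make the interaction model concrete, below is a minimal, illustrative Python sketch of a multimodal chart query in the style the paper describes. The paper does not name a model provider; the OpenAI SDK, the gpt-4o model, the file name, and the prompt wording are all assumptions for illustration, not the authors' implementation.

import base64
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

with open("boxplot.png", "rb") as f:  # hypothetical chart image
    image_b64 = base64.b64encode(f.read()).decode("utf-8")

# A user-customized system prompt of the kind prompt engineering enables;
# this wording is invented, not taken from the study.
system_prompt = (
    "Describe charts for a screen reader user. Lead with the statistics "
    "(median, quartiles, outliers), avoid visual jargon, and keep it brief."
)

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": [
            {"type": "text", "text": "What does this box plot show?"},
            {"type": "image_url",
             "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
        ]},
    ],
)
print(response.choices[0].message.content)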

Similar Papers
  • Conference Article
  • Cited by: 29
  • 10.1145/3544548.3581532
Exploring Chart Question Answering for Blind and Low Vision Users
  • Apr 19, 2023
  • Jiho Kim + 3 more

Data visualizations can be complex or involve numerous data points, making them impractical to navigate using screen readers alone. Question answering (QA) systems have the potential to support visualization interpretation and exploration without overwhelming blind and low vision (BLV) users. To investigate if and how QA systems can help BLV users in working with visualizations, we conducted a Wizard of Oz study with 24 BLV people where participants freely posed queries about four visualizations. We collected 979 queries and mapped them to popular analytic task taxonomies. We found that retrieving value and finding extremum were the most common tasks, participants often made complex queries and used visual references, and the data topic notably influenced the queries. We compile a list of design considerations for accessible chart QA systems and make our question corpus publicly available to guide future research and development.
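For a sense of how collected queries map onto analytic task labels, a toy keyword heuristic is sketched below. The task names follow common analytic task taxonomies; the keyword lists are invented for illustration and are not the authors' coding scheme.

# Toy heuristic: map a chart query to analytic task labels by keyword match.
TASK_KEYWORDS = {
    "retrieve value": ["what is the value", "how much", "how many"],
    "find extremum": ["highest", "lowest", "maximum", "minimum", "most", "least"],
    "compare": ["difference between", "compared to", "versus"],
    "trend": ["increase", "decrease", "over time", "trend"],
}

def classify_query(query: str) -> list[str]:
    """Return every analytic task whose keywords appear in the query."""
    q = query.lower()
    return [task for task, kws in TASK_KEYWORDS.items()
            if any(kw in q for kw in kws)] or ["other"]

print(classify_query("Which month had the highest sales?"))  # ['find extremum']
print(classify_query("How much revenue did Q2 bring in?"))   # ['retrieve value']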

  • Research Article
  • Cited by: 9
  • 10.1109/tcsvt.2016.2642825
Multimodal Visual Data Registration for Web-Based Visualization in Media Production
  • Apr 1, 2018
  • IEEE Transactions on Circuits and Systems for Video Technology
  • Hansung Kim + 3 more

Recent developments in video and sensing technology have led to large volumes of digital media data. Current media production relies on videos from the principal camera together with a wide variety of heterogeneous sources of supporting data (photos, light detection and ranging point clouds, witness video cameras, high dynamic range imaging, and depth imagery). Registration of visual data acquired from various 2D and 3D sensing modalities is challenging because current matching and registration methods cannot cope with the differences in structure, format, and noise characteristics of multimodal data. A combined 2D/3D visualization of this registered data allows an integrated overview of the entire data set. For such a visualization, a Web-based context presents several advantages. In this paper, we propose a unified framework for registration and visualization of this type of visual media data. A new feature description and matching method is proposed that adaptively considers local geometry, semiglobal geometry, and color information in the scene for more robust registration. The resulting registered 2D/3D multimodal visual data are too large to be downloaded and viewed directly in a Web browser while maintaining an acceptable user experience. Thus, we employ hierarchical techniques for compression and restructuring to enable efficient transmission and visualization over the Web, leading to interactive visualization of registered point clouds, 2D images, and videos in the browser and improving on the current state-of-the-art techniques for Web-based visualization of big media data. This is the first unified 3D Web-based visualization of multimodal visual media production data sets. The proposed pipeline is tested on big multimodal data sets typical of film and broadcast production, which are made publicly available. The proposed feature description method shows two times higher precision in feature matching and more stable registration performance than existing 3D feature descriptors.

  • Research Article
  • Cited by: 5
  • 10.1098/rsta.2009.0084
Visualization of multidimensional and multimodal tomographic medical imaging data, a case study
  • Aug 13, 2009
  • Philosophical Transactions of the Royal Society A: Mathematical, Physical and Engineering Sciences
  • Yan Zhang + 2 more

Multidimensional tomographic datasets contain physical properties defined over four-dimensional (e.g. spatial-temporal, spatial-spectral), five-dimensional (e.g. spatial-temporal-spectral) or even higher-dimensional domains. Multimodal tomographic datasets contain physical properties obtained with different imaging modalities. In medicine, four-dimensional data are widely used, five-dimensional data are emerging, and multimodal data are being used more often every day. Visualization is vital for medical diagnosis and surgical planning to interpret the information included in imaging data. Visualization of multidimensional and multimodal tomographic imaging data is still a challenging task. As a case study, our work focuses on the visualization of five-dimensional (spatial-temporal-spectral) brain electrical impedance tomography (EIT) data. In this paper, a task-based subset definition scheme is proposed: a task model named Cubic Task Explorer (CTE) is derived to support the visualization task exploration for medical imaging data, and a structured method for visualization system development called Task-based Multi-Dimensional Visualization (TMDV) is proposed. A prototype system named EIT5DVis is developed using the CTE model and TMDV method to visualize five-dimensional brain EIT data.

  • Research Article
  • 10.52088/ijesty.v5i3.1562
Screen Reader AI: A Conversational Web-Accessibility Assistant for Blind and Low-Vision Users
  • Jul 28, 2025
  • International Journal of Engineering, Science and Information Technology
  • Rushilkumar Patel

Blind and low-vision users continue to face significant challenges when interacting with modern dynamic and visually complex web applications. Traditional screen readers often fall short due to the rapid changes in content, single-page applications, and intricate layouts. This paper introduces Screen Reader AI, a novel conversational web accessibility assistant implemented as a browser extension, designed to provide adaptive and context-rich support for non-visual navigation. Unlike conventional screen readers, Screen Reader AI constructs and continuously updates a live semantic scene graph by integrating the Document Object Model (DOM) and the Accessibility Object Model (AOM). Leveraging multimodal vision-language reasoning powered by GPT-4o, it generates detailed visual interpretations, detects interface structures and interactive elements, and conveys this information through natural, conversational dialogue. This approach allows users to request clarifications, discover relationships between interface components, and receive proactive notifications about dynamic content updates. The system features a modular architecture that ensures compatibility with evolving AI models and web standards, while maintaining an intuitive user interface. Core capabilities include adaptive task guidance, an interactive dashboard with contextual summaries, nested menus, live feeds, and predictive navigation assistance across diverse content types such as forms and multimedia. An evaluation framework outlines expected improvements in user experience, including reduced task completion times, enhanced understanding of page layouts, and greater autonomy during browsing. Initial findings suggest that conversational interaction can decrease cognitive load by reducing repetitive commands and streamlining information retrieval. Screen Reader AI represents a paradigm shift in digital accessibility by embedding adaptive intelligence into assistive technology, empowering independence and inclusivity while making accessibility an integral part of web innovation.
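The live semantic scene graph is built in-browser from the DOM and AOM; as a rough offline analogue, the sketch below parses static HTML into (role, name, children) nodes. The role inference is deliberately simplified, and the HTML snippet and role table are hypothetical, not the system's actual logic.

from bs4 import BeautifulSoup  # pip install beautifulsoup4

# Explicit ARIA roles win; otherwise fall back to a few tag-name defaults.
TAG_ROLES = {"button": "button", "a": "link", "nav": "navigation",
             "form": "form", "input": "textbox", "main": "main"}

def build_scene_graph(element) -> dict:
    """Recursively turn an element tree into (role, name, children) nodes."""
    role = element.get("role") or TAG_ROLES.get(element.name, "generic")
    name = element.get("aria-label") or element.get_text(strip=True)[:40]
    children = [build_scene_graph(c) for c in element.find_all(recursive=False)]
    return {"role": role, "name": name, "children": children}

html = """<main><nav aria-label="Primary"><a href="/">Home</a></nav>
<form><input aria-label="Search"><button>Go</button></form></main>"""
soup = BeautifulSoup(html, "html.parser")
print(build_scene_graph(soup.main))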

  • Book Chapter
  • Cited by: 7
  • 10.1007/978-3-031-20627-6_12
MHDML: Construction of a Medical Lakehouse for Multi-source Heterogeneous Data
  • Jan 1, 2022
  • Qi Xiao + 8 more

In the medical field, the rapid growth of medical equipment has produced a large amount of medical data with a wide range of sources and complex structures. Medical data also contain essential information that contributes to data exploration. However, existing platforms based on Data Warehouses or Data Lakes cannot effectively integrate comprehensive multi-source heterogeneous medical data or efficiently manage large-scale multi-modal medical data. This paper presents the Medical Lakehouse for Multi-source Heterogeneous Data (MHDML), a platform that sensibly combines multiple pieces of open-source software to integrate more comprehensive multi-source heterogeneous medical data. Multi-modal data fusion is a key method by which the platform improves multi-modal data management in the medical field. Finally, we customize RESTful APIs for medical data exploration tasks. On real data for sepsis and knee osteoarthritis, the platform achieves more comprehensive multi-source heterogeneous medical data acquisition and effective multi-modal medical data management, providing simple operations and visual data exploration functions for medical staff. Keywords: Medical lakehouse, Multi-source heterogeneous data, Multi-modal data management, Data exploration.
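The abstract mentions customized RESTful APIs without publishing their endpoints, so the call below is a purely hypothetical sketch of what a data exploration request against such a platform might look like; the host, path, parameters, and response shape are all invented for illustration.

import requests

BASE_URL = "http://mhdml.example.org/api"  # placeholder host, not a real endpoint

# Hypothetical query: the first 100 vitals records from a sepsis cohort.
resp = requests.get(
    f"{BASE_URL}/cohorts/sepsis/records",
    params={"modality": "vitals", "limit": 100},
    timeout=10,
)
resp.raise_for_status()
for record in resp.json():  # assumed to return a JSON list of records
    print(record["patient_id"], record["timestamp"])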

  • Research Article
  • Cited by: 5
  • 10.1080/17538947.2024.2431624
Multimodal data visualization method for digital twin campus construction
  • Nov 21, 2024
  • International Journal of Digital Earth
  • Yakun Xie + 7 more

University campuses, as distinctive yet commonplace urban environments, face similar challenges of high population density and efficient management. Integrating digital twins into smart campus construction is essential for optimizing campus management. Multimodal data integration and visualization are key to building a digital twin campus. Traditional visualization methods often lack effective inter-data relationships, limiting their capacity to support robust multimodal data integration and create a multidimensional, real-time, comprehensive management view. Therefore, this paper proposes a tailored approach for multimodal data integration and visualization in digital twin campus construction. First, a comprehensive analysis of multimodal campus data identifies spatiotemporal associations among datasets. Subsequently, visualization strategies are developed based on these data characteristics and associations, enabling the multidimensional integration of 3D models, video surveillance, and wireless sensor data. Finally, a prototype system is implemented and evaluated through a case study. The results demonstrate that the proposed method efficiently integrates 2D maps, surveillance videos, and sensor data to create a dynamic, interactive, and panoramic campus view, achieving rendering efficiency exceeding 60 frames per second. Moreover, this adaptable framework can be applied to various campus or geographic contexts, offering critical technical support for multimodal visualization in IoT and big data environments.

  • Research Article
  • Cited by: 5
  • 10.3390/computers11120182
Learning Explainable Disentangled Representations of E-Commerce Data by Aligning Their Visual and Textual Attributes
  • Dec 10, 2022
  • Computers
  • Katrien Laenen + 1 more

Understanding multimedia content remains a challenging problem in e-commerce search and recommendation applications. It is difficult to obtain item representations that capture the relevant product attributes, since these attributes are fine-grained and scattered across product images with huge visual variations and product descriptions that are noisy and incomplete. In addition, the interpretability and explainability of item representations have become more important in order to make e-commerce applications more intelligible to humans. Multimodal disentangled representation learning, where the independent generative factors of multimodal data are identified and encoded in separate subsets of features in the feature space, is an interesting research area to explore in an e-commerce context given the benefits of the resulting disentangled representations, such as generalizability, robustness, and interpretability. However, the characteristics of real-world e-commerce data, such as the extensive visual variation, noisy and incomplete product descriptions, and complex cross-modal relations of vision and language, together with the lack of an automatic interpretation method to explain the contents of disentangled representations, mean that current approaches for multimodal disentangled representation learning do not suffice for e-commerce data. Therefore, in this work, we design an explainable variational autoencoder framework (E-VAE) which leverages visual and textual item data to obtain disentangled item representations by jointly learning to disentangle the visual item data and to infer a two-level alignment of the visual and textual item data in a multimodal disentangled space. As such, E-VAE tackles the main challenges in disentangling multimodal e-commerce data. Firstly, with the weak supervision of the two-level alignment, our E-VAE learns to steer the disentanglement process towards discovering the relevant factors of variation in the multimodal data and to ignore the irrelevant visual variations which are abundant in e-commerce data. Secondly, to the best of our knowledge, our E-VAE is the first VAE-based framework with an automatic interpretation mechanism that can explain the components of the disentangled item representations with text. With our textual explanations we provide insight into the quality of the disentanglement. Furthermore, we demonstrate that with our explainable disentangled item representations we achieve state-of-the-art outfit recommendation results on the Polyvore Outfits dataset and report new state-of-the-art cross-modal search results on the Amazon Dresses dataset.

  • Research Article
  • Cited by: 136
  • 10.3389/frai.2024.1430984
Vision-language models for medical report generation and visual question answering: a review.
  • Nov 19, 2024
  • Frontiers in artificial intelligence
  • Iryna Hartsock + 1 more

Medical vision-language models (VLMs) combine computer vision (CV) and natural language processing (NLP) to analyze visual and textual medical data. Our paper reviews recent advancements in developing VLMs specialized for healthcare, focusing on publicly available models designed for medical report generation and visual question answering (VQA). We provide background on NLP and CV, explaining how techniques from both fields are integrated into VLMs, with visual and language data often fused using Transformer-based architectures to enable effective learning from multimodal data. Key areas we address include the exploration of 18 public medical vision-language datasets, in-depth analyses of the architectures and pre-training strategies of 16 recent noteworthy medical VLMs, and comprehensive discussion on evaluation metrics for assessing VLMs' performance in medical report generation and VQA. We also highlight current challenges facing medical VLM development, including limited data availability, concerns with data privacy, and lack of proper evaluation metrics, among others, while also proposing future directions to address these obstacles. Overall, our review summarizes the recent progress in developing VLMs to harness multimodal medical data for improved healthcare applications.

  • Research Article
  • Cited by: 4
  • 10.1080/01431161.2024.2403628
Incremental learning model for sustainable agricultural land assessment using multimodal satellite data
  • Oct 5, 2024
  • International Journal of Remote Sensing
  • Chatrabhuj + 1 more

The identification of agricultural lands is crucial for sustainable development in rural areas. In this paper, an augmented model is proposed that utilizes multimodal satellite-based data samples to identify agricultural lands via incremental learning. The Whale Optimization Algorithm (WOA) is used to augment the collected images and data samples, enhancing the accuracy of the model under real-time conditions. To identify agricultural lands, Deep Convolutional Neural Networks (CNNs) are trained on the augmented data samples. Additionally, Q-Learning is incorporated for continuous optimization of the model, enhancing its efficiency and effectiveness in identifying agricultural lands. The proposed model offers several advantages over existing methods. Firstly, the use of multimodal satellite-based data samples allows for a comprehensive and accurate analysis of agricultural lands. Secondly, the incorporation of the Whale Optimization Algorithm enables the augmentation of collected data samples, leading to improved accuracy and reliability of the model. Thirdly, Deep CNNs allow the extraction of complex features from the data, leading to more accurate identification of agricultural lands. Finally, the use of Q-Learning ensures that the model is continuously optimized to improve its efficiency and effectiveness. The need for this work arises from the limitations of existing methods in accurately identifying agricultural lands. Traditional methods based on manual surveys and visual interpretation are time-consuming, expensive, and prone to errors. Moreover, existing automated methods often lack the ability to analyse multimodal satellite-based data samples and fail to provide accurate results. Based on these observations, the proposed augmented model offers a promising solution for identifying agricultural lands for sustainable development. The use of multimodal satellite-based data samples, WOA, Deep CNNs, and Q-Learning allows for an accurate and efficient analysis of agricultural lands, which can aid in sustainable development planning and decision-making.
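The abstract names Q-Learning but not its state or action design, so the sketch below shows only the standard tabular Q-learning update on a toy environment, to make the mechanism concrete; how the paper maps states and actions to model tuning is not shown.

import random

n_states, n_actions = 5, 2
Q = [[0.0] * n_actions for _ in range(n_states)]
alpha, gamma, epsilon = 0.1, 0.9, 0.2  # learning rate, discount, exploration

def step(state, action):
    """Toy environment: action 1 moves right; reaching the last state pays 1."""
    nxt = min(state + action, n_states - 1)
    return nxt, 1.0 if nxt == n_states - 1 else 0.0

for _ in range(2000):
    s = random.randrange(n_states)
    a = random.randrange(n_actions) if random.random() < epsilon \
        else max(range(n_actions), key=lambda x: Q[s][x])
    s2, r = step(s, a)
    # Standard update: Q(s,a) += alpha * (r + gamma * max Q(s',.) - Q(s,a))
    Q[s][a] += alpha * (r + gamma * max(Q[s2]) - Q[s][a])

print([max(row) for row in Q])  # values rise toward the rewarding state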

  • Research Article
  • 10.21869/2223-1536-2025-15-4-137-149
The method of integrative anatomical assessment of the foot
  • Jan 28, 2026
  • Proceedings of the Southwest State University. Series: IT Management, Computer Science, Computer Engineering. Medical Equipment Engineering
  • L M Smirnova + 1 more

Keywords: foot, computer plantography, radiography, integrative research, multimodal data, orthopedics, anatomy. Introduction. Integrative diagnostic methods for the foot, combining X-ray and computer plantography data, make it possible to obtain a holistic view of the morphology, the condition of the joints, and the nature of the foot's contact with the support surface in a static position. The development of such methods is relevant for increasing the informativeness and accuracy of anatomical assessment of the foot. Purpose of research. To develop a methodology for integrative anatomical assessment of the foot. Methods. The study was performed on 50 patients aged 18-70 years who underwent computer plantography and radiography of the foot in a direct projection. Radiopaque metal markers were used for spatial alignment of the images. The plantograms were processed using previously developed software. Results. A three-stage method of integrative foot examination has been developed, comprising plantography and radiography with metal markers on the foot, image processing, and layered superposition of the images during analysis. The technique ensures accurate alignment of the images through the use of markers, as well as unified data visualization and reproducibility of the study. A set of 50 integrative foot studies was obtained. The integrative approach increases the accuracy of localization of anatomical structures and expands the possibilities of complex analysis, which is important for planning orthopedic treatment and monitoring its effectiveness. Conclusion. The proposed technique is of interest for scientific research and clinical practice because it yields a unified result from two different examinations, plantography and radiography of the foot. It can be used for in-depth analysis of structural changes in the foot and evaluation of the effectiveness of therapeutic and orthopedic interventions, and the resulting dataset of integrative results can be used in educational programs and further research. The technique also opens new perspectives for the development of artificial intelligence models for the analysis of multimodal medical data, which is especially important in the context of personalized medicine.

  • Research Article
  • 10.1177/01655515241281828
Recommending physicians with multimodal data and medical knowledge graph on healthcare platforms
  • Nov 13, 2024
  • Journal of Information Science
  • Weiwei Deng + 4 more

Healthcare platforms have attracted many physicians and provided convenient medical services to patients. However, the large number of physicians brings the difficulty of finding suitable physicians for the patients. Despite attempts to develop recommendation methods to address this challenge, they fail to leverage multimodal medical data, which contain numerical, categorical, textual and visual data valuable for inferring patients’ preferences for physicians. Besides, previous methods ignore the semantic gap between patients’ health conditions and physicians’ specialties. The conditions describe the patients’ symptoms, while the specialties indicate the diseases the physicians can treat. They have different vocabularies and cannot be directly compared for generating recommendations. We put forward an innovative physician recommendation approach to effectively address the above research gaps. Our approach entails merging multimodal data with multiple network modules and employing a medical knowledge graph to fill the semantic gap. To assess the validity of our suggested approach, we perform comprehensive trials on real-world data. The trial outcomes indicate that our approach surpasses its variants and existing methods in the aspects of HR@k, MRR@k and NDCG@k.
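HR@k, MRR@k, and NDCG@k are standard ranking measures; the functions below follow their usual single-relevant-item definitions and are not the authors' code. Each scores one ranked list of physicians against one known relevant physician.

import math

def hit_rate_at_k(ranked: list, relevant, k: int) -> float:
    """1 if the relevant item appears in the top k, else 0."""
    return 1.0 if relevant in ranked[:k] else 0.0

def mrr_at_k(ranked: list, relevant, k: int) -> float:
    """Reciprocal of the rank at which the relevant item first appears."""
    for i, item in enumerate(ranked[:k], start=1):
        if item == relevant:
            return 1.0 / i
    return 0.0

def ndcg_at_k(ranked: list, relevant, k: int) -> float:
    # With one relevant item the ideal DCG is 1, so NDCG reduces to the
    # discounted gain at the hit position.
    for i, item in enumerate(ranked[:k], start=1):
        if item == relevant:
            return 1.0 / math.log2(i + 1)
    return 0.0

ranking = ["dr_b", "dr_a", "dr_c"]
print(hit_rate_at_k(ranking, "dr_a", 3),  # 1.0
      mrr_at_k(ranking, "dr_a", 3),       # 0.5
      ndcg_at_k(ranking, "dr_a", 3))      # ~0.631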

  • Research Article
  • 10.1609/aaai.v39i27.35106
Advancements in AI for Reasoning with Complex Data
  • Apr 11, 2025
  • Proceedings of the AAAI Conference on Artificial Intelligence
  • Vivek Gupta

Artificial intelligence has made remarkable progress in reasoning over complex, structured, multimodal, and multilingual data, addressing critical challenges in domains such as finance and healthcare. This abstract underscores key advancements in tabular reasoning, temporal analysis, and structured multimodal reasoning. Key contributions include the development of TempTabQA, a benchmark for temporal question answering, along with novel methods for enhancing temporal reasoning in large language models (LLMs). Additionally, a framework for evaluating mathematical reasoning in financial documents has been introduced, establishing robust techniques for interpreting time-sensitive and quantitative data. Building on these foundations, we have developed hybrid SQL-text adaptive reasoning models (H-STAR) and knowledge-aware reasoning techniques for semi-structured tables (MMTabQA), enabling precise and efficient handling of complex queries. In the vision-language domain, our contributions include advancements in spatial reasoning for geographic data (MAPWise), methods to improve robustness in chart interpretation (FlowVQA), and evaluations of LLMs’ ability to understand visual data, such as charts. Furthermore, we have addressed challenges in multilingual and cross-modal robustness through innovations such as multilingual table synchronization (InfoSync), concurrent robustness evaluations across languages and modalities, and numerical reasoning in tabular data. Our work aims to enhance reasoning on dynamically evolving data using hybrid LLM-SQL queries, symbolic query generation, and multi-table retrieval techniques. We also plan to tackle challenges in interpreting hierarchical table structures, analyzing multiple complex chart types, and exploring diverse map types, while advancing real-world multimodal data analysis. Additionally, we plan to improve table generation in both closed/open-book scenarios and refine evaluation frameworks for structured tasks. These advancements demonstrate the potential of AI in tackling complex, multimodal data and delivering impactful real-world solutions.

  • Research Article
  • 10.3389/fpsyt.2025.1548287
Machine-learning detection of stress severity expressed on a continuous scale using acoustic, verbal, visual, and physiological data: lessons learned
  • Jun 13, 2025
  • Frontiers in Psychiatry
  • Marketa Ciharova + 14 more

Background: Early detection of elevated acute stress is necessary if we aim to reduce the consequences associated with prolonged or recurrent stress exposure. Stress monitoring may be supported by valid and reliable machine-learning algorithms. However, investigation of algorithms detecting stress severity on a continuous scale is missing due to the high demands on data quality for such analyses. Use of multimodal data, meaning data coming from multiple sources, might contribute to machine-learning stress severity detection. We aimed to detect laboratory-induced stress using multimodal data and identify challenges researchers may encounter when conducting a similar study. Methods: We conducted a preliminary exploration of the performance of a machine-learning algorithm trained on multimodal data, namely visual, acoustic, verbal, and physiological features, in its ability to detect stress severity following a partially automated online version of the Trier Social Stress Test. College students (n = 42; M age = 20.79, 69% female) completed a self-reported stress visual analogue scale at five time points: after the initial resting period (P1), during the three stress-inducing tasks (i.e., preparation for a presentation, a presentation task, and an arithmetic task; P2-4), and after a recovery period (P5). For the whole duration of the experiment, we recorded the participants' voice and facial expressions with a video camera and measured cardiovascular and electrodermal physiology with an ambulatory monitoring system. We then evaluated the performance of the algorithm in detecting stress severity using three combinations of visual, acoustic, verbal, and physiological data collected at each period of the experiment (P1-5). Results: Participants reported minimal (P1, M = 21.79, SD = 17.45) to moderate stress severity (P2, M = 47.95, SD = 15.92), depending on the period at hand. We found a very weak association between the detected and observed scores (r² = .154; p = .021). In our post-hoc analysis, we classified participants into categories of stressed and non-stressed individuals. When applying all available features (i.e., visual, acoustic, verbal, and physiological), or a combination of visual, acoustic, and verbal features, performance ranged from acceptable to good, but only for the presentation task (accuracy up to .71, F1-score up to .73). Conclusions: The complexity of the input features needed for machine-learning detection of stress severity based on multimodal data requires large samples with wide variability of stress reactions and inputs among participants. These are difficult to recruit in a laboratory setting, due to high time and effort demands on both researcher and participant. The resources needed may be decreased by automating experimental procedures, which may, however, introduce additional technological challenges, potentially causing other recruitment setbacks. Further investigation is necessary, with an emphasis on quality ground truth, i.e., gold-standard (self-report) instruments, but also outside of laboratory experiments, mainly in general populations and mental health care patients.
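As a scale model of the study's core modeling step, the sketch below regresses a continuous stress score on concatenated multimodal features and reports r². The synthetic data and the choice of ridge regression are assumptions standing in for the real acoustic, verbal, visual, and physiological features and whatever model the authors actually used.

import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 42 * 5  # 42 participants x 5 measurement periods, mirroring the design
# Four synthetic modalities of different widths, concatenated per observation.
X = np.hstack([rng.normal(size=(n, d)) for d in (8, 6, 10, 4)])
y = X[:, :3].sum(axis=1) + rng.normal(scale=2.0, size=n)  # noisy target score

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
model = Ridge(alpha=1.0).fit(X_tr, y_tr)
print("r^2 =", round(r2_score(y_te, model.predict(X_te)), 3))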

  • Conference Article
  • Cited by: 7
  • 10.1145/3555041.3589730
Demonstration of ThalamusDB: Answering Complex SQL Queries with Natural Language Predicates on Multi-Modal Data
  • Jun 4, 2023
  • Saehan Jo + 1 more

ThalamusDB supports SQL queries with natural language predicates on multi-modal data. Our data model extends the relational model and integrates multi-modal data, including visual, audio, and text data, as columns. Users can write SQL queries including predicates on multi-modal data, described in natural language. In this demonstration, we show how ThalamusDB enables users to query multi-modal data. Visitors can write their own SQL queries on two real-world data sets gathered from Craigslist and YouTube.
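The abstract does not spell out the query syntax, so the example below only illustrates the idea of an SQL query with a natural language predicate on a visual column; the NL(...) form and the Craigslist-style schema are assumptions, not verified ThalamusDB syntax.

# Hypothetical query text illustrating an NL predicate on an image column.
query = """
SELECT price, title
FROM craigslist_ads
WHERE NL(image, 'photo shows a red couch')  -- natural language predicate
  AND price < 300
ORDER BY price;
"""
print(query)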

  • Research Article
  • 10.1016/j.xgen.2025.100848
Empowering integrative and collaborative exploration of single-cell and spatial multimodal data with SGS genome browser.
  • May 1, 2025
  • Cell genomics
  • Tingting Xia + 7 more

Recent advancements in single-cell and spatial omics technologies have generated a large amount of complex, high-dimensional data, which poses significant challenges to visualization tools. We introduce SGS (single-cell and spatial genomics system), a user-friendly, collaborative, and versatile browser designed for visualizing single-cell and spatial multimodal data. SGS excels in the integrative visualization of complex multimodal data, offering an innovative genome browser, flexible visualization modes, and 3D spatially resolved transcriptomics (SRT) data visualization capabilities. Notably, SGS empowers users with advanced capabilities for comparative visualization through features like scCompare, scMultiView, and the dual-chromosome mode. It supports a variety of data formats and is compatible with established analysis tools, enabling collaborative data exploration and visualization without programming. Overall, SGS is a comprehensive browser that enables researchers to unlock novel insights from multimodal data.
