Efficient Visual Question Answering on Embedded Devices: Cross-Modality Attention With Evolutionary Quantization
Visual Question Answering (VQA) lies at the intersection of the vision and language domains, necessitating the learning of representations from multiple modalities. While model development for VQA has witnessed tremendous growth, efforts toward its deployment on embedded devices have lagged, limiting its true potential. In this work, the authors address this challenge by designing a novel hardware-friendly architecture for VQA based on the transformer model with cross-modality attention. The memory footprint of the VQA model is optimized for on-device deployment using a distributed framework for Post Training Quantization (PTQ), formulated as a Non-Linear Programming (NLP) problem. The NLP problem is solved using an evolutionary algorithm to determine a low-bit representation of the VQA model with minimal accuracy drop compared to the full-precision model. The quantized VQA model, with a marginal accuracy drop of less than 2%, achieved a 4x reduction in memory and over a 2x improvement in latency, enabling its successful deployment on a Samsung Galaxy S23 device. This comprehensive study explores the potential of the proposed generic end-to-end pipeline, from VQA model development to on-device deployment.
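The entry above formulates mixed-precision post-training quantization as a constrained optimization over per-layer bit-widths and solves it with an evolutionary algorithm. Below is a minimal sketch of such a search loop; the layer sizes, bit choices, memory budget, and proxy fitness function are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy per-layer parameter counts and candidate bit-widths (assumed values).
layer_params = np.array([1.2e6, 2.4e6, 0.8e6, 3.1e6])
bit_choices = np.array([4, 6, 8])
budget_mb = 5.0                               # target model size in MB

def model_size_mb(bits):
    return float(np.sum(layer_params * bits) / 8 / 1e6)

def proxy_accuracy_drop(bits):
    # Stand-in for the real objective (accuracy drop of the quantized model
    # measured on a calibration set): here, fewer bits -> larger drop.
    return float(np.sum(layer_params * (8 - bits) ** 2) / layer_params.sum())

def fitness(bits):
    over = max(0.0, model_size_mb(bits) - budget_mb)
    return proxy_accuracy_drop(bits) + 1e3 * over   # penalize budget violation

# Simple (mu + lambda) evolutionary search over per-layer bit assignments.
pop = rng.choice(bit_choices, size=(16, len(layer_params)))
for _ in range(50):
    scores = np.array([fitness(p) for p in pop])
    parents = pop[np.argsort(scores)[:4]]               # keep the best four
    children = parents[rng.integers(0, 4, size=12)].copy()
    mutate = rng.random(children.shape) < 0.2            # random point mutation
    children[mutate] = rng.choice(bit_choices, size=int(mutate.sum()))
    pop = np.vstack([parents, children])

best = min(pop, key=fitness)
print("per-layer bits:", best, "| size (MB):", round(model_size_mb(best), 2))
```

In the real setting the fitness would be evaluated by quantizing the actual VQA model and measuring accuracy on held-out data, which is what makes the evolutionary (derivative-free) search attractive.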
- Research Article
15
- 10.1609/aaai.v35i3.16279
- May 18, 2021
- Proceedings of the AAAI Conference on Artificial Intelligence
For the stability and reliability of real-world applications, the robustness of DNNs has been evaluated on unimodal tasks. However, few studies consider abnormal situations that a visual question answering (VQA) model might encounter at test time after deployment in the real world. In this study, we evaluate the robustness of state-of-the-art VQA models to five different anomalies, covering worst-case scenarios, the most frequent scenarios, and the current limitations of VQA models. Unlike the results on unimodal tasks, the maximum confidence of answers in VQA models cannot detect anomalous inputs, and post-training of the outputs, such as outlier exposure, is ineffective for VQA models. Thus, we propose an attention-based method that uses the confidence of reasoning between input images and questions and shows much more promising results than previous methods from unimodal tasks. In addition, we show that maximum entropy regularization of attention networks can significantly improve the attention-based anomaly detection of VQA models. Thanks to their simplicity, the attention-based anomaly detection and the regularization are model-agnostic methods that can be used with the various cross-modal attentions in state-of-the-art VQA models. The results imply that cross-modal attention in VQA is important for improving not only VQA accuracy but also robustness to various anomalies.
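The entry above scores anomalies by the confidence of the cross-modal attention and regularizes the attention with a maximum-entropy term. A minimal sketch of both quantities, assuming PyTorch and toy feature shapes; the paper's exact losses are not reproduced here.

```python
import torch
import torch.nn.functional as F

def cross_modal_attention(q_tokens, v_regions):
    # q_tokens: (Lq, d) question features, v_regions: (Lv, d) image regions.
    d = q_tokens.size(-1)
    logits = q_tokens @ v_regions.T / d ** 0.5
    return F.softmax(logits, dim=-1)               # (Lq, Lv) attention weights

def attention_confidence(attn):
    # Peaked attention suggests a normal, well-grounded input;
    # flat attention suggests an anomalous one.
    return attn.max(dim=-1).values.mean()

def entropy_regularizer(attn):
    # One common form of maximum-entropy regularization: add the negative
    # attention entropy to the training loss so that only genuinely grounded
    # rows stay peaked (assumed formulation).
    ent = -(attn * attn.clamp_min(1e-8).log()).sum(dim=-1)
    return -ent.mean()

q = torch.randn(12, 256)      # toy question token features (assumed shapes)
v = torch.randn(36, 256)      # toy image region features
attn = cross_modal_attention(q, v)
print("confidence score:", float(attention_confidence(attn)))
print("entropy term    :", float(entropy_regularizer(attn)))
```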
- Research Article
4
- 10.2298/csis201120032l
- Jan 1, 2021
- Computer Science and Information Systems
Visual Question Answering (VQA) is a multimodal research area related to Computer Vision (CV) and Natural Language Processing (NLP). How to better extract useful information from images and questions and give an accurate answer to the question is the core of the VQA task. This paper presents a VQA model based on multimodal encoders and decoders with gate attention (MEDGA). Each encoder and decoder block in MEDGA applies not only self-attention and cross-modal attention but also gate attention, so that the model can better focus on inter-modal and intra-modal interactions simultaneously within the visual and language modalities. In addition, MEDGA filters out noise irrelevant to the result via gate attention and outputs attention results closely related to the visual and language features, which makes answer prediction more accurate. Experimental evaluations on the VQA 2.0 dataset, together with ablation experiments under different conditions, prove the effectiveness of MEDGA. MEDGA reaches an accuracy of 70.11% on the test-std split, exceeding many existing methods.
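The entry above adds a gate on top of self- and cross-modal attention to filter out noisy attention output. A generic gated cross-attention block is sketched below, assuming a sigmoid gate over the attended features; it is not the MEDGA block itself.

```python
import torch
import torch.nn as nn

class GatedCrossAttention(nn.Module):
    """Cross-modal attention whose output is filtered by a learned gate.

    Generic sketch of the 'gate attention' idea: the gate is expected to
    suppress attended information that is irrelevant to the query modality.
    """
    def __init__(self, dim, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.gate = nn.Sequential(nn.Linear(2 * dim, dim), nn.Sigmoid())

    def forward(self, x, context):
        attended, _ = self.attn(x, context, context)     # cross attention
        g = self.gate(torch.cat([x, attended], dim=-1))  # per-feature gate
        return x + g * attended                          # gated residual

block = GatedCrossAttention(dim=256)
question = torch.randn(2, 14, 256)   # (batch, tokens, dim), toy shapes
regions = torch.randn(2, 36, 256)
print(block(question, regions).shape)   # torch.Size([2, 14, 256])
```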
- Conference Article
1
- 10.1117/12.2588837
- Jan 20, 2021
In order to answer semantically complicated questions about an image, a Visual Question Answering (VQA) model needs to fully understand the visual scene in the image, especially the dynamic interactions between different objects. This task inherently requires reasoning about the visual relationships among the objects in the image, and the visual reasoning process should be guided by the information in the question. In this paper, we propose a semantic relation graph reasoning network in which the semantic relation reasoning process is guided by a cross-modal attention mechanism. In addition, a Gated Graph Convolutional Network (GGCN), constructed from the cross-modal attention weights, injects the semantic interaction information between objects into their visual features, producing relation-aware features. In particular, we train a semantic relationship detector to extract the semantic relationships between objects for constructing the semantic relation graph. Experiments demonstrate that the proposed model outperforms most state-of-the-art methods on the VQA v2.0 benchmark.
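The entry above builds a gated graph convolution over image objects whose edges come from cross-modal attention weights. A simplified, question-guided gated graph-convolution step is sketched below; the affinity computation and gating are illustrative assumptions, not the paper's GGCN.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GatedGraphConv(nn.Module):
    """One question-guided, gated graph-convolution step over image objects.

    Illustrative only: the paper's cross-modal attention over object pairs is
    simplified here to object-object affinities modulated by the pooled
    question vector.
    """
    def __init__(self, dim):
        super().__init__()
        self.msg = nn.Linear(dim, dim)
        self.gate = nn.Sequential(nn.Linear(2 * dim, dim), nn.Sigmoid())

    def forward(self, objects, question):
        # objects: (N, d) object features, question: (d,) pooled question.
        guided = objects * question                      # question guidance
        edges = F.softmax(guided @ objects.T / objects.size(-1) ** 0.5, dim=-1)
        messages = edges @ self.msg(objects)             # aggregate neighbours
        g = self.gate(torch.cat([objects, messages], dim=-1))
        return g * messages + (1 - g) * objects          # gated update

layer = GatedGraphConv(dim=128)
out = layer(torch.randn(36, 128), torch.randn(128))
print(out.shape)   # torch.Size([36, 128])
```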
- Conference Article
7
- 10.1109/dicta52665.2021.9647287
- Nov 1, 2021
While querying of structured geo-spatial data such as Google Maps has become commonplace, there remains a wealth of unstructured information in overhead imagery that is largely inaccessible to users. This information can be made accessible using machine learning for Visual Question Answering (VQA) about remote sensing imagery. We propose a novel method for Earth observation that answers natural language questions about satellite images using cross-modal attention between image objects and text. The image is encoded with an object-centric feature space, with self-attention between objects, and the question is encoded with a language transformer network. The image and question representations are fed to a cross-modal transformer network that uses cross-attention between the image and text modalities to generate the answer. Our method is applied to the RSVQA remote sensing dataset and achieves a significant accuracy increase over the previous benchmark.
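The entry above encodes image objects with self-attention, encodes the question with a language transformer, and fuses them with cross-attention to predict the answer. A schematic pipeline sketch follows; the layer widths, vocabulary size, and answer count are placeholders, not the paper's configuration.

```python
import torch
import torch.nn as nn

class CrossModalVQA(nn.Module):
    """Object self-attention + question encoder + cross-attention answer head.

    Schematic sketch only; hyperparameters are assumed, not from the paper.
    """
    def __init__(self, dim=256, n_answers=100, vocab=5000):
        super().__init__()
        self.obj_enc = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True), 2)
        self.embed = nn.Embedding(vocab, dim)
        self.txt_enc = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True), 2)
        self.cross = nn.MultiheadAttention(dim, 8, batch_first=True)
        self.head = nn.Linear(dim, n_answers)

    def forward(self, obj_feats, question_ids):
        v = self.obj_enc(obj_feats)                 # self-attention over objects
        q = self.txt_enc(self.embed(question_ids))  # language transformer
        fused, _ = self.cross(q, v, v)              # text attends to image
        return self.head(fused.mean(dim=1))         # pooled answer logits

model = CrossModalVQA()
logits = model(torch.randn(2, 36, 256), torch.randint(0, 5000, (2, 12)))
print(logits.shape)   # torch.Size([2, 100])
```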
- Research Article
15
- 10.1016/j.jag.2023.103427
- Jul 28, 2023
- International Journal of Applied Earth Observation and Geoinformation
Improving visual question answering for remote sensing via alternate-guided attention and combined loss
- Conference Article
27
- 10.18653/v1/p19-1351
- Jan 1, 2019
Paragraph-style image captions describe diverse aspects of an image, as opposed to the more common single-sentence captions that only provide an abstract description of the image. These paragraph captions can hence contain substantial information about the image for tasks such as visual question answering. Moreover, this textual information is complementary to the visual information present in the image, because it can discuss both more abstract concepts and more explicit, intermediate symbolic information about objects, events, and scenes that can be directly matched with the textual question and copied into the textual answer (i.e., via an easier modality match). Hence, we propose a combined Visual and Textual Question Answering (VTQA) model which takes as input a paragraph caption as well as the corresponding image, and answers the given question based on both inputs. In our model, the inputs are fused to extract related information by cross-attention (early fusion), then fused again in the form of consensus (late fusion), and finally expected answers are given an extra score to enhance their chance of selection (later fusion). Empirical results show that paragraph captions, even when automatically generated (via an RL-based encoder-decoder model), help correctly answer more visual questions. Overall, our joint model, when trained on the Visual Genome dataset, significantly improves VQA performance over a strong baseline model.
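The entry above ends with a consensus over the two branches' answer scores plus an extra score for expected answers. A toy sketch of that late/'later' fusion step, where the bonus value and the expected answer indices are placeholders rather than the paper's exact scheme:

```python
import torch
import torch.nn.functional as F

def fuse_answer_scores(visual_logits, text_logits, expected_ids, bonus=0.5):
    """Average the two answer distributions (consensus), then add an extra
    score to answers expected from the paragraph caption. Illustrative only."""
    consensus = (F.softmax(visual_logits, -1) + F.softmax(text_logits, -1)) / 2
    consensus[:, expected_ids] += bonus          # 'later fusion' score boost
    return consensus.argmax(dim=-1)

vis = torch.randn(2, 100)     # toy answer logits from the visual branch
txt = torch.randn(2, 100)     # toy answer logits from the caption branch
print(fuse_answer_scores(vis, txt, expected_ids=torch.tensor([7, 42])))
```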
- Conference Article
209
- 10.1109/cvpr.2018.00640
- Jun 1, 2018
Visual question answering (VQA) and visual question generation (VQG) are two trending topics in computer vision, but they are usually explored separately despite their intrinsic complementary relationship. In this paper, we propose an end-to-end unified model, the Invertible Question Answering Network (iQAN), which introduces question generation as a dual task of question answering to improve VQA performance. With our proposed invertible bilinear fusion module and parameter sharing scheme, iQAN can accomplish VQA and its dual task VQG simultaneously. By jointly training on the two tasks with our proposed dual regularizers (termed Dual Training), our model gains a better understanding of the interactions among images, questions, and answers. After training, iQAN can take either a question or an answer as input and output its counterpart. Evaluated on the CLEVR and VQA2 datasets, iQAN improves the top-1 accuracy of the prior-art MUTAN VQA method by 1.33% and 0.88% (absolute increase), respectively. We also show that our proposed dual training framework can consistently improve the performance of many popular VQA architectures.
- Conference Article
6
- 10.1109/wacv56688.2023.00436
- Jan 1, 2023
The current success of modern visual reasoning systems is arguably attributed to cross-modality attention mechanisms. However, in deliberative reasoning such as VQA, attention is unconstrained at each step and thus may serve as a statistical pooling mechanism rather than a semantic operation intended to select information relevant to inference. This is because, at training time, attention is only guided by a very sparse signal (i.e., the answer label) at the end of the inference chain. This causes the cross-modality attention weights to deviate from the desired visual-language bindings. To rectify this deviation, we propose to guide the attention mechanism using explicit linguistic-visual grounding. This grounding is derived by connecting structured linguistic concepts in the query to their referents among the visual objects. Here we learn the grounding from the pairing of questions and images alone, without the need for answer annotation or external grounding supervision. This grounding guides the attention mechanism inside VQA models through a duality of mechanisms: pre-training the attention weight calculation and directly guiding the weights at inference time on a case-by-case basis. The resulting algorithm is capable of probing attention-based reasoning models, injecting relevant associative knowledge, and regulating the core reasoning process. This scalable enhancement improves the performance of VQA models, fortifies their robustness to limited access to supervised data, and increases interpretability.
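The entry above guides cross-modality attention with a linguistic-visual grounding signal learned from question-image pairs alone. One plausible form of such guidance is a divergence penalty pulling the model's attention toward the grounding distribution; the sketch below assumes that form and is not the paper's exact mechanism.

```python
import torch
import torch.nn.functional as F

def grounding_guidance_loss(attn, grounding):
    """KL-style penalty pulling cross-modal attention toward an external
    linguistic-visual grounding distribution (assumed formulation)."""
    attn = attn.clamp_min(1e-8)
    grounding = grounding.clamp_min(1e-8)
    return F.kl_div(attn.log(), grounding, reduction="batchmean")

attn = F.softmax(torch.randn(12, 36), dim=-1)      # model attention (toy)
ground = F.softmax(torch.randn(12, 36), dim=-1)    # derived grounding (toy)
print(float(grounding_guidance_loss(attn, ground)))
```

Such a term could be used either as a pre-training objective for the attention weights or as a correction applied per example at inference, matching the two usages described in the abstract.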
- Conference Article
31
- 10.1109/isscc42615.2023.10067842
- Feb 19, 2023
Human perception is multimodal and able to comprehend a mixture of vision, natural language, speech, etc. Multimodal Transformer (MulT, Fig. 16.1.1) models introduce a cross-modal attention mechanism to vanilla transformers to learn from different modalities, achieving excellent results on multimodal AI tasks like video question answering and multilingual image retrieval. Transformers require specialized hardware for efficient inference [1]. Prior work demonstrates that a Compute-In-Memory (CIM) accelerator with attention sparsity can efficiently process vanilla transformers [2]. Multimodal signals like video and audio exhibit diverse token significance, providing new opportunities for token sparsity via runtime pruning [3]. Additionally, activation functions like GELU and softmax produce many near-zero values that expose bit sparsity in the most-significant bits (MSBs). In utilizing attention-token-bit hybrid sparsity, there are three challenges: 1) For attention sparsity, irregular patterns result in long reuse distances, which requires the CIM to hold infrequently used weights, lowering CIM utilization. 2) Although token sparsity reduces computation, MulT's cross-modal attention processes tokens from two modalities with different token lengths (N) and embedding dimensionalities (d_m), causing high latency in the cross-modal switch. 3) At the bit level, since token sparsity reduces value locality, a CIM macro has more variance in effective bitwidth for the same group of inputs; in a conventional CIM's bit-serial MAC scheme, computation time is defined by the longest bitwidth.
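The entry above exploits the fact that softmax and GELU produce many near-zero activations whose most-significant bits are zero, so a bit-serial MAC can skip them. A toy numeric illustration of this effective-bitwidth sparsity; the token count and the UINT8 scaling are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def effective_bits(x_q):
    # Number of significant bits per quantized value: leading zero bits in
    # the MSBs contribute nothing in a bit-serial MAC and can be skipped.
    return np.where(x_q == 0, 0, np.floor(np.log2(np.maximum(x_q, 1))) + 1)

# Softmax over a long token axis produces many near-zero activations.
logits = rng.normal(size=196)
probs = np.exp(logits) / np.exp(logits).sum()
q = np.round(probs / probs.max() * 255).astype(np.int64)   # toy UINT8 scale

bits = effective_bits(q)
print("mean effective bitwidth:", round(float(bits.mean()), 2), "of 8")
print("fraction needing <= 4 bits:", float((bits <= 4).mean()))
```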
- Research Article
20
- 10.1109/tcsvt.2022.3212463
- Mar 1, 2023
- IEEE Transactions on Circuits and Systems for Video Technology
Spatiotemporal attention learning remains a challenging video question answering (VideoQA) task, as it requires a sufficient understanding of cross-modal spatiotemporal information. Existing methods usually leverage different cross-modal attention mechanisms to reveal potential associations between the video and the question. While these methods effectively remove irrelevant information from the spatiotemporal attention, they ignore the pseudo-related information within the cross-modal interaction attention. To address this problem, we propose a novel energy-based refined-attention mechanism (ERM). ERM leverages the significant difference distribution, derived from question-guided cross-modal interaction information, as a discriminative criterion to distinguish question-related from question-irrelevant cross-modal interaction information. Specifically, the importance of a neuron is confirmed by measuring the linear separability between the target neuron and the other neurons in the network. In addition, to address the statistical bias caused by differences between modalities in video tasks, the proposed ERM has learnable parameters through which the correlation between modalities can be learned adaptively. The advantages of the proposed ERM are that it is more flexible and modular while remaining lightweight. With the help of the ERM, we construct a lightweight VideoQA model that efficiently integrates cross-modal feature representations in an energy-based manner. To evaluate the effectiveness of our method, we carried out extensive experiments on five publicly available datasets and compared it with state-of-the-art VideoQA methods. The results demonstrate that our method brings a noticeable performance improvement over state-of-the-art VideoQA methods, and ERM can be flexibly integrated into different VideoQA methods to improve their question-answering performance.
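The entry above derives neuron importance from an energy measuring how linearly separable a neuron's activation is from its peers, with learnable parameters balancing the modalities. The sketch below uses a SimAM-style closed-form energy with a learnable lambda as a stand-in; it is not the paper's exact ERM formulation.

```python
import torch
import torch.nn as nn

class EnergyImportance(nn.Module):
    """Per-neuron importance from an energy measuring how separable each
    activation is from the others along the token axis (SimAM-style closed
    form). The learnable lambda stands in for the entry's adaptive
    cross-modal balance; this is a sketch, not the paper's exact ERM."""
    def __init__(self):
        super().__init__()
        self.lam = nn.Parameter(torch.tensor(1e-4))

    def forward(self, x):
        # x: (batch, tokens, dim) fused cross-modal features.
        mu = x.mean(dim=1, keepdim=True)
        var = ((x - mu) ** 2).mean(dim=1, keepdim=True)
        inv_energy = (x - mu) ** 2 / (4 * (var + self.lam)) + 0.5
        return x * torch.sigmoid(inv_energy)   # low energy -> high importance

layer = EnergyImportance()
print(layer(torch.randn(2, 20, 256)).shape)    # torch.Size([2, 20, 256])
```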
- Conference Article
2
- 10.1109/lifetech48969.2020.1570619128
- Mar 1, 2020
We propose a model for free-form visual question answering (VQA) from human brain activity. The VQA task is to produce an answer given an image and a question about that image. Given brain activity data measured by functional magnetic resonance imaging (fMRI) and a natural language question about the viewed image, the proposed method can provide an accurate natural language answer using the VQA algorithm. Visual questions selectively target various areas of an image, such as objects and backgrounds; as a result, a more detailed understanding of the image and more complex reasoning are typically needed than in general image captioning models. In this paper, we propose a method for answering a given question about a viewed image from fMRI data based on the VQA algorithm. We estimate the relation between fMRI data and visual features extracted from the viewed images, and based on this relationship we convert fMRI data into visual features. Finally, the proposed method can answer a visual question from fMRI data measured while subjects are viewing images. Experimental results show that the proposed method enables accurate answering of questions about viewed images.
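The entry above first estimates a mapping from fMRI responses to visual features of the viewed images, then hands the predicted features to a standard VQA model. A minimal least-squares version of that mapping; the data shapes and the plain linear form are assumptions for illustration.

```python
import numpy as np
from numpy.linalg import lstsq

rng = np.random.default_rng(0)

# Toy training data: fMRI voxel responses and visual features (e.g., CNN
# activations) for the same viewed images. Shapes are assumed.
fmri_train = rng.normal(size=(200, 1000))      # (trials, voxels)
feat_train = rng.normal(size=(200, 512))       # (trials, visual feature dim)

# Estimate the fMRI -> visual-feature mapping by least squares.
W, *_ = lstsq(fmri_train, feat_train, rcond=None)

# At test time, convert new fMRI data into predicted visual features and
# feed them to an ordinary VQA model in place of image-derived features.
fmri_test = rng.normal(size=(1, 1000))
predicted_visual_features = fmri_test @ W       # (1, 512)
print(predicted_visual_features.shape)
```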
- Research Article
12
- 10.1145/3534619
- Jul 4, 2022
- Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies
Visual Question Answering (VQA) is a relatively new task where a user can ask a natural question about an image and obtain an answer. VQA is useful for many applications and is widely popular among users with visual impairments. Our goal is to design a VQA application that works efficiently on mobile devices without requiring cloud support. Such a system allows users to ask visual questions privately, without having to send their questions to the cloud, while also reducing cloud communication costs. However, existing VQA applications use deep learning models that significantly improve accuracy but are computationally heavy. Unfortunately, existing techniques that optimize deep learning for mobile devices cannot be applied to VQA because the VQA task is multi-modal: it requires processing both vision and text data. Existing mobile optimizations that work for vision-only or text-only neural networks cannot be applied here because of the dependencies between the two modalities. Instead, we design MobiVQA, a set of optimizations that leverage the multi-modal nature of VQA. We show, using extensive evaluation on two VQA testbeds and two mobile platforms, that MobiVQA significantly improves latency and energy with minimal accuracy loss compared to state-of-the-art VQA models. For instance, MobiVQA can answer a visual question in 163 milliseconds on the phone, compared to the over 20-second latency incurred by the most accurate state-of-the-art model, while incurring less than a 1-point reduction in accuracy.
- Book Chapter
1
- 10.1007/978-3-030-38445-6_8
- Jan 1, 2020
This paper delineates the automation of question generation as an extension to existing Visual Question Answering (VQA) systems. Through our research, we have built a system that can generate question and answer pairs on images. It consists of two separate modules: a Visual Question Generation (VQG) module, which generates questions based on the image, and a Visual Question Answering (VQA) module, which produces a befitting answer to the question that the VQG module generates. Through our approach, we not only generate questions but also evaluate the generated questions using a question answering system. Moreover, with our methodology, we can generate question-answer pairs as well as improve the performance of VQA models. It eliminates the need for human intervention in dataset annotation and also finds applications in the educational sector, where human input for textual questions has been essential until now. Using our system, we aim to provide an interactive interface that helps the learning process among young children.
- Conference Article
90
- 10.1109/iccv.2019.00592
- Oct 1, 2019
Exploiting relationships between visual regions and question words has achieved great success in learning multi-modality features for Visual Question Answering (VQA). However, we argue that existing methods mostly model relations between individual visual regions and words, which is not enough to correctly answer the question. From a human perspective, answering a visual question requires understanding summarizations of the visual and language information. In this paper, we propose the Multi-modality Latent Interaction module (MLI) to tackle this problem. The proposed module learns the cross-modality relationships between latent visual and language summarizations, which condense the visual regions and the question into a small number of latent representations to avoid modeling uninformative individual region-word relations. The cross-modality information between the latent summarizations is propagated to fuse valuable information from both modalities and is used to update the visual and word features. Such MLI modules can be stacked over several stages to model complex and latent relations between the two modalities, achieving highly competitive performance on the public VQA benchmarks VQA v2.0 and TDIUC. In addition, we show that the performance of our method can be significantly improved by combining it with the pre-trained language model BERT.
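The entry above summarizes many region and word features into a small set of latent vectors and propagates information between the two modalities' latents. A generic sketch using learnable latent queries and attention pooling; the MLI module's exact design is not reproduced here.

```python
import torch
import torch.nn as nn

class LatentSummarizer(nn.Module):
    """Summarize region/word features into a few latent vectors via attention
    over learnable latent queries, then let the two modalities' latents
    exchange information. Generic sketch of the latent-interaction idea."""
    def __init__(self, dim=256, n_latents=4):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(n_latents, dim))
        self.pool = nn.MultiheadAttention(dim, 4, batch_first=True)
        self.exchange = nn.MultiheadAttention(dim, 4, batch_first=True)

    def summarize(self, feats):
        q = self.queries.unsqueeze(0).expand(feats.size(0), -1, -1)
        latents, _ = self.pool(q, feats, feats)            # attention pooling
        return latents                                     # (B, n_latents, dim)

    def forward(self, regions, words):
        v_lat, q_lat = self.summarize(regions), self.summarize(words)
        v_updated, _ = self.exchange(v_lat, q_lat, q_lat)  # cross-modal latents
        return v_updated

mli = LatentSummarizer()
print(mli(torch.randn(2, 36, 256), torch.randn(2, 14, 256)).shape)
```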
- Conference Article
555
- 10.1109/cvpr.2018.00380
- Jun 1, 2018
The study of algorithms to automatically answer visual questions is currently motivated by visual question answering (VQA) datasets constructed in artificial VQA settings. We propose VizWiz, the first goal-oriented VQA dataset arising from a natural VQA setting. VizWiz consists of over 31,000 visual questions originating from blind people who each took a picture using a mobile phone and recorded a spoken question about it, together with 10 crowdsourced answers per visual question. VizWiz differs from many existing VQA datasets because (1) images are captured by blind photographers and so are often of poor quality, (2) questions are spoken and so are more conversational, and (3) visual questions often cannot be answered. Evaluation of modern algorithms for answering visual questions and deciding whether a visual question is answerable reveals that VizWiz is a challenging dataset. We introduce this dataset to encourage a larger community to develop more generalized algorithms that can assist blind people.