Visual Question Answering Research Articles

This paper presents a novel approach to agricultural disease diagnostics that integrates Deep Learning (DL) techniques with Visual Question Answering (VQA) systems, specifically targeting the detection of wheat rust. Wheat rust is a pervasive and destructive disease that significantly reduces wheat production worldwide. Traditional diagnostic methods often require expert knowledge and time-consuming processes, which makes rapid and accurate detection difficult. We compiled a new dataset, WheatRustDL2024 (7,998 images of healthy and infected leaves), designed for VQA in the context of wheat rust detection, and used it to obtain the initial weights on the federated learning server. The dataset comprises high-resolution images of wheat plants, annotated with detailed questions and answers about the presence, type, and severity of rust infections. The images were collected from various sources and cover a wide range of conditions under which a wheat image may be taken (different lighting, obstructions in the image, etc.), which helps the resulting model generalize. The trained model was federated using Flower. Following extensive analysis, ResNet was chosen as the central model. Our fine-tuned ResNet achieved an accuracy of 97.69% on the existing data. We also implemented BLIP (Bootstrapping Language-Image Pre-training) methods, which enable the model to understand complex visual and textual inputs and thereby improve the accuracy and relevance of the generated answers. A dual attention mechanism, combined with the BLIP techniques, allows the model to focus simultaneously on relevant image regions and pertinent parts of the question. We also created a custom VQA dataset, WheatRustVQA, containing 1,800 augmented images and their associated question-answer pairs. The model generates answers with an average BLEU score of 0.6235 on the test partition of this dataset. The federated model is lightweight and can be integrated into mobile phones, drones, and similar devices without any special hardware requirement. Our results indicate that integrating deep learning with VQA for agricultural disease diagnostics not only accelerates detection but also reduces dependency on human experts, making it a valuable tool for farmers and agricultural professionals. The approach holds promise for broader applications in plant pathology and precision agriculture, and may consequently help address food security issues.
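For readers unfamiliar with the BLEU metric reported above, the following is a minimal sketch of a sentence-level BLEU computation. The abstract does not say which BLEU variant, n-gram order, or smoothing the authors used, so the choices here (unsmoothed, up to bigrams, a single reference, whitespace tokenization) and the function names `ngrams` and `bleu` are assumptions for illustration only. The example answers are invented and are not taken from WheatRustVQA.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count all contiguous n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def bleu(candidate, reference, max_n=2):
    """Unsmoothed sentence-level BLEU against one reference (illustrative)."""
    cand = candidate.lower().split()
    ref = reference.lower().split()
    if not cand:
        return 0.0
    log_precisions = []
    for n in range(1, max_n + 1):
        cand_counts = ngrams(cand, n)
        ref_counts = ngrams(ref, n)
        total = sum(cand_counts.values())
        # Clip each candidate n-gram count by its count in the reference.
        clipped = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
        if total == 0 or clipped == 0:
            return 0.0  # no smoothing: any zero precision gives a zero score
        log_precisions.append(math.log(clipped / total))
    # Brevity penalty: penalize candidates shorter than the reference.
    bp = 1.0 if len(cand) > len(ref) else math.exp(1 - len(ref) / len(cand))
    # Geometric mean of the n-gram precisions, scaled by the penalty.
    return bp * math.exp(sum(log_precisions) / max_n)


if __name__ == "__main__":
    # Hypothetical model answer versus a hypothetical reference answer.
    print(bleu("the leaf shows stem rust", "the leaf shows yellow rust"))
```

An average score such as the one reported would then be the mean of these per-answer values over the test partition, again assuming a per-sentence rather than corpus-level computation.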

With the development of artificial intelligence and deep learning technologies, image captioning has become an important research direction at the intersection of computer vision and natural language processing. The goal of image captioning is to generate natural language descriptions of an image by understanding its content. The technology has broad application prospects in fields such as image retrieval, autonomous driving, and visual question answering. Many researchers have proposed region-based image captioning methods, which generate captions by extracting features from different regions of an image. However, these methods often rely on local features and overlook the overall scene, producing captions that lack coherence and accuracy for complex scenes. They also often fail to extract complete semantic information from visual data, which can lead to biased or incomplete captions. For these reasons, existing methods struggle to generate comprehensive and accurate captions. To fill this gap, we propose the Semantic Scenes Encoder (SSE) for image captioning. It first extracts a scene graph from the image and integrates it into the encoding of the image information. It then extracts a semantic graph from the captions and preserves the semantic information through a learnable attention mechanism, which we refer to as the dictionary. During caption generation, it combines the encoded image information with the learned semantic information to produce complete and accurate captions. To verify the effectiveness of the SSE, we tested the model on the MSCOCO dataset. The experimental results show that the SSE improves the overall quality of the captions. The improvement in scores across multiple evaluation metrics further indicates that the SSE has significant advantages when compared on the same images.
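The abstract does not specify how the learnable attention "dictionary" is built, so the following is only one plausible reading: a set of stored vectors that an image feature queries through scaled dot-product attention. The function names `softmax` and `attend`, the use of the same entries as both keys and values, and the toy numbers are all assumptions; in a trained model the dictionary entries would be learned parameters rather than fixed lists.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, dictionary):
    """Scaled dot-product attention of one query over dictionary entries.

    Each dictionary entry acts as both key and value (an assumption).
    Returns the attention weights and the weighted sum of the entries.
    """
    scale = math.sqrt(len(query))
    scores = [
        sum(q * k for q, k in zip(query, entry)) / scale
        for entry in dictionary
    ]
    weights = softmax(scores)
    dim = len(dictionary[0])
    context = [
        sum(w * entry[j] for w, entry in zip(weights, dictionary))
        for j in range(dim)
    ]
    return weights, context


if __name__ == "__main__":
    # Toy image feature and a two-entry dictionary (hypothetical values).
    image_feature = [1.0, 0.0]
    semantic_dictionary = [[1.0, 0.0], [0.0, 1.0]]
    weights, context = attend(image_feature, semantic_dictionary)
    print(weights, context)
```

In this sketch the returned context vector would be the "learned semantic information" that is combined with the encoded image features during caption generation; how the SSE actually performs that combination is not described in the abstract.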

Articles published on Visual Question Answering

Transformer Module Networks for Systematic Generalization in Visual Question Answering

Integrating Neural-Symbolic Reasoning With Variational Causal Inference Network for Explanatory Visual Question Answering

Comprehensive Visual Question Answering on Point Clouds through Compositional Scene Manipulation

UNK-VQA: A Dataset and a Probe Into the Abstention Ability of Multi-Modal Large Models

Integrating deep learning for visual question answering in Agricultural Disease Diagnostics: Case Study of Wheat Rust

I-DINO: High-Quality Object Detection for Indoor Scenes

SOVAR: System of Visual Assistance and Recognition

LRCN: Layer-residual Co-Attention Networks for Visual Question Answering

Robust visual question answering via polarity enhancement and contrast

Cross-Modal self-supervised vision language pre-training with multiple objectives for medical visual question answering

Multi-Modal Validation and Domain Interaction Learning for Knowledge-Based Visual Question Answering

A Picture May Be Worth a Hundred Words for Visual Question Answering

ChartLine: Automatic Detection and Tracing of Curves in Scientific Line Charts Using Spatial-Sequence Feature Pyramid Network

HCCL: Hierarchical Counterfactual Contrastive Learning for Robust Visual Question Answering

VG-CALF: A vision-guided cross-attention and late-fusion network for radiology images in Medical Visual Question Answering

OphGLM: An ophthalmology large language-and-vision assistant

Robust Visual Question Answering utilizing Bias Instances and Label Imbalance

Image Captioning Based on Semantic Scenes

The Analysis of Smarter Future

A robust visual question answering approach to reduce multimodal bias
