Abstract

As an important field of computer vision, object detection has been studied extensively in recent years. However, existing object detection methods merely utilize the visual information of the image and fail to mine the high-level semantic information of the object, which leads to great limitations. To take full advantage of multi-source information, a knowledge update-based multimodal object recognition model is proposed in this paper. Specifically, our method first uses Faster R-CNN to extract region proposals from the image, and then applies a transformer-based multimodal encoder to encode the visual region features (region-based image features) and the textual features (semantic relationships between words) corresponding to the image. After that, a graph convolutional network (GCN) inference module is introduced to establish a relational network in which the nodes denote visual and textual region features and the edges represent their relationships. In addition, based on an external knowledge base, our method further enhances the region-based relationship expression capability through a knowledge update module. In summary, the proposed algorithm not only learns accurate relationships between objects in different regions of the image, but also benefits from knowledge updates drawn from an external relational database. Experimental results verify the effectiveness of the proposed knowledge update module and the independent reasoning ability of our model.
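The abstract describes a pipeline of Faster R-CNN region extraction, a multimodal encoder, and GCN-based relational reasoning. The following is a minimal sketch of such a GCN-style reasoning step, not the authors' implementation; the class name RelationGCN, the feature dimensions, and the soft-adjacency construction are illustrative assumptions.

```python
# Minimal sketch (not the paper's code): one GCN-style reasoning step over a graph
# whose nodes are visual region features and text token features, and whose edges
# encode their pairwise relationships. Names and dimensions are illustrative.
import torch
import torch.nn as nn
import torch.nn.functional as F

class RelationGCN(nn.Module):
    def __init__(self, dim: int = 768):
        super().__init__()
        self.edge_score = nn.Linear(2 * dim, 1)   # scores the relation between two nodes
        self.update = nn.Linear(dim, dim)         # transforms neighbor messages

    def forward(self, nodes: torch.Tensor) -> torch.Tensor:
        # nodes: (N, dim) concatenation of region features and text features
        n = nodes.size(0)
        pairs = torch.cat(
            [nodes.unsqueeze(1).expand(n, n, -1),
             nodes.unsqueeze(0).expand(n, n, -1)], dim=-1)
        adj = torch.softmax(self.edge_score(pairs).squeeze(-1), dim=-1)  # (N, N) soft adjacency
        return F.relu(nodes + adj @ self.update(nodes))                  # one message-passing step

# Usage: fuse 36 Faster R-CNN region features with 20 token embeddings, then reason.
regions, tokens = torch.randn(36, 768), torch.randn(20, 768)
refined = RelationGCN()(torch.cat([regions, tokens], dim=0))
```

A knowledge update module, as described above, could then refine the learned adjacency using relations retrieved from an external knowledge base before the next reasoning step.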

Highlights

  • The goal of object detection is to locate objects against a complex image (or video) background, separate the background, classify the objects, and find the objects of interest.

  • Object detection has been widely applied in many fields, such as face detection, autonomous driving, defect detection in engineering, crop disease detection, and medical image detection. With the advance of multimodal learning, multimodal object detection has gradually become a popular research field.

  • Knowledge update-based multimodal object detection addresses the limitation that the reasoning results of existing models are relatively simple, cannot capture deep relationships, and may even contradict human common sense.


Summary

Introduction

The target of object detection is to locate the object from the complex image (video) background, separate the background, classify the object, and find the object of interest. In view of the complex semantic interaction between image and text (anchoring relationships, illustration relationships, contrast relationships, and poor illustrations), traditional cross-modal retrieval mainly uses statistical analysis methods. Multimodal object detection is a branch of cross-modal image–text retrieval, whose core problems are (1) how to analyze the correlation between text and image information with machine learning models, and (2) how to establish a reasonable mapping between text and image; a classification of the current research status of deep learning-based cross-modal retrieval is shown in the figure. Compared with a single-modal encoder, the proposed multimodal encoder learns a common low-dimensional space to embed images and text so that the image–text matching objective can dig out rich feature information. To enhance the result of reasoning, our method associates each image region with an instance in the knowledge base, which helps better describe the relationship between different objects.
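To make the idea of a common low-dimensional embedding space concrete, here is a small sketch under stated assumptions; the class JointEmbedding, the helper link_to_knowledge_base, the feature dimensions, and the toy entity list are hypothetical and not taken from the paper.

```python
# Minimal sketch (assumptions, not the paper's implementation): project region and
# text features into a shared low-dimensional space so image-text matching can be
# scored by cosine similarity, then associate each region with a knowledge-base entity.
import torch
import torch.nn as nn
import torch.nn.functional as F

class JointEmbedding(nn.Module):
    def __init__(self, img_dim: int = 2048, txt_dim: int = 768, joint_dim: int = 256):
        super().__init__()
        self.img_proj = nn.Linear(img_dim, joint_dim)   # region features -> joint space
        self.txt_proj = nn.Linear(txt_dim, joint_dim)   # token features  -> joint space

    def forward(self, regions: torch.Tensor, tokens: torch.Tensor) -> torch.Tensor:
        v = F.normalize(self.img_proj(regions), dim=-1)
        t = F.normalize(self.txt_proj(tokens), dim=-1)
        return v @ t.t()   # (num_regions, num_tokens) cosine-similarity matrix

def link_to_knowledge_base(similarity: torch.Tensor, kb_entities: list[str]) -> list[str]:
    # Hypothetical helper: attach each region to its best-matching entity,
    # standing in for the paper's knowledge-base association step.
    best = similarity.argmax(dim=-1)
    return [kb_entities[i] for i in best.tolist()]

sim = JointEmbedding()(torch.randn(36, 2048), torch.randn(5, 768))
print(link_to_knowledge_base(sim, ["person", "dog", "frisbee", "park", "grass"]))
```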

Related Works
Method
Multimodal
GCN Relationship Modification
Data Set
Experimental Details
Display of Results
The Comparison Heat Map of the Generalized Category Detection of the Object
Detection of Behavioral Information
This Model Has the Function of Further Reasoning
Findings
Summary