Combinatorial relational reasoning in neural networks for object detection is usually static; such networks therefore cannot selectively fuse visual information with semantic relations, which limits their performance. To address this problem, we propose a relational graph routing network (RGRN) that enables dynamic interaction between visual and semantic features. The network consists of a dynamic graph network, a dual path-sharing module, and a relational routing interaction module. First, a data-driven technique extracts the semantic information between tags from the dataset; rich semantic information is obtained by computing the similarity between tags. Second, the two types of semantic information are fused by the dynamic graph network to capture high-level semantics. The visual and semantic features are then filtered and encoded by the dual path-sharing module, yielding enhanced visual and semantic features. Finally, the relational routing interaction module dynamically fuses visual and semantic information through three units; the units and routers are densely linked to construct a routing space in which the model autonomously learns the optimal fusion path. In a series of experiments on the MS COCO dataset, RGRN achieved 54.7% box AP on object detection, 2.8% box AP higher than Cascade Mask R-CNN. The experimental results show that the routing space enables better interaction between visual and semantic information, allowing our method to outperform many state-of-the-art methods.
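The abstract does not spell out how the tag similarities are computed. A common data-driven choice in relational detection work is to estimate conditional co-occurrence probabilities from the training annotations and, separately, cosine similarity between label word embeddings; these could plausibly be the "two types of semantic information" that the dynamic graph network fuses. The sketch below illustrates that idea only: the functions `cooccurrence_adjacency` and `embedding_similarity` and the 0.1 pruning threshold are illustrative assumptions, not RGRN's published implementation.

```python
import numpy as np

def cooccurrence_adjacency(image_labels, num_classes, threshold=0.1):
    """Estimate A[i, j] ~ P(class j present | class i present) from
    per-image label sets (an assumed data-driven similarity, not RGRN's
    exact formulation), pruning weak relations below `threshold`."""
    counts = np.zeros((num_classes, num_classes))
    occur = np.zeros(num_classes)
    for labels in image_labels:
        labels = sorted(set(labels))
        for i in labels:
            occur[i] += 1
            for j in labels:
                if i != j:
                    counts[i, j] += 1
    adj = counts / np.maximum(occur[:, None], 1.0)  # row i holds P(j | i)
    adj[adj < threshold] = 0.0                      # drop noisy co-occurrences
    return adj

def embedding_similarity(label_vecs):
    """Cosine similarity between label word embeddings (e.g. GloVe vectors);
    a second, complementary source of semantic relations between tags."""
    v = label_vecs / np.linalg.norm(label_vecs, axis=1, keepdims=True)
    return v @ v.T

# Toy usage: 4 classes, 3 annotated images.
images = [[0, 1], [0, 1, 2], [2, 3]]
A_stat = cooccurrence_adjacency(images, num_classes=4)
A_emb = embedding_similarity(np.random.rand(4, 300))  # stand-in embeddings
```

Under this reading, the statistical matrix `A_stat` and the embedding matrix `A_emb` would be the two semantic graphs handed to the dynamic graph network for fusion.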