Abstract

Object detection in natural scenes must contend with complex backgrounds, drastic variations in target scale, and densely distributed objects. Some algorithms improve multi-scale detection by fusing low-level and high-level features, but these methods overlook the inherent spatial properties of objects and the relationship between foreground and background. To fundamentally enhance multi-scale detection capability, we propose a depth-constrained multi-scale object detection network that learns object detection and depth estimation jointly within a unified framework. In this network, depth features are merged into the detection branch as auxiliary information and are constrained and guided to yield better spatial representations, which improves discrimination among multi-scale objects. We also introduce a novel cross-modal fusion (CmF) strategy that uses depth awareness and low-level detail cues to supplement edge information and adjust attention weight preferences, extracting complementary information from RGB and high-quality depth features to achieve better multi-modal fusion. Experimental results demonstrate that our method outperforms state-of-the-art methods on the KINS dataset, improving the AP score by 3.0% over the baseline network. Furthermore, we validate the effectiveness of the proposed method on the KITTI dataset.
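To make the fusion idea concrete, below is a minimal, illustrative PyTorch sketch of a depth-aware cross-modal fusion block in the spirit of the CmF strategy described above. The module name, channel sizes, and the exact attention formulation are assumptions for illustration only; the abstract does not specify the paper's actual CmF architecture.

```python
# Hypothetical sketch of a depth-aware cross-modal fusion block (not the paper's exact CmF design).
import torch
import torch.nn as nn


class CrossModalFusion(nn.Module):
    """Fuses RGB and depth feature maps using depth-aware attention weights (assumed design)."""

    def __init__(self, channels: int):
        super().__init__()
        # Project concatenated RGB+depth features to per-location attention weights in [0, 1].
        self.attn = nn.Sequential(
            nn.Conv2d(2 * channels, channels, kernel_size=1),
            nn.Sigmoid(),
        )
        # Refine the fused features with a 3x3 convolution.
        self.refine = nn.Conv2d(channels, channels, kernel_size=3, padding=1)

    def forward(self, rgb_feat: torch.Tensor, depth_feat: torch.Tensor) -> torch.Tensor:
        # Attention weights decide how strongly each location draws on the auxiliary depth cues.
        weights = self.attn(torch.cat([rgb_feat, depth_feat], dim=1))
        fused = rgb_feat + weights * depth_feat  # depth acts as auxiliary, gated information
        return self.refine(fused)


if __name__ == "__main__":
    rgb = torch.randn(1, 256, 64, 64)
    depth = torch.randn(1, 256, 64, 64)
    out = CrossModalFusion(256)(rgb, depth)
    print(out.shape)  # torch.Size([1, 256, 64, 64])
```

In this sketch the depth stream only modulates and supplements the RGB stream rather than replacing it, which mirrors the abstract's description of depth features serving as constrained auxiliary information for the detection branch.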
