Abstract

Context modeling offers a promising approach to detecting confusing objects. Although FPN provides multi-scale features, these features capture limited spatial context and little semantic context. In this work, we propose an end-to-end Dilated and Deformable Feature Pyramid Network (DDFPN) that jointly extracts spatial and semantic context. For spatial context, we present Dilated and Deformable Convolution (DDC), which generates a more flexible receptive field than the conventional convolution in FPN, and we design a Multi-scale DDC module to learn features for variously deformed objects. For semantic context, we observe that it can be extracted from both features and predictions, and we design two modules to estimate the corresponding context relationships: the Cross Feature Correlation (CFC) module estimates contextual attention across features, and the Co-occurrence Inference (CI) module learns co-occurrence features from predictions. Our network can be applied to various baselines of the FPN family while keeping similar FLOPs, parameter counts, and inference speed. On the MS COCO minival and test-dev sets, experiments show that DDFPN consistently outperforms various baselines, including RetinaNet, Faster R-CNN, Mask R-CNN, and Cascade R-CNN. Ablation studies show that the two kinds of context are complementary in detecting various confusing objects.
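As background for the receptive-field claim behind DDC: a dilated convolution enlarges the receptive field without adding parameters, since a k x k kernel with dilation rate d spans an effective extent of d(k - 1) + 1. A minimal sketch of this standard formula (the function name is ours, not from the paper):

```python
def effective_kernel(k: int, d: int) -> int:
    """Effective spatial extent of a k x k kernel with dilation rate d.

    Standard dilated-convolution arithmetic, shown here only to
    illustrate how dilation widens the receptive field that the
    DDC design builds on; not code from the paper.
    """
    return d * (k - 1) + 1

# A 3x3 kernel with dilation 2 covers a 5x5 area; with dilation 3, a 7x7 area.
for d in (1, 2, 3):
    print(f"dilation {d}: covers {effective_kernel(3, d)}x{effective_kernel(3, d)}")
```

Deformable convolution generalizes this further by learning per-position sampling offsets, which is what lets DDC adapt the receptive field to object shape rather than fixing it to a regular grid.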
