Owing to their excellent performance and efficiency, one-stage detectors have been widely used in multimedia tasks such as temporal action detection, object tracking, and video detection. However, misalignment between the classification and regression branches limits the accuracy of these detectors. Most existing works add an auxiliary branch or adopt a specific sample assignment strategy to alleviate this problem, but with little effect. In this paper, we attribute this limited effect to incomplete interaction between the branches and propose a comprehensive Predictive Aligned Object Detector (PAOD), which better correlates the two subtasks. Specifically, PAOD achieves a better trade-off between prediction interaction and prediction specificity by adopting an Iterative Aggregation Module (IAM) and a Mutual Constraint Module (MCM). We also design an aligned label assignment with an adaptive metric and a re-weighting mechanism to further narrow the misalignment between the prediction heads. With negligible additional overhead, PAOD achieves 50.4 AP with single-model, single-scale testing on the MS-COCO benchmark, which demonstrates the effectiveness of our proposal. Notably, PAOD consistently outperforms previous state-of-the-art detectors such as ATSS (47.7 AP), BorderDet (48.0 AP), and GFL (48.2 AP) by a large margin on the COCO test-dev set, and achieves better performance than various dense detectors on the Pascal VOC and CrowdHuman datasets. Code is available at https://github.com/JunruiXiao/PAOD.