Abstract

Video object detection (VOD) is a challenging visual task. It is widely agreed that mining effective supporting information from correlated frames boosts model performance in VOD. In this paper, we not only improve how supporting information is found in correlated frames but also raise the quality of the features extracted from them, thereby strengthening correlated-frame fusion and improving overall performance. The feature refinement module (FRM) in our model refines features through a key–value encoding dictionary based on an even-order Taylor series, and the refined features are used to guide feature fusion at different stages. In the correlated-frame fusion stage, a generative MLP is applied in the feature aggregation module (DFAM) to fuse the refined features extracted from the correlated frames. Experiments demonstrate the effectiveness of the proposed approach: our YOLOX-based model achieves 83.3% AP50 on the ImageNet VID dataset.
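To make the two modules more concrete, the sketch below gives one plausible reading of the abstract in PyTorch: a key–value dictionary refinement whose weights keep only the even-order Taylor terms of an exponential similarity, and a generative MLP that predicts per-frame fusion weights over correlated-frame features. The class names (FRMSketch, DFAMSketch), the dictionary size, and the exact weighting scheme are illustrative assumptions, not the authors' released implementation.

```python
# Minimal sketch under stated assumptions; not the paper's official code.
import torch
import torch.nn as nn
import torch.nn.functional as F


class FRMSketch(nn.Module):
    """Refine features against a learnable key-value encoding dictionary,
    weighting entries with the even-order Taylor terms 1 + s^2/2! + s^4/4!."""

    def __init__(self, dim: int, dict_size: int = 64):
        super().__init__()
        self.keys = nn.Parameter(torch.randn(dict_size, dim))    # dictionary keys
        self.values = nn.Parameter(torch.randn(dict_size, dim))  # dictionary values
        self.proj = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (N, dim) region features from correlated frames
        s = x @ self.keys.t() / x.shape[-1] ** 0.5     # (N, dict_size) similarities
        w = 1.0 + s.pow(2) / 2.0 + s.pow(4) / 24.0     # even-order Taylor expansion
        w = w / w.sum(dim=-1, keepdim=True)            # normalize to attention weights
        refined = w @ self.values                      # read from the dictionary
        return x + self.proj(refined)                  # residual refinement


class DFAMSketch(nn.Module):
    """Fuse refined correlated-frame features with a generative MLP that
    predicts per-frame aggregation weights for the key frame."""

    def __init__(self, dim: int):
        super().__init__()
        self.weight_mlp = nn.Sequential(
            nn.Linear(2 * dim, dim), nn.ReLU(inplace=True), nn.Linear(dim, 1)
        )

    def forward(self, key_feat: torch.Tensor, ref_feats: torch.Tensor) -> torch.Tensor:
        # key_feat: (dim,) key-frame feature; ref_feats: (T, dim) correlated-frame features
        pair = torch.cat([key_feat.expand_as(ref_feats), ref_feats], dim=-1)
        w = F.softmax(self.weight_mlp(pair), dim=0)    # (T, 1) fusion weights
        return (w * ref_feats).sum(dim=0)              # aggregated support feature


if __name__ == "__main__":
    frm, dfam = FRMSketch(dim=256), DFAMSketch(dim=256)
    refined = frm(torch.randn(5, 256))       # refine 5 correlated-frame features
    fused = dfam(torch.randn(256), refined)  # aggregate them around a key-frame feature
    print(fused.shape)                       # torch.Size([256])
```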
