Abstract

Deep convolutional neural networks (CNNs) are effective at automatically extracting features for video object detection. The shallow and deep features extracted by a CNN differ: shallow features carry low-level semantic information, while deep features contain high-level semantic information. In this paper, we propose an effective feature fusion method, multi-level feature aggregation (MFA), which connects the output of each stage to the inputs of the other stages and combines the outputs of all stages at the last layer of the network. This architecture effectively combines shallow and deep features, enhancing feature representation and recognition accuracy. MFA is a flexible, end-to-end network. Our experiments show that MFA achieves significant accuracy, measured in mAP, on the DET and VID object detection datasets.
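
To make the aggregation idea concrete, the sketch below shows one plausible reading of the abstract: each stage receives projected copies of the earlier stages' outputs, and every stage's output is fused at the final layer. All module names, channel sizes, and the choice of 1x1 projections with bilinear resizing are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class MFABackbone(nn.Module):
    """Minimal sketch of multi-level feature aggregation (assumed design).

    - Each stage's output is routed (via an assumed 1x1 projection and
      resize) into the inputs of the later stages.
    - The outputs of all stages are concatenated and fused at the end.
    """

    def __init__(self, channels=(64, 128, 256, 512)):
        super().__init__()
        self.stages = nn.ModuleList()
        in_ch = 3
        for out_ch in channels:
            self.stages.append(nn.Sequential(
                nn.Conv2d(in_ch, out_ch, 3, stride=2, padding=1),
                nn.BatchNorm2d(out_ch),
                nn.ReLU(inplace=True),
            ))
            in_ch = out_ch
        # 1x1 projections that carry stage i's output into each later stage j
        self.skip_proj = nn.ModuleList([
            nn.ModuleList([nn.Conv2d(channels[i], channels[j], 1)
                           for j in range(i + 1, len(channels))])
            for i in range(len(channels))
        ])
        # Fuse all stage outputs at the last layer
        self.fuse = nn.Conv2d(sum(channels), channels[-1], 1)

    def forward(self, x):
        outputs = []
        for i, stage in enumerate(self.stages):
            x = stage(x)
            # Add resized, projected features from every earlier stage
            for j, feat in enumerate(outputs):
                proj = self.skip_proj[j][i - j - 1](feat)
                x = x + F.interpolate(proj, size=x.shape[-2:],
                                      mode="bilinear", align_corners=False)
            outputs.append(x)
        # Combine the outputs of all stages at the final resolution
        target = outputs[-1].shape[-2:]
        fused = torch.cat(
            [F.interpolate(o, size=target, mode="bilinear",
                           align_corners=False) for o in outputs], dim=1)
        return self.fuse(fused)


# Usage example (assumed input size):
# model = MFABackbone()
# feats = model(torch.randn(1, 3, 224, 224))  # -> shape [1, 512, 14, 14]
```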
