Abstract

Video contains context-related temporal and spatial information, and making better use of this information is key to improving the accuracy and speed of video object detection. Processing spatial and temporal information jointly with 3D convolution outperforms processing spatial information first and the temporal dimension afterwards, for two reasons. First, 3D convolution processes spatial and temporal information at the same time, so feature extraction does not split them into two separate stages. Second, the human visual mechanism memorizes multiple frames and processes a video as a whole. Training and testing on the NFS dataset verify that 3D convolution has a stronger ability to extract temporal and spatial information. In addition, this paper proposes a down-sampling method that connects low-resolution and high-resolution layers, which effectively improves the FPN model and enriches its high-level feature representation. Training and testing with different numbers of down-sampling modules on the MOT16 dataset verify the performance improvement contributed by the down-sampling module. Finally, this paper combines the down-sampling module with the 3D network and obtains the best test results on the NFS dataset.
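To illustrate the first point, the sketch below implements a naive single-channel 3D convolution over a (T, H, W) video clip in NumPy. It is a minimal, hypothetical example (not the paper's network): one 3D kernel mixes the temporal and spatial neighbourhoods of each position in a single operation, rather than filtering each frame spatially and aggregating over time afterwards. The kernel size and clip dimensions are arbitrary choices for the demonstration.

```python
import numpy as np

def conv3d_valid(video, kernel):
    """Naive 'valid' 3D convolution over a (T, H, W) clip.

    A single 3D kernel covers a temporal-spatial neighbourhood, so
    temporal and spatial information are processed jointly instead
    of being split into two stages.
    """
    kt, kh, kw = kernel.shape
    T, H, W = video.shape
    out = np.zeros((T - kt + 1, H - kh + 1, W - kw + 1))
    for t in range(out.shape[0]):
        for i in range(out.shape[1]):
            for j in range(out.shape[2]):
                # One multiply-accumulate over a (kt, kh, kw) block:
                # frames and pixels are mixed in the same sum.
                out[t, i, j] = np.sum(video[t:t+kt, i:i+kh, j:j+kw] * kernel)
    return out

# Toy example: a 4-frame clip of 5x5 frames, with a 2x3x3 averaging kernel.
clip = np.arange(4 * 5 * 5, dtype=float).reshape(4, 5, 5)
kernel = np.ones((2, 3, 3)) / 18.0
feat = conv3d_valid(clip, kernel)
print(feat.shape)  # (3, 3, 3)
```

In a real detector this loop would be replaced by an optimized 3D convolution layer with learned kernels and multiple channels, but the receptive-field structure is the same.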
