Abstract

Spatial downsampling layers are widely used in convolutional neural networks (CNNs) to downscale feature maps, enlarging receptive fields and reducing memory consumption. However, for visual recognition tasks, these layers may discard discriminative details due to improper pooling strategies. In this paper, we present a unified framework (LAN) over the common downsampling layers (e.g., average pooling, max pooling, and strided convolution) from the viewpoint of local aggregation based on importance. Within this LAN framework, we analyze the issues of these widely used pooling layers and identify the criteria for designing an effective downsampling layer. Based on this analysis, we propose a simple, general, and effective pooling operation based on local importance modeling, termed Local Importance-based Pooling (LIP). LIP enhances discriminative features during the downsampling procedure by learning adaptive importance weights conditioned on the input. To further modulate different pooling windows for more effective pooling, we present an improved version of LIP, termed LIP++, which introduces an explicit margin term and efficient logit modules. LIP++ yields consistent accuracy improvements over the original LIP at a smaller computational cost. Extensive experiments show that our LIP method consistently yields notable gains across different CNN architectures on the image classification task. On the challenging MS COCO dataset, detectors with our LIP-ResNets as backbones obtain consistent performance improvements over the vanilla ResNets on both bounding box detection and instance segmentation. Finally, we also verify the effectiveness of LIP on pose estimation and semantic segmentation, demonstrating its generalization to dense prediction tasks.
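As a rough illustration of the local-aggregation-based-on-importance idea described above, the sketch below implements importance-weighted pooling on a plain 2D array: each output value is a softmax-style weighted average of the features in a window, with weights derived from a per-location logit map. This is a minimal, dependency-free sketch, not the paper's implementation; in LIP the logits come from a learned logit module, whereas here they are simply passed in as an argument, and the window size `k` and stride `s` are illustrative defaults.

```python
import math

def lip_pool(feat, logits, k=2, s=2):
    """Importance-based pooling over k x k windows with stride s (sketch).

    feat, logits: 2D lists of equal shape. For each window, the output is
        sum(exp(logit) * feat) / sum(exp(logit)),
    i.e., a weighted average where exp(logit) acts as the importance weight.
    With all-zero logits this reduces to average pooling; as one logit grows
    large, the output approaches that location's feature (max-pooling-like).
    """
    H, W = len(feat), len(feat[0])
    out = []
    for i in range(0, H - k + 1, s):
        row = []
        for j in range(0, W - k + 1, s):
            num = den = 0.0
            for di in range(k):
                for dj in range(k):
                    w = math.exp(logits[i + di][j + dj])  # importance weight
                    num += w * feat[i + di][j + dj]
                    den += w
            row.append(num / den)
        out.append(row)
    return out
```

For example, pooling `[[1, 2], [3, 4]]` with all-zero logits gives the plain average 2.5, while a large logit at the top-left location pulls the result toward 1, showing how the learned importance can preserve a particular detail that average pooling would dilute.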
