Scale variation is one of the primary challenges in object detection. Recent detectors address it with a range of strategies and achieve promising performance, but limitations remain. On the one hand, features from the large-scale deep layers have relatively low localizing power. On the other hand, features from the small-scale shallow layers have relatively weak categorizing ability. These two limitations are complementary: each kind of feature can supply what the other lacks. We therefore propose the Stacked Pyramid Attention Network (SPANet) to bridge the gap between different scales. SPANet introduces two lightweight modules, the top-down feature map attention module (TDFAM) and the bottom-up feature map attention module (BUFAM). By learning channel attention and spatial attention, each module builds connections between features from adjacent scales. Progressively integrating BUFAM and TDFAM into two encoder–decoder structures yields two novel feature aggregating branches. These branches combine the localizing power of small-scale features with the categorizing power of large-scale features, improving detection accuracy while keeping the model lightweight. Extensive experiments on two challenging benchmarks, PASCAL VOC and MS COCO, demonstrate the effectiveness of SPANet and show that it reaches a competitive trade-off between accuracy and speed.
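
The abstract does not specify the internals of TDFAM or BUFAM, so the following is only a generic illustration of how channel attention and spatial attention computed from one pyramid level might gate the features of an adjacent level. It is not the SPANet implementation. All function names are hypothetical, the attention is parameter-free (a learned module would use convolutional or fully connected layers instead of plain pooling), and both levels are assumed to share the same channel count with a 2x difference in resolution.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def channel_attention(feat):
    """Per-channel weights in (0, 1) from global average pooling. feat: (C, H, W)."""
    pooled = feat.mean(axis=(1, 2))            # shape (C,)
    return sigmoid(pooled)[:, None, None]      # shape (C, 1, 1), broadcastable

def spatial_attention(feat):
    """Per-location weights in (0, 1) from the channel-wise mean. feat: (C, H, W)."""
    pooled = feat.mean(axis=0, keepdims=True)  # shape (1, H, W)
    return sigmoid(pooled)

def upsample2x(feat):
    """Nearest-neighbour 2x upsampling along the height and width axes."""
    return feat.repeat(2, axis=1).repeat(2, axis=2)

def top_down_attention(shallow, deep):
    """Hypothetical top-down step (assumption, not the paper's TDFAM):
    attention derived from the deeper map gates the shallower map."""
    deep_up = upsample2x(deep)                 # match the shallow resolution
    gated = shallow * channel_attention(deep_up) * spatial_attention(deep_up)
    return shallow + gated                     # residual connection keeps the original signal

# Usage: a 4-channel shallow map at 8x8 and a deeper map at 4x4.
shallow = np.ones((4, 8, 8))
deep = np.zeros((4, 4, 4))
fused = top_down_attention(shallow, deep)
print(fused.shape)
```

A bottom-up counterpart could mirror this sketch by downsampling the shallower map and using its attention to gate the deeper one; that pairing is likewise an assumption about the general pattern rather than a description of BUFAM.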