Context features are mostly used to determine the boundary of a target, which helps locate an object more precisely. In this paper, we propose fusing a spatial attention mechanism with contextual features to mimic the way the human eye recognizes objects, thereby improving detection accuracy. We chose the general-purpose PASCAL VOC2007+2012 dataset to test the generality of our method and examined the accuracy gains of our proposed detector on various types of targets. Our method improved accuracy for small targets and partially overlapping targets. Overall, our proposed model improved the detector’s accuracy by 3.34%.
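
To make the idea of fusing spatial attention with contextual features concrete, the following is a minimal sketch in plain Python and not the architecture evaluated in this paper. It rests on several assumptions: the attention map is computed from channel-wise average and maximum responses passed through a sigmoid (a simplification of common spatial attention designs, with no learned convolution), the context features are approximated by a local neighborhood mean, and the fusion is an element-wise sum. The function names `spatial_attention`, `context_features`, and `fuse`, as well as the parameters `w_avg`, `w_max`, and `radius`, are hypothetical.

```python
import math

def _sigmoid(x):
    # Numerically stable logistic function.
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)

def spatial_attention(feat, w_avg=1.0, w_max=1.0):
    """feat: list of C channels, each an H x W nested list.
    Returns an H x W attention map with values in (0, 1).
    Assumption: fixed weights stand in for a learned convolution."""
    C, H, W = len(feat), len(feat[0]), len(feat[0][0])
    att = [[0.0] * W for _ in range(H)]
    for i in range(H):
        for j in range(W):
            vals = [feat[c][i][j] for c in range(C)]
            score = w_avg * sum(vals) / C + w_max * max(vals)
            att[i][j] = _sigmoid(score)
    return att

def context_features(feat, radius=1):
    """Assumed context: mean over a (2*radius+1)^2 window, clipped at borders."""
    C, H, W = len(feat), len(feat[0]), len(feat[0][0])
    ctx = [[[0.0] * W for _ in range(H)] for _ in range(C)]
    for c in range(C):
        for i in range(H):
            for j in range(W):
                rows = range(max(0, i - radius), min(H, i + radius + 1))
                cols = range(max(0, j - radius), min(W, j + radius + 1))
                window = [feat[c][r][s] for r in rows for s in cols]
                ctx[c][i][j] = sum(window) / len(window)
    return ctx

def fuse(feat, radius=1):
    """Assumed fusion: attention-weighted features plus context, element-wise."""
    att = spatial_attention(feat)
    ctx = context_features(feat, radius)
    C, H, W = len(feat), len(feat[0]), len(feat[0][0])
    return [[[att[i][j] * feat[c][i][j] + ctx[c][i][j]
              for j in range(W)] for i in range(H)] for c in range(C)]

# Usage: a toy feature map with 2 channels and 3 x 3 spatial size.
toy = [[[1.0] * 3 for _ in range(3)] for _ in range(2)]
fused = fuse(toy)
```

In this sketch the attention map rescales each spatial location, while the context term carries information from neighboring locations, so the fused output keeps the same shape as the input feature map and could be passed to a detection head unchanged.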