Existing deep-learning-based stereo matching networks lack multi-level, multi-module attention mechanisms for encoding and integrating feature information. We therefore propose an attention-guided aggregation stereo matching network that encodes and integrates information at multiple stages. Specifically, we design a residual feature-extraction network built on 2D channel attention blocks that adaptively recalibrates channel-wise weight responses, improving the robustness of the feature representation. We also construct a 3D stacked hourglass aggregation structure built on 3D channel attention blocks to recalibrate the weight response of the 4D cost volume along the channel dimension, further strengthening the network's guidance and aggregation capabilities. In addition, we introduce a 4D guided cost volume, which pre-groups the extracted image features and uses the similarity measure within each group to guide the concatenation-based features, enabling interactive learning within the cost volume. Experimental results on the Scene Flow and KITTI benchmark datasets show that the proposed network significantly improves disparity prediction accuracy with only a small increase in computation time.
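As a concrete illustration of the channel attention recalibration and the group-wise guidance described above, the following PyTorch-style sketch shows one possible form of these operations. The module names, channel counts, reduction ratio, and group count are assumptions made for illustration and do not reproduce the paper's exact implementation.

```python
# Minimal sketch (assumed shapes and hyperparameters, not the paper's code):
# (a) an SE-style 3D channel attention block applied to a 4D cost volume
#     stored as a 5D tensor (B, C, D, H, W), and
# (b) a guided cost volume that stacks group-wise correlation with
#     concatenated left/right features.
import torch
import torch.nn as nn


class ChannelAttention3D(nn.Module):
    """Squeeze-and-excitation style recalibration over the channel
    dimension of a cost volume tensor shaped (B, C, D, H, W)."""

    def __init__(self, channels: int, reduction: int = 4):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c = x.shape[:2]
        # Squeeze: global average over disparity and spatial dimensions.
        w = x.mean(dim=(2, 3, 4))            # (B, C)
        # Excite: per-channel weights in (0, 1), then rescale the volume.
        w = self.fc(w).view(b, c, 1, 1, 1)
        return x * w


def guided_cost_volume(left_feat, right_feat, max_disp: int, num_groups: int = 8):
    """Build a 4D cost volume (B, G + 2C, D, H, W) whose first G channels are a
    group-wise correlation 'guide' and whose remaining channels concatenate
    the shifted left/right features."""
    b, c, h, w = left_feat.shape
    volume = left_feat.new_zeros(b, num_groups + 2 * c, max_disp, h, w)
    for i in range(max_disp):
        l = left_feat[:, :, :, i:] if i > 0 else left_feat
        r = right_feat[:, :, :, :w - i] if i > 0 else right_feat
        # Group-wise correlation: split channels into groups and average the
        # per-group inner products as a similarity measure.
        corr = (l * r).view(b, num_groups, c // num_groups, h, w - i).mean(dim=2)
        volume[:, :num_groups, i, :, i:] = corr
        # Concatenation part of the volume, guided by the correlation groups.
        volume[:, num_groups:num_groups + c, i, :, i:] = l
        volume[:, num_groups + c:, i, :, i:] = r
    return volume


if __name__ == "__main__":
    fl = torch.randn(1, 32, 64, 128)     # left feature map (B, C, H, W)
    fr = torch.randn(1, 32, 64, 128)     # right feature map
    vol = guided_cost_volume(fl, fr, max_disp=48)
    vol = ChannelAttention3D(vol.shape[1])(vol)
    print(vol.shape)                     # torch.Size([1, 72, 48, 64, 128])
```

In this sketch, the correlation groups act as the similarity measure that guides the concatenated features, while the 3D channel attention block reweights the resulting volume channel by channel before aggregation.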