Stereo matching of high-resolution satellite stereo images is a fundamental and challenging task in photogrammetry, geoscience, and remote sensing. Deep learning-based methods have recently shown great potential for this task. However, these methods have not attracted much attention because their accuracy is still unsatisfactory, especially in regions affected by intractable factors (e.g., occlusions, repetitive patterns, low texture, non-texture, and disparity discontinuities). To tackle these challenging factors and improve disparity accuracy, we propose an end-to-end stereo matching network named Cascaded Multi-Scale Pyramid Network (CMSP-Net). First, to fully exploit multi-level spatial context and mitigate the ambiguous matching caused by repetitive patterns, low texture, non-texture, and occlusions, a cost volume pyramid is constructed from multi-scale image features, and each cost volume in the pyramid is aggregated with an hourglass structure to capture multi-scale context within a single cost volume. Second, considering that high-resolution cost volumes contain more detailed information but have small receptive fields, while low-resolution ones have large receptive fields but lack detail, we design an Attention-Guided Cost Fusion (AGCF) strategy that fuses the pyramid cost volumes, effectively integrating the advantages of multi-scale cost volumes while mitigating the adverse effects of semantic inconsistency across scales. Third, a disparity refinement module incorporates the reconstruction error and edge cues of the input satellite stereo images to further improve overall disparity accuracy. Extensive experiments on multiple satellite datasets demonstrate that the proposed method effectively reduces matching ambiguities and recovers sharp disparity discontinuities, achieving clear superiority over existing state-of-the-art methods.
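To make the fusion idea concrete, the following is a toy, framework-free sketch of attention-weighted fusion of multi-scale cost volumes followed by winner-take-all disparity selection. It is not the paper's AGCF implementation: the function names and shapes are illustrative, the attention logits are assumed to be given (in CMSP-Net they would come from a learned sub-network), and all volumes are assumed to be already resampled to a common resolution.

```python
import numpy as np

def softmax(x, axis=0):
    """Numerically stable softmax along `axis`."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention_fuse(volumes, scores):
    """Fuse same-resolution cost volumes with per-voxel attention weights.

    volumes: list of S arrays, each (D, H, W) -- pyramid cost volumes,
             assumed already upsampled to a common resolution.
    scores:  array (S, D, H, W) -- unnormalized attention logits
             (hypothetical input; learned by a sub-network in the paper).
    Returns a fused (D, H, W) cost volume.
    """
    vols = np.stack(volumes)        # (S, D, H, W)
    w = softmax(scores, axis=0)     # weights sum to 1 across scales per voxel
    return (w * vols).sum(axis=0)   # attention-weighted combination

def disparity_argmin(cost):
    """Winner-take-all disparity map from a fused (D, H, W) cost volume."""
    return cost.argmin(axis=0)
```

For example, if one scale's volume has its minimum cost at disparity 2 and another at disparity 3, logits that strongly favor the first scale yield a fused volume whose argmin is 2 everywhere; intermediate logits blend the two costs smoothly.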