Remote sensing image fusion aims to generate a high spatial resolution hyperspectral image (HR-HSI) by integrating a low spatial resolution hyperspectral image (LR-HSI) with a high spatial resolution multispectral image (HR-MSI). Convolutional Neural Networks (CNNs) have been widely applied to this HSI-MSI fusion problem, but their limited receptive field makes it difficult to capture global relationships within the feature maps. Transformers can model such relationships, but their computational complexity hinders their application to high-dimensional data such as hyperspectral images (HSIs). To address both limitations, we propose an HSI-MSI fusion method based on the Pyramid Swin Transformer (PSTF). The pyramid design of the PSTF extracts multi-scale information from the images. Its Spatial-Spectral Crossed Attention (SSCA) module comprises the Cross Spatial Attention (CSA) and Spectral Feature Integration (SFI) modules. The CSA module employs a cross-shaped self-attention mechanism, which models different spatial scales and non-local structures more flexibly than traditional convolutional layers. The SFI module introduces a global memory block (MB) to select the most relevant low-rank spectral vectors, integrating global spectral information with local spatial–spectral correlation to better extract and preserve spectral information. In addition, the Separate Feature Extraction (SFE) module processes the positive and negative parts of the shallow features independently, which strengthens the network’s feature representation, captures details and structures more effectively, and helps prevent the vanishing gradient problem. Experimental comparisons with state-of-the-art (SOTA) methods demonstrate the effectiveness of the PSTF method.
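To make the idea of processing positive and negative parts of shallow features independently more concrete, the following is a minimal sketch under stated assumptions, not the SFE module itself. The function names (`split_signed`, `linear`, `two_branch`), the use of one linear map per branch, and the recombination by subtraction are all hypothetical choices made for illustration; the abstract does not specify them.

```python
def split_signed(features):
    """Split a feature vector into its positive and negative parts.

    Both parts are non-negative, and features == pos - neg element-wise.
    """
    pos = [max(v, 0.0) for v in features]   # values above zero, else 0
    neg = [max(-v, 0.0) for v in features]  # magnitudes of values below zero, else 0
    return pos, neg


def linear(vec, weights):
    """Apply a linear map; `weights` holds one row per output channel."""
    return [sum(w * v for w, v in zip(row, vec)) for row in weights]


def two_branch(features, w_pos, w_neg):
    """Hypothetical two-branch step: each signed part gets its own weights.

    Assumption: the branches are recombined by subtraction. With
    w_pos == w_neg this reduces to a single linear map of `features`.
    """
    pos, neg = split_signed(features)
    out_pos = linear(pos, w_pos)
    out_neg = linear(neg, w_neg)
    return [p - n for p, n in zip(out_pos, out_neg)]


# Illustrative shallow feature vector for one pixel (made-up values).
shallow = [0.5, -1.25, 0.0, 2.0]
pos, neg = split_signed(shallow)

# With identity weights in both branches, the input is recovered exactly.
identity = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]
recombined = two_branch(shallow, identity, identity)
```

In this sketch neither branch discards information on its own sign range, which is one plausible reading of how separate processing could keep gradients flowing for both positive and negative activations; the two branches only add modeling capacity once their weights differ.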