Spectral super-resolution (SSR) aims to recover a hyperspectral image (HSI) from a single RGB image, a task on which deep learning has shown impressive performance. However, most existing deep-learning-based SSR methods model the spatial-spectral features of HSI inadequately: they sufficiently capture either the spatial correlations or the spectral self-similarity, but not both, which loses discriminative spatial-spectral features and thus limits the fidelity of the reconstructed HSI. To address this issue, we propose a novel SSR network dubbed the multistage spatial-spectral fusion network (MSFN). From the perspective of network design, we build a multistage U-Net-like architecture that captures the multiscale features of HSI along both the spatial and spectral dimensions. It incorporates two types of self-attention mechanisms, enabling the proposed network to model HSI globally and comprehensively. From the perspective of feature alignment, we design a spatial fusion module (SpatialFM) and a spectral fusion module (SpectralFM) to preserve the captured spatial correlations and spectral self-similarity. In this manner, the multiscale features are better fused and the accuracy of the reconstructed HSI is significantly enhanced. Quantitative and qualitative experiments on the two largest SSR datasets (i.e., NTIRE2022 and NTIRE2020) demonstrate that MSFN outperforms state-of-the-art SSR methods. The code will be released at https://github.com/Matsuri247/MSFN-for-Spectral-Super-Resolution.
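The distinction between the two self-attention types mentioned above can be illustrated with a minimal NumPy sketch (not the authors' implementation; all shapes, names, and the identity Q/K/V projections are illustrative assumptions): spatial-wise attention treats each pixel as a token and mixes information across spatial positions, while spectral-wise attention treats each band as a token and mixes information across channels.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(tokens):
    """Plain single-head self-attention over rows of `tokens` (N, d).
    Identity Q/K/V projections for brevity; real models learn them."""
    q = k = v = tokens
    scores = q @ k.T / np.sqrt(tokens.shape[1])
    return softmax(scores, axis=-1) @ v

# Toy feature map: H x W spatial grid with C spectral channels.
H, W, C = 4, 4, 8
feat = np.random.default_rng(0).normal(size=(H, W, C))

# Spatial-wise attention: each of the H*W pixels is a token of length C,
# so attention models correlations across spatial positions.
spatial_tokens = feat.reshape(H * W, C)          # (16, 8)
spatial_out = self_attention(spatial_tokens).reshape(H, W, C)

# Spectral-wise attention: each of the C bands is a token of length H*W,
# so attention models self-similarity across spectral channels.
spectral_tokens = feat.reshape(H * W, C).T       # (8, 16)
spectral_out = self_attention(spectral_tokens).T.reshape(H, W, C)

print(spatial_out.shape, spectral_out.shape)     # both (4, 4, 8)
```

Both paths return a feature map of the input's shape, which is what allows a network to interleave or fuse the two attention types stage by stage.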