Time series data are prevalent in the real world and play a crucial role in key domains such as meteorology, electricity, and finance. These data consist of observations recorded at historical time points; through in-depth analysis and modeling, researchers can use them to predict future trends and patterns and thereby support decision making. In current research, especially on long time series, effectively extracting long-term dependencies and integrating them with short-term features remains a significant challenge. Long-term dependencies are correlations between data points that lie far apart in a time series, whereas short-term features capture more recent changes. Understanding and correctly combining the two is crucial for constructing accurate and reliable predictive models. To extract and integrate them efficiently in long time series, this paper proposes MSFformer, a pyramid attention model based on multi-scale feature extraction. First, a coarser-scale construction module is designed to obtain coarse-grained information. It builds a pyramid data structure through feature convolution: the bottom layer represents the original data, and each subsequent layer contains feature information extracted across different time step lengths. As a result, nodes higher in the pyramid integrate information from more time points, such as every Monday or the beginning of each month, while nodes lower down retain their individual information. Second, a Skip-PAM is introduced, in which a node computes attention only with its neighboring nodes, its parent node, and its child nodes, which reduces the model's time complexity. Here, the child nodes are nodes selected from the next layer by skipping specific time steps.
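The pyramid and the sparse attention pattern described above can be illustrated with a small two-layer sketch. It is not the MSFformer implementation: the grouping rule in `skip_children`, the parameter names `skip`, `window`, and `neighbors`, and the use of a plain average in place of the learned feature convolution are all assumptions made for illustration.

```python
import numpy as np


def skip_children(j, skip, window):
    """Fine-layer indices feeding coarse node j (hypothetical grouping rule).

    The fine layer is cut into blocks of skip * window points. Inside a block,
    coarse node j collects `window` points spaced `skip` steps apart, e.g. the
    same weekday over four weeks when skip=7 and window=4.
    """
    block, offset = divmod(j, skip)
    start = block * skip * window + offset
    return [start + m * skip for m in range(window)]


def coarsen(x, skip, window):
    """Average each coarse node's children (stand-in for a learned convolution)."""
    x = np.asarray(x, dtype=float)
    assert len(x) % (skip * window) == 0, "length must be a multiple of skip * window"
    n_coarse = len(x) // window
    return np.array([x[skip_children(j, skip, window)].mean() for j in range(n_coarse)])


def skip_pam_mask(length, skip, window, neighbors=1):
    """Boolean mask over fine + coarse nodes; True means attention is allowed."""
    assert length % (skip * window) == 0, "length must be a multiple of skip * window"
    n_coarse = length // window
    mask = np.zeros((length + n_coarse, length + n_coarse), dtype=bool)

    # Intra-layer links: each node sees itself and its nearest neighbours.
    for start, size in ((0, length), (length, n_coarse)):
        for i in range(size):
            lo, hi = max(0, i - neighbors), min(size, i + neighbors + 1)
            mask[start + i, start + lo:start + hi] = True

    # Inter-layer links: parent <-> children chosen by skipping time steps.
    for j in range(n_coarse):
        for c in skip_children(j, skip, window):
            mask[length + j, c] = True
            mask[c, length + j] = True
    return mask


series = np.arange(8.0)
coarse = coarsen(series, skip=2, window=2)
mask = skip_pam_mask(len(series), skip=2, window=2)
```

For this toy series of 8 points, the mask allows 48 of the 144 possible node pairs, which shows where the saving over full attention comes from under these assumptions.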
In this study, we not only propose a new time series prediction model but also validate its effectiveness through comprehensive experiments: comparisons with baseline models, ablation experiments, and hyperparameter studies. The experimental results show that MSFformer improves on traditional Transformer models by 35.87% in MAE and 42.6% in MSE. These results highlight the strong performance of the proposed model on complex time series data, particularly in capturing long-term dependencies and integrating short-term features.
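For reference, MAE and MSE and a percentage improvement over a baseline can be computed as in the sketch below. It assumes that the reported improvement is the relative reduction in error with respect to the baseline, which the text does not state explicitly; the helper name `relative_improvement` is hypothetical.

```python
import numpy as np


def mae(y_true, y_pred):
    """Mean absolute error."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean(np.abs(y_true - y_pred)))


def mse(y_true, y_pred):
    """Mean squared error."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean((y_true - y_pred) ** 2))


def relative_improvement(baseline_error, model_error):
    """Percentage reduction in error relative to the baseline (assumed definition)."""
    return 100.0 * (baseline_error - model_error) / baseline_error
```

Under this definition, a model whose error is half that of the baseline has a relative improvement of 50%.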