Wind energy has attracted growing attention because it is sustainable and pollution-free. Because wind power output depends strongly on wind speed, wind speed forecasting is essential for fully harvesting wind energy and for ensuring the stability and reliability of wind energy systems. Accurate predictions of future wind speed enable efficient power scheduling and, consequently, better utilization of wind power. Although existing forecasting methods achieve satisfactory performance, they may still fail to uncover the intricate spatio-temporal dependencies hidden in entangled patterns. To address this issue and further improve prediction accuracy, we propose Mixformer, a novel mixture Transformer with hierarchical context for spatio-temporal wind speed forecasting. Mixformer first applies seasonal-trend decomposition, then uses an MLP to predict the trend component and an attention model to predict the seasonal component. In addition, Mixformer introduces a spatio-temporal Gaussian mixture attention (ST-GMA) layer, which fuses periodic temporal context with long-term spatial context. To capture intricate long-term spatial characteristics, Mixformer extracts global information using the Dynamic Time Warping algorithm. Building on the ST-GMA layer, Mixformer adopts a hierarchical encoder–decoder architecture that exploits context at different scales for seasonal forecasting and further strengthens spatio-temporal modeling. Experiments on four real-world benchmark datasets show that Mixformer yields the lowest prediction errors, with MAE scores of 1.66, 1.68, 3.14, and 0.70 and RMSE scores of 2.07, 2.01, 3.99, and 0.98. Compared with state-of-the-art methods, it improves prediction accuracy by 8.43%, 5.36%, 4.14%, and 6.57% in terms of MAE and by 9.18%, 6.92%, 4.01%, and 5.75% in terms of RMSE, which demonstrates the superiority of Mixformer.
In summary, Mixformer is a promising method for spatio-temporal wind speed forecasting and may achieve desirable performance in real-world wind power applications.
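
To make two of the building blocks named above more concrete, the following is a minimal sketch of seasonal-trend decomposition and of a Dynamic Time Warping (DTW) distance between two wind speed series. It is not Mixformer's implementation: the abstract does not state which decomposition is used or how the DTW output enters the model, so the sketch assumes a centered moving-average trend with edge padding and the classic DTW recurrence with absolute difference as the local cost. The function names, the window size, and the sample readings are all hypothetical.

```python
from typing import List, Sequence, Tuple


def moving_average_trend(series: Sequence[float], window: int) -> List[float]:
    """Centered moving average; the edges are padded by repeating end values."""
    if window < 1 or window % 2 == 0:
        raise ValueError("window must be a positive odd integer")
    half = window // 2
    padded = [series[0]] * half + list(series) + [series[-1]] * half
    return [sum(padded[i:i + window]) / window for i in range(len(series))]


def decompose(series: Sequence[float], window: int = 5) -> Tuple[List[float], List[float]]:
    """Split a series into a trend part and a seasonal (residual) part.

    Assumption: trend = moving average, seasonal = series - trend.
    """
    trend = moving_average_trend(series, window)
    seasonal = [x - t for x, t in zip(series, trend)]
    return trend, seasonal


def dtw_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Classic O(len(a) * len(b)) DTW with absolute difference as local cost."""
    n, m = len(a), len(b)
    inf = float("inf")
    # cost[i][j] = minimal accumulated cost of aligning a[:i] with b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(a[i - 1] - b[j - 1])
            cost[i][j] = local + min(
                cost[i - 1][j],      # a[i-1] also aligned with an earlier b
                cost[i][j - 1],      # b[j-1] also aligned with an earlier a
                cost[i - 1][j - 1],  # both series advance together
            )
    return cost[n][m]


if __name__ == "__main__":
    # Hypothetical wind speed readings (m/s) at two stations.
    station_a = [4.1, 4.5, 5.2, 6.0, 5.4, 4.8, 4.2, 4.6, 5.3, 6.1]
    station_b = [4.0, 4.2, 4.6, 5.3, 6.1, 5.5, 4.9, 4.3, 4.7, 5.4]

    trend, seasonal = decompose(station_a, window=5)
    print("trend:   ", [round(t, 2) for t in trend])
    print("seasonal:", [round(s, 2) for s in seasonal])
    print("DTW(a, b):", round(dtw_distance(station_a, station_b), 2))
```

In a pipeline of the kind described above, the trend part would be passed to an MLP and the seasonal part to the attention model. Pairwise DTW distances between stations are one plausible way to identify stations with similar long-term behavior even when their patterns are shifted in time.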