An effective network structure is essential for classifying satellite image time series (SITS). Deep learning models, especially architectures based on self-attention, have been widely used for SITS classification and have achieved impressive performance. However, the lack of efficient and comprehensive attention to the most valuable bands and to the structure of the time series limits their performance to some extent. To address this problem, this work proposes an end-to-end attention-aware dynamic self-aggregation network (ADSN) for SITS classification, which combines two main parts: spectral focusing and spectral–temporal feature learning. The core components of ADSN are the channel attention module and the dynamic self-aggregation block. Specifically, as the SITS passes through the channel attention module, informative bands adaptively receive high weights that increase their contribution, while less informative bands are down-weighted. In addition, the dynamic self-aggregation block, which integrates multiscale dynamic convolution and improved multihead attention in parallel, simultaneously captures long- and short-distance sequence structures and position relationships to better represent temporal information. In comparisons with random forest (RF) and seven deep learning algorithms, the proposed model learns spectral and temporal features effectively, and the experimental results show that ADSN achieves superior classification accuracy and generalization ability on two SITS datasets with extremely imbalanced samples.
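The band reweighting performed by the channel attention module can be illustrated with a minimal sketch. The abstract does not specify the module's internals, so the code below assumes a squeeze-and-excitation-style design: each band is averaged over time, passed through a small bottleneck network, and mapped by a sigmoid to a weight between 0 and 1. The function name `channel_attention`, the matrices `w1` and `w2`, and the toy dimensions are hypothetical, and the weights are random rather than trained.

```python
import math
import random


def channel_attention(series, w1, w2):
    """Reweight spectral bands of one pixel's time series.

    series: list of T time steps, each a list of C band values.
    w1: (hidden x C) and w2: (C x hidden) weight matrices (assumed, untrained).
    Returns the reweighted series and the per-band weights.
    """
    num_steps = len(series)
    num_bands = len(series[0])

    # Squeeze: summarize each band by its mean over the time axis.
    squeezed = [sum(step[c] for step in series) / num_steps
                for c in range(num_bands)]

    # Excitation: bottleneck layer with ReLU, then a sigmoid per band.
    hidden = [max(0.0, sum(w * s for w, s in zip(row, squeezed)))
              for row in w1]
    weights = [1.0 / (1.0 + math.exp(-sum(w * h for w, h in zip(row, hidden))))
               for row in w2]

    # Scale every time step of band c by that band's weight.
    reweighted = [[step[c] * weights[c] for c in range(num_bands)]
                  for step in series]
    return reweighted, weights


# Toy example: 6 acquisition dates, 4 spectral bands, bottleneck of size 2.
random.seed(0)
T, C, H = 6, 4, 2
series = [[random.random() for _ in range(C)] for _ in range(T)]
w1 = [[random.uniform(-1, 1) for _ in range(C)] for _ in range(H)]
w2 = [[random.uniform(-1, 1) for _ in range(H)] for _ in range(C)]

reweighted, band_weights = channel_attention(series, w1, w2)
```

In a trained network, `w1` and `w2` would be learned jointly with the classifier, so bands that help discrimination would tend toward weights near 1 and less useful bands toward weights near 0. The sketch covers only the spectral focusing step, not the dynamic self-aggregation block.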