Transformer-based deep learning methods have significantly advanced multivariate time series classification (MTSC). However, due to the inherent operation of the self-attention mechanism, most existing methods tend to overlook the internal local features and temporal invariance of time series, potentially resulting in a limited understanding of representation and context information within the model. In contrast to global features, local features are more specific and detailed, and are therefore better suited to capturing essential texture information and local structures of time series. To mitigate these problems, we propose CTNet, a novel network that enhances time series representation learning by reconstructing C̲rucial T̲imestamps, aiming to improve the ability to address MTSC tasks. Specifically, we introduce a novel Transformer encoder that incorporates a highly effective Gaussian-prior mechanism to accurately capture local dependencies. Additionally, we present a data-driven masking strategy that boosts the model's representation learning by reconstructing crucial timestamps. During the reconstruction process, we employ context-aware positional encoding to augment the temporal invariance of the model. Extensive experiments on 30 publicly available UEA datasets validate the superiority of CTNet over previous competitive methods. Furthermore, ablation studies and visualization analyses confirm the effectiveness of the proposed model.
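To illustrate the general idea behind a Gaussian-prior mechanism for local dependencies, the sketch below adds a distance-based Gaussian bias to raw attention logits before the softmax, so each timestamp attends more strongly to its temporal neighbors. This is a minimal, hypothetical illustration of the standard technique, not CTNet's actual formulation; the function name and the `sigma` parameter are assumptions introduced here for clarity.

```python
import math


def gaussian_prior_attention(scores, sigma=2.0):
    """Apply a Gaussian locality prior to self-attention scores.

    scores: T x T list of lists of raw (pre-softmax) attention logits.
    sigma:  width of the Gaussian prior; smaller values concentrate
            attention on nearby timestamps (hypothetical parameter).
    Returns the row-wise softmax of the biased scores.
    """
    T = len(scores)
    out = []
    for i in range(T):
        # Subtract a squared-distance penalty: positions far from i
        # receive lower logits, encoding a Gaussian locality prior.
        biased = [scores[i][j] - (i - j) ** 2 / (2 * sigma ** 2)
                  for j in range(T)]
        # Numerically stable softmax over the biased row.
        m = max(biased)
        exps = [math.exp(b - m) for b in biased]
        s = sum(exps)
        out.append([e / s for e in exps])
    return out
```

With uniform logits, the resulting weights peak at each query's own position and decay with temporal distance, which is the locality behavior a Gaussian prior is meant to induce.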