Since labeled samples are typically scarce in real-world scenarios, self-supervised representation learning for time series is critical. Existing approaches mainly employ the contrastive learning framework, which learns representations by distinguishing similar from dissimilar data pairs. However, they are constrained by their reliance on cumbersome sampling policies and prior knowledge for constructing pairs. Moreover, few works have focused on effectively modeling temporal-spectral correlations to improve the capacity of representations. In this article, we propose the cross reconstruction transformer (CRT) to address these issues. CRT achieves time series representation learning through a cross-domain dropping-reconstruction task. Specifically, we obtain the frequency domain of the time series via the fast Fourier transform (FFT) and randomly drop certain patches in both the time and frequency domains. Dropping is chosen over masking because it maximally preserves the global context, whereas masking introduces a distribution shift. A Transformer architecture is then used to discover the cross-domain correlations between temporal and spectral information by reconstructing the data in both domains, a procedure we call Dropped Temporal-Spectral Modeling. To discriminate the representations in the global latent space, we propose an instance discrimination constraint (IDC) that reduces the mutual information between different time series samples and sharpens the decision boundaries. Additionally, a tailored curriculum learning (CL) strategy, which progressively increases the dropping ratio during training, is employed to improve robustness during the pretraining phase. We conduct extensive experiments on multiple real-world datasets to evaluate the effectiveness of the proposed method. Results show that CRT consistently outperforms existing methods by 2%-9%. The code is publicly available at https://github.com/BobZwr/Cross-Reconstruction-Transformer.
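To make the cross-domain dropping step concrete, the following minimal NumPy sketch shows how a time series and its FFT spectrum could each be split into patches and have a random fraction of those patches dropped before reconstruction. The helper name `random_drop_patches`, the patch length, and the drop ratio are illustrative assumptions, not the authors' implementation; the linked repository contains the actual code.

```python
import numpy as np

def random_drop_patches(x, drop_ratio, patch_len, rng):
    """Split a 1-D array into patches and randomly drop a fraction of them.

    Returns the kept patches and their indices, so a decoder could be
    asked to reconstruct the dropped positions.
    """
    patches = x[: len(x) // patch_len * patch_len].reshape(-1, patch_len)
    n_keep = max(1, int(round(len(patches) * (1.0 - drop_ratio))))
    keep_idx = np.sort(rng.choice(len(patches), size=n_keep, replace=False))
    return patches[keep_idx], keep_idx

# Illustrative usage: drop patches in both the time and frequency domains.
rng = np.random.default_rng(0)
signal = rng.standard_normal(1024).astype(np.float32)

# Frequency-domain view via the FFT; real and imaginary parts kept as two channels.
spectrum = np.fft.rfft(signal)
freq_view = np.stack([spectrum.real, spectrum.imag], axis=-1).reshape(-1)

drop_ratio = 0.3  # a curriculum schedule would progressively increase this value
time_kept, time_idx = random_drop_patches(signal, drop_ratio, patch_len=32, rng=rng)
freq_kept, freq_idx = random_drop_patches(freq_view, drop_ratio, patch_len=32, rng=rng)
```

In this sketch, the kept patches from both domains would be fed to the Transformer encoder, and the reconstruction targets are the dropped patches; the curriculum learning strategy described above corresponds to raising `drop_ratio` over the course of pretraining.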