Deep learning approaches have achieved state-of-the-art performance in crop mapping. Recently, several studies have combined the strengths of two dominant deep learning architectures, Convolutional Neural Networks (CNNs) and Transformers, to classify crops from remote sensing images. Despite their success, many of these models rely on patch-based methods that require extensive data labeling, since each sample contains multiple pixels, each with its own label, which raises the cost of data preparation and processing. Moreover, previous methods have rarely considered the impact of missing values caused by clouds and observation gaps in remote sensing data. This study therefore proposes a lightweight multi-stage CNN-Transformer network (MCTNet) for pixel-based crop mapping using time-series Sentinel-2 imagery. MCTNet consists of several successive modules, each containing a CNN sub-module and a Transformer sub-module that extract complementary features from the images. An attention-based learnable positional encoding (ALPE) module is designed in the Transformer sub-module to capture the complex temporal relations in time-series data with varying missing rates. Arkansas and California in the U.S. are selected as study areas to evaluate the model. Experimental results show that MCTNet is lightweight, with the fewest parameters and the lowest memory usage, while achieving superior performance compared to eight advanced models. Specifically, MCTNet obtained an overall accuracy (OA) of 0.968, a kappa coefficient (Kappa) of 0.951, and a macro-averaged F1 score (F1) of 0.933 in Arkansas, and an OA of 0.852, a Kappa of 0.806, and an F1 of 0.829 in California. The results highlight the importance of each component of the model, particularly the ALPE module, which improved the Kappa of MCTNet by 4.2% in Arkansas and increased the model's robustness to missing values in remote sensing data.
Additionally, visualization results demonstrate that the features extracted by the CNN and Transformer sub-modules are complementary, which explains the effectiveness of MCTNet.