Facial expression recognition (FER) has attracted considerable attention due to its critical role in various computer vision tasks. However, existing FER approaches suffer from either noisy annotations or expression ambiguity (high inter-class similarity and low intra-class similarity), which limits FER performance. To this end, we propose a robust end-to-end collaborative-learning-based transformer for FER (CL-TransFER) in this paper. Specifically, CL-TransFER co-trains a CNN feature extractor and a transformer feature extractor to capture both rich local semantic features and global structural information from facial images. By enforcing consensus between the predictions of the two extractors, CL-TransFER suppresses the influence of noisy annotations. To further tackle expression ambiguity, we design a simple yet effective self-supervised masked reconstruction (SSMR) task to pre-train the transformer feature extractor of CL-TransFER, which enhances the model's capability to learn fine-grained discriminative representations. Extensive experiments on three popular benchmarks demonstrate the effectiveness and superiority of our method.
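The consensus mechanism described above can be illustrated with a minimal sketch. This is an assumption-laden simplification, not the paper's actual objective: we assume the co-training loss combines per-branch cross-entropy on the (possibly noisy) labels with a symmetric KL divergence that penalizes disagreement between the CNN and transformer predictions; the function names (`co_training_loss`, `symmetric_kl`) and the weighting `lam` are hypothetical.

```python
import numpy as np

def softmax(z):
    # numerically stable softmax over the last axis
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def cross_entropy(p, y):
    # mean negative log-likelihood of the (possibly noisy) labels y
    return -np.mean(np.log(p[np.arange(len(y)), y] + 1e-12))

def symmetric_kl(p, q):
    # consensus term: penalizes disagreement between the two branches
    kl = lambda a, b: np.sum(a * np.log((a + 1e-12) / (b + 1e-12)), axis=-1)
    return np.mean(kl(p, q) + kl(q, p))

def co_training_loss(logits_cnn, logits_vit, labels, lam=0.5):
    # hypothetical combined objective: supervised terms + consensus term
    p, q = softmax(logits_cnn), softmax(logits_vit)
    return (cross_entropy(p, labels) + cross_entropy(q, labels)
            + lam * symmetric_kl(p, q))
```

Under this sketch, a sample on which the two extractors strongly disagree (a likely noisy annotation) incurs a large consensus penalty, which is the intuition behind suppressing label noise via agreement.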