Facial expression recognition (FER) plays a crucial role in various applications, including human–computer interaction and affective computing. Jointly training an FER network on multiple datasets is a promising strategy for enhancing its performance; however, widespread annotation inconsistencies and class imbalances among FER datasets pose significant challenges to this approach. This paper proposes a new multi-dataset joint training method, Sample Selection and Paired Augmentation Joint Training (SSPA-JT), to address these challenges. SSPA-JT models annotation inconsistency as a label noise problem and selects clean samples from auxiliary datasets to expand the overall dataset size while maintaining a consistent annotation standard. Additionally, a dynamic matching algorithm is developed to pair clean samples of the tail classes with noisy samples, enriching the tail classes with diverse background information. Experimental results demonstrate that SSPA-JT achieves superior or comparable performance relative to existing methods by addressing both annotation inconsistency and class imbalance during multi-dataset joint training. It achieves state-of-the-art accuracy on the RAF-DB and CAER-S datasets, reaching 92.44% and 98.22%, respectively, which reflects improvements of 0.2% and 3.65% over existing methods.
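The two mechanisms named above can be illustrated with a minimal sketch. This is not the authors' algorithm: it assumes a common small-loss heuristic for clean-sample selection and a simple random pairing of tail-class clean samples with noisy samples; all function names, the `keep_ratio` parameter, and the sample representation are hypothetical.

```python
import random

def select_clean(samples, losses, keep_ratio=0.7):
    # Hypothetical small-loss heuristic: auxiliary samples with the lowest
    # training loss are kept as "clean" (annotation-consistent) samples.
    ranked = sorted(zip(samples, losses), key=lambda pair: pair[1])
    k = int(len(ranked) * keep_ratio)
    return [sample for sample, _ in ranked[:k]]

def pair_tail_with_noisy(clean, noisy, tail_classes):
    # Hypothetical pairing step: each tail-class clean sample is matched
    # with a noisy sample, so tail classes gain diverse background context.
    pairs = []
    for sample in clean:
        if sample["label"] in tail_classes and noisy:
            pairs.append((sample, random.choice(noisy)))
    return pairs
```

In a real pipeline, the selected clean samples would extend the primary dataset, and each (clean, noisy) pair would feed an augmentation step that composites the two inputs; the dynamic matching in SSPA-JT is more elaborate than the random choice shown here.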