In this paper, a novel skipping spatial–spectral–temporal network (S3T-Net) is developed to handle intra-individual differences in electroencephalogram (EEG) signals for accurate, robust, and generalized emotion recognition. In particular, a multi-branch architecture is proposed to learn spatial–spectral cross-domain representations from the 4D features extracted from the raw EEG signals, which enhances the generalization ability of the model. Time dependency among different spatial–spectral features is further captured via a bidirectional long short-term memory (Bi-LSTM) module, which employs an attention mechanism to integrate context information. Moreover, a skip-change unit is designed to add an auxiliary pathway for updating model parameters, which alleviates the vanishing-gradient problem in complex spatial–temporal networks. Evaluation results show that the proposed S3T-Net outperforms other advanced models in terms of emotion recognition accuracy, yielding performance improvements of 0.23%, 0.13%, and 0.43% over the sub-optimal model in three test scenarios, respectively. In addition, the effectiveness and superiority of the key components of S3T-Net are demonstrated through various experiments. As a reliable and competent emotion recognition model, the proposed S3T-Net contributes to the development of intelligent sentiment analysis in the human–computer interaction (HCI) field.
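To make the described pipeline concrete, the following is a minimal PyTorch sketch of an S3T-Net-style model: a multi-branch convolutional block over the spatial–spectral maps, an attention-weighted Bi-LSTM for temporal dependency, and a skip pathway that gives gradients an auxiliary route. All module names, dimensions, and hyper-parameters here are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch of the S3T-Net pipeline; shapes and layer sizes are
# assumptions for demonstration, since the abstract gives no implementation.
import torch
import torch.nn as nn


class S3TNetSketch(nn.Module):
    """S3T-Net-style model: multi-branch spatial-spectral feature learning,
    attention-weighted Bi-LSTM temporal modeling, and a skip pathway
    serving as an auxiliary route for parameter updates."""

    def __init__(self, n_bands=5, n_classes=3, hidden=64):
        super().__init__()
        # Multi-branch CNN: each branch sees the same spatial-spectral map
        # (bands x height x width) through a different receptive field.
        self.branches = nn.ModuleList([
            nn.Sequential(nn.Conv2d(n_bands, 16, k, padding=k // 2), nn.ReLU())
            for k in (1, 3, 5)
        ])
        self.pool = nn.AdaptiveAvgPool2d(1)  # 3 branches x 16 = 48-dim per step
        # Bi-LSTM captures time dependency among spatial-spectral features.
        self.bilstm = nn.LSTM(48, hidden, batch_first=True, bidirectional=True)
        # Additive attention integrates context over the time axis.
        self.attn = nn.Linear(2 * hidden, 1)
        # Skip pathway: pooled input features bypass the recurrent block,
        # easing gradient flow (stand-in for the paper's skip-change unit).
        self.skip = nn.Linear(48, 2 * hidden)
        self.classifier = nn.Linear(2 * hidden, n_classes)

    def forward(self, x):
        # x: (batch, time, bands, height, width) -- one 4D feature per step
        b, t = x.shape[:2]
        frames = x.flatten(0, 1)                            # (b*t, bands, h, w)
        feats = torch.cat([self.pool(br(frames)).flatten(1)
                           for br in self.branches], dim=1)  # (b*t, 48)
        feats = feats.view(b, t, -1)
        out, _ = self.bilstm(feats)                          # (b, t, 2*hidden)
        w = torch.softmax(self.attn(out), dim=1)             # attention weights
        context = (w * out).sum(dim=1)                       # (b, 2*hidden)
        context = context + self.skip(feats.mean(dim=1))     # auxiliary pathway
        return self.classifier(context)


model = S3TNetSketch()
logits = model(torch.randn(2, 10, 5, 9, 9))  # e.g., 10 EEG segments per trial
print(logits.shape)  # torch.Size([2, 3])
```

In this sketch the skip connection is a simple linear projection of the mean-pooled branch features added before classification; the actual skip-change unit may differ, but the point it illustrates is the same: a second path through which gradients reach the early layers.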