Rapid advances in deep learning have brought considerable progress to emotion recognition from electroencephalogram (EEG) signals. However, existing approaches focus primarily on the temporal- and frequency-domain features of EEG signals and neglect spatial-domain features when integrating information in the model. This study therefore proposes a novel EEG input format, the EEG spectral image (ESI), which integrates spatial-domain features via Azimuthal Equidistant Projection (AEP) and frequency-domain features via differential entropy (DE). To better exploit this fine-grained data format, we propose the EEG Swin Transformer (EEG-SWTNS), which combines a window attention mechanism with shifted window partitioning. The window attention mechanism concentrates on affective information within different scalp regions, while shifted window partitioning breaks up aggregated emotional representations to extract finer-grained features. Unlike traditional graph neural networks, this approach exploits the spatial locality of regional information, leading to improved performance in EEG-based emotion recognition. Experiments on the SEED and SEED IV datasets demonstrate superior performance over baseline methods and state-of-the-art models: relative improvements of 0.6% and 0.08% are observed in subject-dependent experiments, while accuracies of 80.07% and 66.72% are achieved in subject-independent experiments without using transfer learning techniques.
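To make the ESI construction above concrete, the following is a minimal sketch, assuming a standard Azimuthal Equidistant Projection of 3-D electrode coordinates, a Gaussian approximation for differential entropy, and cubic interpolation onto a fixed grid; the function names, grid size, and interpolation choices are illustrative and not the paper's exact implementation.

```python
import numpy as np
from scipy.interpolate import griddata

def azim_equidist_proj(xyz):
    """Project 3-D electrode coordinates onto a 2-D plane (AEP).
    xyz: (n_channels, 3) Cartesian positions on the scalp sphere."""
    x, y, z = xyz[:, 0], xyz[:, 1], xyz[:, 2]
    r = np.linalg.norm(xyz, axis=1)
    elev = np.arcsin(z / r)            # elevation angle
    azim = np.arctan2(y, x)            # azimuth angle
    rho = np.pi / 2 - elev             # distance from the projection centre
    return np.stack([rho * np.cos(azim), rho * np.sin(azim)], axis=1)

def differential_entropy(band_signal):
    """DE of a band-passed signal, assuming it is approximately Gaussian:
    DE = 0.5 * log(2 * pi * e * variance)."""
    return 0.5 * np.log(2 * np.pi * np.e * np.var(band_signal, axis=-1))

def build_esi(band_signals, xyz, grid_size=32):
    """Build one spectral image per frequency band.
    band_signals: (n_bands, n_channels, n_samples) band-passed EEG."""
    locs2d = azim_equidist_proj(xyz)
    xi = np.linspace(locs2d[:, 0].min(), locs2d[:, 0].max(), grid_size)
    yi = np.linspace(locs2d[:, 1].min(), locs2d[:, 1].max(), grid_size)
    gx, gy = np.meshgrid(xi, yi)
    images = []
    for band in band_signals:
        de = differential_entropy(band)                        # (n_channels,)
        img = griddata(locs2d, de, (gx, gy), method='cubic', fill_value=0.0)
        images.append(img)
    return np.stack(images)                                    # (n_bands, H, W)
```

The resulting multi-band image stack can then be fed to a Swin-style backbone, whose windowed and shifted-window attention operates over local spatial neighbourhoods of the scalp map.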