An electrocardiogram (ECG) consists of complex P-QRS-T waves. Manually analyzing long-term ECG recordings is time-consuming and error-prone for cardiologists. Deep neural networks (DNNs) can learn deep representations and enable automatic arrhythmia detection. In practice, however, DNNs usually suffer from the domain shift that exists between the training data and the testing data. This shift can be caused by the high variability of ECG signals between patients and by the variability of heartbeats within the same patient, which degrades performance and impedes the generalization of DNNs. To tackle this problem, we propose an unsupervised semantic-aware adaptive feature fusion network (USAFFN) that reduces the shift by alleviating the semantic distribution discrepancy between the feature spaces of the two domains. Furthermore, an ECG contains rich information from different perspectives (beat, rhythm, and frequency levels), which is essential for arrhythmia detection. Therefore, a multi-perspective adaptive feature fusion (MPAFF) module is introduced to extract informative ECG representations. Experimental results show that the detection performance of our approach is highly competitive with the upper bound of alternative methods on the ARDB, and its generalization is confirmed on the INCART and LTDB.
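To make the notion of a semantic distribution discrepancy between two feature spaces concrete, the following is a minimal sketch, not the loss used by USAFFN, which the text above does not specify. It assumes one common formulation: the mean squared distance between class-wise feature centroids of a labeled source domain and a pseudo-labeled target domain. The function names, the use of pseudo-labels, and the beat labels "N" and "V" are illustrative assumptions.

```python
from collections import defaultdict


def class_centroids(features, labels):
    """Return the mean feature vector for each class label."""
    groups = defaultdict(list)
    for vec, lab in zip(features, labels):
        groups[lab].append(vec)
    return {
        lab: [sum(col) / len(vecs) for col in zip(*vecs)]
        for lab, vecs in groups.items()
    }


def semantic_discrepancy(src_feats, src_labels, tgt_feats, tgt_pseudo_labels):
    """Mean squared Euclidean distance between same-class centroids.

    Assumption: target labels are pseudo-labels (e.g. model predictions),
    since the target domain is unlabeled in the unsupervised setting.
    Only classes present in both domains contribute.
    """
    src_c = class_centroids(src_feats, src_labels)
    tgt_c = class_centroids(tgt_feats, tgt_pseudo_labels)
    shared = sorted(set(src_c) & set(tgt_c))
    if not shared:
        return 0.0
    total = 0.0
    for lab in shared:
        total += sum((a - b) ** 2 for a, b in zip(src_c[lab], tgt_c[lab]))
    return total / len(shared)


# Toy 2-D features; "N" and "V" are hypothetical beat classes.
source = [[0.0, 0.0], [2.0, 0.0], [10.0, 10.0]]
source_labels = ["N", "N", "V"]
target = [[1.0, 2.0], [10.0, 10.0]]
target_pseudo = ["N", "V"]

gap = semantic_discrepancy(source, source_labels, target, target_pseudo)
```

Under this assumed formulation, a value of zero means that the class centroids of the two domains coincide, so minimizing it during training would pull same-class features from different patients or databases toward each other.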