Specific emitter identification is a challenge in the field of radar signal processing. Its aims to extract individual fingerprint features of the signal. However, early works are all designed using either signal or time–frequency image and heavily rely on the calculation of hand-crafted features or complex interactions in high-dimensional feature space. This paper introduces the time–frequency multimodal feature fusion network, a novel architecture based on multimodal feature interaction. Specifically, we designed a time–frequency signal feature encoding module, a wvd image feature encoding module, and a multimodal feature fusion module. Additionally, we propose a feature point filtering mechanism named FMM for signal embedding. Our algorithm demonstrates high performance in comparison with the state-of-the-art mainstream identification methods. The results indicate that our algorithm outperforms others, achieving the highest accuracy, precision, recall, and F1-score, surpassing the second-best by 9.3%, 8.2%, 9.2%, and 9%. Notably, the visual results show that the proposed method aligns with the signal generation mechanism, effectively capturing the distinctive fingerprint features of radar data. This paper establishes a foundational architecture for the subsequent multimodal research in SEI tasks.