Supervised learning methods for intelligent mechanical fault diagnosis have recently achieved considerable advances, but they rely heavily on labeled information and leave massive amounts of unlabeled signals unexploited. Consequently, self-supervised research has emerged that employs contrastive learning to mine general knowledge from unlabeled samples. However, these methods emphasize independent feature extraction and lack deep feature interaction, which limits representation learning and, in turn, restricts further improvement of fault diagnosis performance. To fill this gap, a novel self-supervised representation learning framework based on time-frequency alignment and interaction (TFAI) is proposed to improve diagnosis reliability under limited labeled data. The proposed TFAI model comprises a Transformer-based dual encoder and a cross-modal encoder, and takes paired time-frequency data as input for self-supervised pretraining. Two self-supervised learning strategies, namely time-frequency alignment and time-frequency interaction, enable the model to gain valuable knowledge from unlabeled data: the former leverages a time-frequency contrastive loss to drive the dual encoder to extract profound features, while the latter utilizes a time-frequency matching loss to guide the cross-modal encoder to perform deep feature fusion in an unsupervised manner. The well-pretrained TFAI model learns informative representations and can be fine-tuned for specific downstream fault diagnosis tasks to improve diagnosis accuracy. Three experimental cases on an axial flow pump system and a public gearbox platform validate the effectiveness, generalizability, and domain adaptability of the proposed method.
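The time-frequency alignment strategy described above can be illustrated with a minimal sketch. The code below assumes a symmetric InfoNCE-style formulation, where the time-domain and frequency-domain embeddings of the same signal form positive pairs and all other in-batch pairs serve as negatives; the exact loss, encoder outputs, and temperature used in the paper may differ. The function name `tf_contrastive_loss` and the `temperature` parameter are illustrative, not the authors' notation.

```python
import numpy as np

def tf_contrastive_loss(z_time, z_freq, temperature=0.07):
    """Symmetric InfoNCE-style alignment loss between time-domain and
    frequency-domain embeddings (illustrative sketch, not the paper's code).

    z_time, z_freq: (batch, dim) arrays; row i of each is an embedding of
    the same underlying sample, so (i, i) pairs are positives and all
    other in-batch pairs act as negatives.
    """
    # L2-normalize so the dot product becomes a cosine similarity
    z_time = z_time / np.linalg.norm(z_time, axis=1, keepdims=True)
    z_freq = z_freq / np.linalg.norm(z_freq, axis=1, keepdims=True)

    logits = z_time @ z_freq.T / temperature  # (batch, batch) similarity matrix

    def cross_entropy(l):
        # softmax cross-entropy with the diagonal (matched pairs) as targets
        l = l - l.max(axis=1, keepdims=True)  # numerical stability
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -np.diag(log_probs).mean()

    # symmetric: time->frequency and frequency->time retrieval directions
    return 0.5 * (cross_entropy(logits) + cross_entropy(logits.T))
```

As a sanity check, correctly paired embeddings should yield a lower loss than a batch whose frequency embeddings have been shuffled, which is what drives the dual encoder toward aligned time-frequency representations.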