The availability of high-quality, large-scale annotated data is a significant challenge in healthcare. In addition, privacy concerns and data-sharing restrictions often hinder access to large and diverse medical image datasets. To reduce the need for annotated training data, self-supervised pre-training strategies on unannotated data have been widely adopted, while federated learning enables collaborative model training without exchanging the underlying data. In this paper, we introduce a novel federated learning-based self-supervised spatial–temporal transformer fusion framework (SSFL) for cardiovascular image segmentation. A spatial–temporal Swin transformer is used to extract features from 3D short-axis (SAX) volumes across multiple phases covering the full cardiac cycle. An efficient self-supervised contrastive framework, consisting of a spatial–temporal transformer network with 25 encoders, is used to model the temporal features. The spatial and temporal features are fused and forwarded to the decoder for cardiac segmentation on cine MRI images. To further improve segmentation, we use an attention-based unpaired GAN model to transfer the style of ACDC images to M&Ms and include the synthetically generated volumes in the proposed self-supervised approach. Experiments on three cardiovascular image segmentation tasks, namely segmentation of the right ventricle, left ventricle, and myocardium, show significant improvements over state-of-the-art segmentation frameworks.
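
To make the spatial–temporal fusion step concrete, the following is a minimal PyTorch sketch, not the paper's implementation: the `SpatialTemporalFusion` class, `feat_dim`, and `n_phases` names are hypothetical, and the spatial encoder, temporal transformer, and decoder are simplified stand-ins for the Swin-based components described above. It only illustrates the overall flow, in which per-phase spatial features are combined with temporal features modelled across the cardiac cycle and passed to a segmentation decoder.

```python
import torch
import torch.nn as nn

class SpatialTemporalFusion(nn.Module):
    """Illustrative fusion of per-phase spatial features with
    temporal features modelled across the cardiac cycle."""

    def __init__(self, feat_dim=96, n_phases=25, n_classes=4):
        super().__init__()
        # Stand-in for the Swin-based spatial encoder (applied per phase).
        self.spatial_encoder = nn.Conv3d(1, feat_dim, kernel_size=3, padding=1)
        # Stand-in for the temporal transformer over the phase axis.
        layer = nn.TransformerEncoderLayer(d_model=feat_dim, nhead=4,
                                           batch_first=True)
        self.temporal_encoder = nn.TransformerEncoder(layer, num_layers=2)
        # Decoder head applied after fusing spatial + temporal features.
        self.decoder = nn.Conv3d(2 * feat_dim, n_classes, kernel_size=1)

    def forward(self, x):
        # x: (batch, phases, 1, D, H, W) -- cine MRI SAX volumes.
        b, t, c, d, h, w = x.shape
        spatial = self.spatial_encoder(x.reshape(b * t, c, d, h, w))
        f = spatial.shape[1]
        # Pool each phase to a token and model temporal dependencies.
        tokens = spatial.mean(dim=(2, 3, 4)).reshape(b, t, f)   # (b, t, F)
        temporal = self.temporal_encoder(tokens)                # (b, t, F)
        # Broadcast temporal context back to voxel resolution and fuse.
        temporal_map = temporal.reshape(b * t, f, 1, 1, 1).expand_as(spatial)
        fused = torch.cat([spatial, temporal_map], dim=1)       # (b*t, 2F, ...)
        logits = self.decoder(fused)                            # per-phase masks
        return logits.reshape(b, t, -1, d, h, w)


# Example: one cine sequence with 25 phases (toy spatial resolution).
model = SpatialTemporalFusion()
volume = torch.randn(1, 25, 1, 8, 32, 32)
print(model(volume).shape)  # torch.Size([1, 25, 4, 8, 32, 32])
```

In this sketch, fusion is a simple channel-wise concatenation of spatial and broadcast temporal features; the actual method may fuse them differently, and the federated and contrastive training loops are omitted entirely.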