Pre-training on large-scale datasets has profoundly advanced the development of deep learning models in medical image analysis. For medical image segmentation, however, collecting a large number of labeled volumetric medical images from multiple institutions is an enormous challenge due to privacy concerns. Self-supervised learning with masked image modeling (MIM) can learn general representations without annotations. Integrating MIM into federated learning (FL) enables collaborative learning of an effective pre-trained model from unlabeled data, followed by fine-tuning with limited annotations. However, using raw pixels as reconstruction targets, as in traditional MIM, fails to foster robust representation learning because of the complexity and distinct characteristics of medical images. Moreover, the generalization of the aggregated model in FL is impaired under heterogeneous data distributions across institutions. To address these issues, we propose a novel self-supervised federated learning framework that combines masked self-distillation with adaptive attention federated learning. This combination offers two key benefits. First, instead of reconstructing low-level pixels, masked self-distillation uses high-quality latent representations of masked tokens as targets, improving the descriptive capability of the learned representations. Second, adaptive attention aggregation with personalized federated learning effectively extracts institution-specific representations from the aggregated model, thereby improving local fine-tuning performance on target tasks. We conducted comprehensive experiments on two medical segmentation tasks using a large-scale dataset of volumetric medical images from multiple institutions, demonstrating superior performance over existing federated self-supervised learning approaches.
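To make the two mechanisms concrete, the sketches below show one plausible form of each in PyTorch. They are illustrations under our own assumptions, not the paper's actual implementation: the module names, the EMA teacher design, the smooth-L1 latent loss, and the similarity-based attention weights are all hypothetical choices for exposition.

```python
# A minimal sketch of masked self-distillation, assuming a ViT-style encoder
# that maps token sequences (B, N, D) -> (B, N, D). Names such as
# MaskedSelfDistillation and mask_ratio are illustrative only.
import copy
import torch
import torch.nn as nn
import torch.nn.functional as F

class MaskedSelfDistillation(nn.Module):
    def __init__(self, encoder: nn.Module, embed_dim: int, ema_decay: float = 0.996):
        super().__init__()
        self.student = encoder
        # The teacher is an EMA copy of the student and receives no gradients.
        self.teacher = copy.deepcopy(encoder)
        for p in self.teacher.parameters():
            p.requires_grad_(False)
        self.mask_token = nn.Parameter(torch.zeros(1, 1, embed_dim))
        self.ema_decay = ema_decay

    @torch.no_grad()
    def update_teacher(self):
        for ps, pt in zip(self.student.parameters(), self.teacher.parameters()):
            pt.mul_(self.ema_decay).add_(ps, alpha=1.0 - self.ema_decay)

    def forward(self, tokens: torch.Tensor, mask_ratio: float = 0.6) -> torch.Tensor:
        B, N, D = tokens.shape
        # Randomly select mask_ratio of the patch tokens to mask.
        ids = torch.rand(B, N, device=tokens.device).argsort(dim=1)
        mask = torch.zeros(B, N, dtype=torch.bool, device=tokens.device)
        mask.scatter_(1, ids[:, : int(N * mask_ratio)], True)

        # The teacher encodes the intact input to produce latent targets;
        # the student sees masked patches replaced by a learnable mask token.
        with torch.no_grad():
            target = self.teacher(tokens)
        corrupted = torch.where(mask.unsqueeze(-1), self.mask_token.to(tokens.dtype), tokens)
        pred = self.student(corrupted)

        # Align student predictions with teacher latents at masked positions only,
        # rather than regressing raw pixel values.
        return F.smooth_l1_loss(pred[mask], target[mask])

# Usage: a tiny transformer stands in for the volumetric backbone.
enc = nn.TransformerEncoder(nn.TransformerEncoderLayer(192, 4, batch_first=True), 2)
msd = MaskedSelfDistillation(enc, embed_dim=192)
loss = msd(torch.randn(2, 96, 192))
loss.backward()
msd.update_teacher()
```

For the second mechanism, one simple way to realize attention-based aggregation on the server is to weight each client model by a softmax over its parameter-wise similarity to the consensus; the specific weighting below is our assumption, not the paper's scheme.

```python
# A hedged sketch of adaptive attention aggregation over client state_dicts.
import torch
import torch.nn.functional as F

def adaptive_attention_aggregate(client_states: list, temperature: float = 0.1) -> dict:
    """Aggregate per-parameter, with attention weights from cosine similarity
    of each client's weights to the mean model. All state_dicts must share
    identical keys and shapes."""
    aggregated = {}
    for key in client_states[0]:
        stack = torch.stack([s[key].float().flatten() for s in client_states])  # (K, P)
        mean = stack.mean(dim=0, keepdim=True)
        # Clients closer to the consensus receive larger attention weights.
        logits = F.cosine_similarity(stack, mean, dim=1) / temperature          # (K,)
        weights = torch.softmax(logits, dim=0)
        aggregated[key] = (weights.unsqueeze(1) * stack).sum(dim=0).view_as(client_states[0][key])
    return aggregated
```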