In this study, we introduce the U-Swin fusion model, an effective and efficient transformer-based architecture for the fusion of multi-focus microscope images. We employ a Swin Transformer with shifted windows and patch merging as the encoder to extract hierarchical context features. A Swin-Transformer-based decoder with patch expanding performs the up-sampling operation to generate the fully focused image. To enhance the decoder's performance, skip connections concatenate the hierarchical encoder features with the up-sampled decoder features, as in U-Net. To enable comprehensive model training, we constructed a substantial dataset of multi-focus images, derived primarily from texture datasets. Our model demonstrates superior capability for multi-focus image fusion, producing fused images comparable to or better than those of existing state-of-the-art image fusion algorithms, and generalizes well to multi-focus microscope image fusion. Notably, for multi-focus microscope image fusion, the pure transformer-based U-Swin fusion model with channel-mix fusion rules delivers the best performance among most existing end-to-end fusion models.
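The U-shaped data flow described above (patch merging to downsample, patch expanding to upsample, and a skip connection joining matching resolutions) can be sketched in toy NumPy form. This is a minimal sketch under stated assumptions: the function names, the elementwise-max fusion rule, and the averaging skip combination are illustrative stand-ins, and the Swin attention blocks are omitted entirely.

```python
import numpy as np

def patch_merge(x):
    """2x downsampling by averaging 2x2 patches (stand-in for Swin patch merging)."""
    h, w = x.shape
    return x.reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3))

def patch_expand(x):
    """2x upsampling by nearest-neighbor repetition (stand-in for patch expanding)."""
    return x.repeat(2, axis=0).repeat(2, axis=1)

def u_shaped_fuse(a, b):
    """Toy U-shaped pipeline over two source images.

    Illustrates only the encoder/decoder/skip data flow; transformer
    attention and multi-level hierarchies are omitted.
    """
    fused = np.maximum(a, b)          # toy fusion rule (elementwise max)
    skip = fused                      # full-resolution feature kept for the skip
    bottleneck = patch_merge(fused)   # encoder: downsample
    up = patch_expand(bottleneck)     # decoder: upsample
    return 0.5 * (up + skip)          # skip connection: merge encoder/decoder paths

# Two complementary toy "focus" maps: where one is sharp, the other is blurred.
a = np.array([[1., 0.], [0., 1.]]).repeat(2, axis=0).repeat(2, axis=1)
b = 1.0 - a
out = u_shaped_fuse(a, b)
print(out.shape)  # (4, 4)
```

In the real model, the concatenation of skip and up-sampled features would be followed by learned Swin decoder blocks rather than a fixed average.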