Monkeypox is a zoonotic disease caused by a DNA virus with two distinct genetic lineages (clades), found in central and western Africa. Beyond zoonotic transmission through direct contact with the blood and bodily fluids of infected animals, monkeypox can also spread between humans through skin lesions and respiratory secretions. Monkeypox and chickenpox both produce skin lesions and can both spread through respiratory secretions, but they are caused by different viruses: monkeypox by an orthopoxvirus and chickenpox by the varicella-zoster virus. This study applies a patch-based vision transformer (ViT) model to the identification of monkeypox and chickenpox from color images of human skin. Using a transfer learning approach, the study investigates the ViT model's ability to discern subtle patterns indicative of monkeypox and chickenpox. The dataset was expanded with selected image augmentation techniques to improve the model's ability to generalize across diverse conditions. In evaluation, the patch-based ViT model achieved an accuracy, precision, recall, and F1 score of 93%. This result supports the practicality of sophisticated deep learning architectures, specifically vision transformers, for medical image analysis. Combining transfer learning with image augmentation both increases the model's sensitivity to monkeypox- and chickenpox-related features and mitigates the scarcity of training data. The model also outperformed state-of-the-art studies and pre-trained CNN-based models in accuracy.
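The "patch-based" part of a ViT refers to splitting each input image into fixed-size, non-overlapping square patches and flattening each patch into a vector before it is embedded as a token. The following is a minimal, illustrative sketch of that step only, not the study's implementation. The abstract does not state the input resolution or patch size, so the 224 × 224 input and 16 × 16 patches used below (the ViT-B/16 configuration) are assumptions, and the function name `patchify` is hypothetical. Plain Python lists are used so the sketch is self-contained; a real pipeline would operate on tensors.

```python
from typing import List

# An image as nested lists indexed [row][col][channel] (H x W x C).
Image = List[List[List[float]]]


def patchify(image: Image, patch_size: int) -> List[List[float]]:
    """Split an H x W x C image into non-overlapping patch_size x patch_size
    patches and flatten each one into a vector of length patch_size**2 * C.

    Patches are returned in raster order (left to right, top to bottom);
    pixels inside a patch are flattened row by row, channels innermost.
    """
    height, width = len(image), len(image[0])
    if height % patch_size or width % patch_size:
        raise ValueError("image height and width must be divisible by patch_size")

    patches = []
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            flat: List[float] = []
            for row in range(top, top + patch_size):
                for col in range(left, left + patch_size):
                    flat.extend(image[row][col])
            patches.append(flat)
    return patches


if __name__ == "__main__":
    # Assumed ViT-B/16 setting: 224 x 224 RGB input, 16 x 16 patches.
    size, channels, patch = 224, 3, 16
    dummy = [[[0.0] * channels for _ in range(size)] for _ in range(size)]
    tokens = patchify(dummy, patch)
    print(len(tokens), len(tokens[0]))  # 196 patches, each of length 768
```

Under these assumptions, a 224 × 224 image yields (224 / 16)² = 196 patches of 16 × 16 × 3 = 768 values each. In the standard ViT design, each vector is then linearly projected to a token embedding, a learnable class token is prepended, and position embeddings are added before the transformer encoder. With transfer learning, these layers start from pre-trained weights rather than a random initialization.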