In today’s AI-driven era, deep learning (DL) algorithms play a crucial role in automatically detecting life-threatening skin cancers, helping to improve survival rates. This makes DL-based skin cancer detection an active area of research. While much of the prior research has focused on single-model approaches, combining multiple models in an ensemble can improve classification accuracy. Previous studies relied mainly on deep convolutional neural networks (DCNNs), which are limited in their ability to capture global features. More recent work has introduced capsule networks (Caps-Net) and vision transformers (ViT) for more effective feature extraction. In this study, we use DCNN, Caps-Net, and ViT frameworks to extract diverse image embeddings. The resulting feature vectors serve as input for training an ensemble model based on a majority voting mechanism. The ensemble consists of five machine-learning models: Random Forest, XGBoost, SVM, KNN, and logistic regression. Incorporating this ensemble mechanism significantly improves overall performance. Notably, the proposed ensemble is lightweight and achieves an accuracy of 91.6% on the melanoma skin cancer dataset. These results indicate that the proposed mechanism outperforms individual state-of-the-art (SOTA) models.
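
The majority voting step described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `majority_vote`, the label encoding (1 = melanoma, 0 = benign), the prediction values, and the tie-breaking rule are all assumptions made for this example. It presumes that each of the five classifiers has already been trained on the extracted embeddings and has produced one label per image.

```python
from collections import Counter
from typing import Hashable, Sequence


def majority_vote(per_model_predictions: Sequence[Sequence[Hashable]]) -> list:
    """Combine label predictions from several classifiers by hard voting.

    per_model_predictions[m][i] is the label that model m assigns to sample i.
    Ties go to the tied label that was predicted by the earliest model in the
    list (an arbitrary choice made for this sketch).
    """
    if not per_model_predictions:
        raise ValueError("need at least one model's predictions")
    n_samples = len(per_model_predictions[0])
    if any(len(p) != n_samples for p in per_model_predictions):
        raise ValueError("all models must predict the same number of samples")

    voted = []
    # zip(*...) regroups the predictions so each tuple holds all votes
    # cast for a single sample.
    for sample_votes in zip(*per_model_predictions):
        # Counter preserves insertion order, so most_common(1) resolves ties
        # in favour of the label that appeared first among the votes.
        label, _count = Counter(sample_votes).most_common(1)[0]
        voted.append(label)
    return voted


# Hypothetical predictions from five classifiers on four images
# (assumed encoding: 1 = melanoma, 0 = benign; values are made up).
predictions = [
    [1, 0, 1, 0],  # Random Forest
    [1, 0, 0, 0],  # XGBoost
    [1, 1, 1, 0],  # SVM
    [0, 0, 1, 0],  # KNN
    [1, 0, 0, 1],  # logistic regression
]
ensemble_labels = majority_vote(predictions)
print(ensemble_labels)
```

With five voters and two classes a tie cannot occur, so the tie-breaking rule only matters if the number of models or classes changes.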