Background and objectives: Skin cancer is the most common cancer worldwide, and its malignant melanoma form can reduce life expectancy to less than five years. With early-stage detection and recognition, even the deadliest melanoma can be treated effectively, greatly increasing the patient's survival rate. Dermoscopy imaging can capture high-resolution magnified images of the affected skin region for automatic lesion classification, and deep learning networks have shown great potential for accurately recognizing different types of skin lesions. This study aims to develop a novel deep model that enhances skin lesion recognition performance.

Methods: Despite remarkable progress, existing deep-network-based methods naively transfer architectures designed for generic image classification to skin lesion classification, leaving considerable room for performance improvement. This study presents an enhanced deep bottleneck transformer model that incorporates self-attention to capture the global correlations among features extracted by conventional deep models, thereby boosting skin lesion classification performance. Specifically, we design an enhanced transformer module with a dual position encoding scheme that integrates encoded position vectors into both the key and query vectors for balanced learning. By replacing the spatial convolutions in the late-stage bottleneck blocks of baseline deep networks with this enhanced module, we construct a novel deep skin lesion classification model.

Results: We conduct extensive experiments on two benchmark skin lesion datasets, ISIC2017 and HAM10000, to evaluate the recognition performance of different deep models. On ISIC2017, our method achieves accuracy, sensitivity, and specificity of 92.1%, 90.1%, and 91.9%, respectively, demonstrating a good balance between sensitivity and specificity, while on HAM10000 it reaches 95.84% accuracy and 96.1% precision.

Conclusions: Results on both datasets demonstrate that our proposed model achieves superior performance over the baseline models as well as state-of-the-art methods. These superior results from combining transformer and convolution modules should inspire further research on applying transformer-based blocks to real-world scenarios without large-scale datasets.
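To make the dual position encoding concrete, the following is a minimal sketch (not the authors' released code) of a self-attention layer in which learned 2-D position encodings are injected into both the query and key paths, in contrast to a standard bottleneck transformer block that applies position information on the query side only. The class name, the factorised row/column position parameters, and the fixed feature-map size are illustrative assumptions:

```python
import torch
import torch.nn as nn

class DualPositionSelfAttention(nn.Module):
    """Self-attention over a CNN feature map with position encodings
    added to BOTH the query and key vectors (dual position encoding).
    Names, shapes, and the factorised encoding are assumptions made
    for illustration, not the paper's exact implementation."""

    def __init__(self, dim, heads=4, feat_size=(14, 14)):
        super().__init__()
        self.heads = heads
        self.scale = (dim // heads) ** -0.5
        # 1x1 convolution producing query, key, and value maps.
        self.to_qkv = nn.Conv2d(dim, dim * 3, kernel_size=1, bias=False)
        h, w = feat_size
        # Learned 2-D position encodings, factorised over rows and columns.
        self.pos_h = nn.Parameter(torch.randn(1, dim, h, 1) * 0.02)
        self.pos_w = nn.Parameter(torch.randn(1, dim, 1, w) * 0.02)

    def forward(self, x):
        b, c, h, w = x.shape
        q, k, v = self.to_qkv(x).chunk(3, dim=1)
        pos = self.pos_h + self.pos_w  # broadcasts to (1, c, h, w)
        # Dual position encoding: inject positions into query AND key.
        q = q + pos
        k = k + pos

        # Flatten spatial dims and split heads: (b, heads, h*w, c//heads).
        def split(t):
            return t.reshape(b, self.heads, c // self.heads, h * w).transpose(-2, -1)

        q, k, v = map(split, (q, k, v))
        attn = (q @ k.transpose(-2, -1)) * self.scale  # (b, heads, hw, hw)
        attn = attn.softmax(dim=-1)
        out = attn @ v                                  # (b, heads, hw, c//heads)
        return out.transpose(-2, -1).reshape(b, c, h, w)
```

In a bottleneck-transformer-style design, a layer like this would replace the 3x3 spatial convolution inside the last-stage residual bottleneck blocks, so the convolutional stem still supplies local features while the self-attention module models their global correlations.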