A Novel Vision Transformer Model for Skin Cancer Classification

Guang Yang,Suhuai Luo,Peter Greer

doi:10.1007/s11063-023-11204-5

Abstract

Skin cancer can be fatal if it is found to be malignant. Modern diagnosis of skin cancer heavily relies on visual inspection through clinical screening, dermoscopy, or histopathological examinations. However, due to similarity among cancer types, it is usually challenging to identify the type of skin cancer, especially at its early stages. Deep learning techniques have been developed over the last few years and have achieved success in helping to improve the accuracy of diagnosis and classification. However, the latest deep learning algorithms still do not provide ideal classification accuracy. To further improve the performance of classification accuracy, this paper presents a novel method of classifying skin cancer in clinical skin images. The method consists of four blocks. First, class rebalancing is applied to the images of seven skin cancer types for better classification performance. Second, an image is preprocessed by being split into patches of the same size and then flattened into a series of tokens. Third, a transformer encoder is used to process the flattened patches. The transformer encoder consists of N identical layers with each layer containing two sublayers. Sublayer one is a multihead self-attention unit, and sublayer two is a fully connected feed-forward network unit. For each of the two sublayers, a normalization operation is applied to its input, and a residual connection of its input and its output is calculated. Finally, a classification block is implemented after the transformer encoder. The block consists of a flattened layer and a dense layer with batch normalization. Transfer learning is implemented to build the whole network, where the ImageNet dataset is used to pretrain the network and the HAM10000 dataset is used to fine-tune the network. Experiments have shown that the method has achieved a classification accuracy of 94.1%, outperforming the current state-of-the-art model IRv2 with soft attention on the same training and testing datasets. On the Edinburgh DERMOFIT dataset also, the method has better performance compared with baseline models.

Full Text