Abstract

Histopathological image classification is a fundamental task in the pathological diagnosis workflow. It remains highly challenging due to the complexity of histopathological images. Recently, hybrid methods combining convolutional neural networks (CNNs) with vision transformers (ViTs) have been proposed for this task. These methods represent both global and local contextual information well and achieve excellent classification performance. However, downsampling operations such as max-pooling, which ignore the sampling theorem, transmit jagged artifacts into the transformer and cause aliasing. This makes subsequent feature maps focus on incorrect regions and degrades the final classification results. In this work, we propose an enhanced vision transformer with wavelet position embedding to tackle this challenge. In particular, a wavelet position embedding module, which introduces the wavelet transform into position embedding, is employed to enhance the smoothness of discontinuous feature information by decomposing sequences into amplitude and phase in pathological feature maps. In addition, an external multi-head attention is proposed to replace self-attention in the transformer block with two linear layers. It reduces the computational cost and exploits potential correlations between different samples. We evaluate the proposed method on three challenging public histopathological classification datasets and perform a quantitative comparison with previous state-of-the-art methods. The results empirically demonstrate that our method achieves the best accuracy. Furthermore, it has the fewest parameters and very low FLOPs. In conclusion, the enhanced vision transformer shows strong classification performance and demonstrates significant potential for assisting pathologists in pathological diagnosis.
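The external multi-head attention mentioned above replaces self-attention's query-key-value interaction with two small learnable memory units implemented as linear layers, which is what makes its cost linear in the number of tokens. A minimal single-head sketch in NumPy, assuming the double-normalization scheme of external attention (the names `mk`, `mv`, and the memory size `s` are illustrative, not taken from the paper):

```python
import numpy as np

def external_attention(x, mk, mv):
    """Single-head external attention sketch.

    x:  (n, d) token features for one sample.
    mk: (s, d) learnable key memory (first linear layer).
    mv: (s, d) learnable value memory (second linear layer).
    Cost is O(n * s * d), linear in n, versus O(n^2 * d) for self-attention.
    """
    attn = x @ mk.T                                  # (n, s) similarities to memory keys
    # Double normalization: softmax over the token axis, then row-normalize.
    attn = np.exp(attn - attn.max(axis=0, keepdims=True))
    attn = attn / attn.sum(axis=0, keepdims=True)    # softmax across tokens
    attn = attn / (attn.sum(axis=1, keepdims=True) + 1e-9)
    return attn @ mv                                 # (n, d) attended features

# Toy usage with random weights (in practice mk, mv are trained parameters).
rng = np.random.default_rng(0)
x = rng.standard_normal((16, 32))    # 16 tokens, 32-dim features
mk = rng.standard_normal((8, 32))    # memory size s = 8 << n
mv = rng.standard_normal((8, 32))
out = external_attention(x, mk, mv)  # shape (16, 32)
```

Because the memories `mk` and `mv` are shared across all samples in the dataset rather than computed per input, they can capture correlations between different samples, which the abstract cites as a benefit over self-attention.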
