Cervical cancer is a major global health issue, particularly in developing countries where access to healthcare is limited. Early detection of pre-cancerous lesions is crucial for successful treatment and for reducing mortality. However, traditional screening and diagnosis require cytopathologists to manually interpret a huge number of cells, which is time-consuming, costly, and dependent on the individual reader's experience. In this paper, we propose a Multi-scale Window Transformer (MWT) for cervical cytopathology image recognition. We design multi-scale window multi-head self-attention (MW-MSA) to integrate cell features at different scales simultaneously. Small-window self-attention extracts local cell detail features, while large-window self-attention integrates the features produced by the smaller-scale window attention to achieve window-to-window information interaction. This design enables long-range feature integration while avoiding both the whole-image self-attention (SA) used in ViT and the two rounds of local-window SA used in Swin Transformer. We also find that convolutional feed-forward networks (CFFN) are more efficient than the original MLP-based FFN for representing cytopathology images. The overall model adopts a pyramid architecture. We establish two multi-center cervical cell classification datasets: a two-category dataset of 192,123 images and a four-category dataset of 174,138 images. Extensive experiments show that MWT outperforms state-of-the-art general classification networks and specialized classifiers for cytopathology images on both the internal and external test sets. The results on these large-scale datasets demonstrate the effectiveness and generalization ability of the proposed model. Our work provides a reliable cytopathology image recognition method and helps establish computer-aided screening for cervical cancer. Our code is available at https://github.com/nmyz669/MWT, and our web service tool can be accessed at https://huggingface.co/spaces/nmyz/MWTdemo.
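To make the multi-scale window idea concrete, the following is a minimal, dependency-free sketch of one possible reading of it: self-attention is first applied inside small windows, and its output is then fed to self-attention inside larger windows so that neighbouring small windows exchange information. It is not the released MWT implementation. The function names, the window sizes (2 and 4), the scalar tokens, and the identity query/key/value projections are all assumptions made for illustration, and the sketch omits multiple heads, learned weights, and the CFFN.

```python
import math


def window_partition(grid, win):
    """Split an H x W grid (list of rows) into non-overlapping win x win windows.

    Each window is returned as a flat list of tokens in row-major order.
    """
    h, w = len(grid), len(grid[0])
    assert h % win == 0 and w % win == 0, "grid size must be divisible by win"
    windows = []
    for top in range(0, h, win):
        for left in range(0, w, win):
            windows.append([grid[r][c]
                            for r in range(top, top + win)
                            for c in range(left, left + win)])
    return windows


def self_attention(tokens):
    """Softmax self-attention over scalar tokens.

    Assumption: queries, keys, and values are the tokens themselves
    (identity projections), purely to keep the sketch short.
    """
    out = []
    for q in tokens:
        scores = [q * k for k in tokens]
        m = max(scores)  # subtract the max for numerical stability
        weights = [math.exp(s - m) for s in scores]
        total = sum(weights)
        out.append(sum(wt * v for wt, v in zip(weights, tokens)) / total)
    return out


def windowed_attention(grid, win):
    """Apply self-attention independently inside each win x win window."""
    h, w = len(grid), len(grid[0])
    result = [[0.0] * w for _ in range(h)]
    windows = iter(window_partition(grid, win))
    for top in range(0, h, win):
        for left in range(0, w, win):
            attended = iter(self_attention(next(windows)))
            for r in range(top, top + win):
                for c in range(left, left + win):
                    result[r][c] = next(attended)
    return result


def multi_scale_window_attention(grid, small=2, large=4):
    """Hypothetical two-scale pass: small windows first, then large windows."""
    local = windowed_attention(grid, small)   # local cell detail
    return windowed_attention(local, large)   # window-to-window interaction


if __name__ == "__main__":
    feature_map = [[float(r * 4 + c) for c in range(4)] for r in range(4)]
    for row in multi_scale_window_attention(feature_map):
        print([round(v, 3) for v in row])
```

In this toy setting, each token attends to 4 tokens in the small-window pass and 16 in the large-window pass, so the cost per token is bounded by the window size rather than by the full image size.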