Microstructural classification is key to a better understanding of the relationships between the microstructures of carbon steels and their properties, as well as to the development of automatic models that categorize microstructures reliably and efficiently, thereby greatly reducing the burden on human experts. Methods based on transfer learning with deep neural networks, such as convolutional neural networks (CNNs), have become established for the automatic microstructural classification of carbon steels. However, vision transformers (ViTs), a newer class of deep learning models, have recently emerged as a competitive alternative to CNNs in many other fields but have not yet been considered for this task. In this study, micrographs from an open-source dataset of ultrahigh-carbon steel are analyzed with two CNNs (GoogLeNet and MobileNetV2) and two ViTs (ViT-B16 and ViT-L32). These deep neural networks are used as end-to-end classifiers to discriminate between different microstructures. The ViTs achieved promising results, performing as well as, or better than, the CNNs, though not by a large margin. These results suggest that ViTs are a viable alternative to CNNs for advancing automatic microstructural classification.
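To make the transfer-learning setup concrete, the following is a minimal sketch of fine-tuning an ImageNet-pretrained ViT as an end-to-end microstructure classifier. It is not the authors' implementation; the library choice (PyTorch with timm), the model identifier, and the number of classes are assumptions made for illustration only.

```python
# Illustrative transfer-learning sketch (not the paper's code).
# Assumes PyTorch + timm; model name and class count are placeholders.
import timm
import torch
import torch.nn as nn

NUM_CLASSES = 4  # hypothetical number of microconstituent classes

# Load an ImageNet-pretrained ViT-B/16 and replace its classification head.
model = timm.create_model("vit_base_patch16_224", pretrained=True,
                          num_classes=NUM_CLASSES)

criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

def train_one_epoch(loader, device="cuda"):
    """One pass over a DataLoader yielding (micrograph batch, label batch)."""
    model.to(device).train()
    for images, labels in loader:
        images, labels = images.to(device), labels.to(device)
        optimizer.zero_grad()
        loss = criterion(model(images), labels)
        loss.backward()
        optimizer.step()
```

The same wrapper applies unchanged to a CNN backbone (e.g., swapping in a pretrained MobileNetV2 via `timm.create_model("mobilenetv2_100", ...)`), which is what makes a like-for-like comparison between CNN and ViT classifiers straightforward.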