Various deep learning models based on convolutional neural networks (CNNs) have been shown to improve the detection of early esophageal neoplasia (EON) in Barrett's esophagus (BO). The vision transformer (ViT), an architecture adapted from natural language processing, has emerged as a new state of the art in image recognition, outperforming predecessors such as CNNs. This pilot study explores the use of ViT to classify the presence or absence of EON in endoscopic images of BO. A dataset of 1918 endoscopic images of BO from 267 unique patients was used, with each image classified as dysplastic (D-BO) or non-dysplastic (ND-BO). A pretrained vision transformer model, ViTBase16, was used to develop our classifier models. Three ViT models were trained for comparison according to imaging modality: white-light imaging (WLI), narrow-band imaging (NBI), and combined modalities. The performance of each model was evaluated using accuracy, sensitivity, specificity, confusion matrices, and receiver operating characteristic (ROC) curves. The models achieved the following performance: WLI-ViT (accuracy 92%, sensitivity 82%, specificity 95%), NBI-ViT (accuracy 99%, sensitivity 97%, specificity 99%), and combined-modalities ViT (accuracy 93%, sensitivity 87%, specificity 95%). The combined-modalities ViT showed greater accuracy (94% vs 90%) and sensitivity (80% vs 70%) than WLI-ViT when classifying WLI images in a subgroup testing set. ViT exhibited high accuracy in classifying the presence or absence of EON in endoscopic images of BO, and it has the potential to be widely applicable to other endoscopic diagnoses of gastrointestinal diseases.
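The classifier setup described above follows a standard transfer-learning recipe: take a ViT-Base/16 backbone pretrained on natural images and replace its classification head with a two-class output (D-BO vs. ND-BO). The sketch below illustrates this pattern; the `timm` checkpoint name, `ImageFolder` directory layout, batch size, learning rate, and epoch count are illustrative assumptions, not details reported in the study.

```python
# Minimal sketch of fine-tuning a pretrained ViT-Base/16 for binary
# classification of endoscopic images (dysplastic vs. non-dysplastic).
# Paths, hyperparameters, and the directory layout (e.g. "data/train/D-BO",
# "data/train/ND-BO") are illustrative assumptions.
import torch
from torch import nn
from torch.utils.data import DataLoader
from torchvision import datasets, transforms
import timm

# Standard ImageNet preprocessing at the 224x224 input size of ViT-B/16.
preprocess = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

# Hypothetical training split: one subfolder per class label.
train_set = datasets.ImageFolder("data/train", transform=preprocess)
train_loader = DataLoader(train_set, batch_size=32, shuffle=True)

# Load ViT-B/16 pretrained on ImageNet and swap in a 2-class head.
model = timm.create_model("vit_base_patch16_224", pretrained=True,
                          num_classes=2)
device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device)

criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

model.train()
for epoch in range(5):  # epoch count is an arbitrary placeholder
    for images, labels in train_loader:
        images, labels = images.to(device), labels.to(device)
        optimizer.zero_grad()
        loss = criterion(model(images), labels)
        loss.backward()
        optimizer.step()
```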
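The reported metrics can be computed from a held-out test set as sketched below. Here `model`, `preprocess`, and `device` are assumed to be defined as in the training sketch; the test-set path, batch size, and the 0.5 decision threshold are likewise illustrative, not taken from the study.

```python
# Sketch of the evaluation described in the abstract: accuracy, sensitivity,
# specificity, a confusion matrix, and an ROC curve on a held-out test set.
import numpy as np
import torch
from torch.utils.data import DataLoader
from torchvision import datasets
from sklearn.metrics import confusion_matrix, roc_curve, auc

# Hypothetical held-out split, preprocessed as in the training sketch.
test_set = datasets.ImageFolder("data/test", transform=preprocess)
test_loader = DataLoader(test_set, batch_size=32, shuffle=False)

model.eval()
probs, labels = [], []
with torch.no_grad():
    for images, targets in test_loader:
        logits = model(images.to(device))
        # Probability of the positive (dysplastic) class.
        probs.extend(torch.softmax(logits, dim=1)[:, 1].cpu().numpy())
        labels.extend(targets.numpy())
probs, labels = np.array(probs), np.array(labels)
preds = (probs >= 0.5).astype(int)  # threshold is an assumption

tn, fp, fn, tp = confusion_matrix(labels, preds).ravel()
accuracy = (tp + tn) / (tp + tn + fp + fn)
sensitivity = tp / (tp + fn)   # true-positive rate on dysplastic images
specificity = tn / (tn + fp)   # true-negative rate on non-dysplastic images

fpr, tpr, _ = roc_curve(labels, probs)
print(f"accuracy={accuracy:.2f} sensitivity={sensitivity:.2f} "
      f"specificity={specificity:.2f} AUC={auc(fpr, tpr):.2f}")
```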