Gastric structure recognition systems have become increasingly necessary for the accurate diagnosis of gastric lesions in capsule endoscopy. Deep learning, especially with transformer models, has shown great potential for recognizing gastrointestinal (GI) images through self-attention. This study aimed to establish a model for identifying gastric structures in capsule endoscopy and thereby improve the clinical applicability of deep learning to endoscopic image recognition. A total of 3343 wireless capsule endoscopy videos collected at Nanfang Hospital between 2011 and 2021 were used for unsupervised pretraining, 2433 for training, and 118 for validation. Fifteen upper GI structures were selected for quantifying examination quality. We compared the classification performance of the artificial intelligence (AI) model with that of endoscopists in terms of accuracy, sensitivity, specificity, and positive and negative predictive values. In identifying the 15 upper GI structures, the transformer-based AI model achieved a macroaverage accuracy of 99.6% (95% CI: 99.5-99.7), a macroaverage sensitivity of 96.4% (95% CI: 95.3-97.5), and a macroaverage specificity of 99.8% (95% CI: 99.7-99.9), and showed a high level of interobserver agreement with endoscopists. The transformer-based AI model can accurately evaluate gastric structure information in capsule endoscopy with performance comparable to that of endoscopists, which could help doctors make diagnoses from large numbers of images and improve the efficiency of examination.
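The abstract does not state how the macroaverage metrics were computed. As an illustration only, the sketch below uses the common one-vs-rest definition: each structure in turn is treated as the positive class, per-class accuracy, sensitivity, and specificity are computed from a multiclass confusion matrix, and the per-class values are averaged with equal weight. The function name `macro_average_metrics` and the toy three-class matrix are hypothetical and are not taken from the study.

```python
from typing import List, Tuple


def macro_average_metrics(confusion: List[List[int]]) -> Tuple[float, float, float]:
    """Return macro-averaged (accuracy, sensitivity, specificity), one-vs-rest.

    confusion[i][j] is the number of images whose true class is i and whose
    predicted class is j. Assumes every class has at least one true example
    and at least one example from another class (otherwise a ratio is undefined).
    """
    n_classes = len(confusion)
    total = sum(sum(row) for row in confusion)
    acc_sum = sens_sum = spec_sum = 0.0

    for k in range(n_classes):
        tp = confusion[k][k]                                      # class k predicted as k
        fn = sum(confusion[k]) - tp                               # class k predicted as another class
        fp = sum(confusion[i][k] for i in range(n_classes)) - tp  # other classes predicted as k
        tn = total - tp - fn - fp                                 # everything else

        acc_sum += (tp + tn) / total
        sens_sum += tp / (tp + fn)
        spec_sum += tn / (tn + fp)

    # Equal weight per class, regardless of how many images each class has.
    return acc_sum / n_classes, sens_sum / n_classes, spec_sum / n_classes


# Toy example with three hypothetical classes (not study data).
toy_confusion = [
    [8, 1, 1],
    [0, 9, 1],
    [1, 0, 9],
]
accuracy, sensitivity, specificity = macro_average_metrics(toy_confusion)
print(f"accuracy={accuracy:.3f} sensitivity={sensitivity:.3f} specificity={specificity:.3f}")
```

Under this definition, true negatives dominate each one-vs-rest comparison when there are many classes, so macroaverage accuracy and specificity can sit well above macroaverage sensitivity, which matches the pattern in the reported figures.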