The tea industry, as one of the most globally important agricultural products, is characterized by pests and diseases that pose a serious threat to yield and quality. These diseases and pests often present different scales and morphologies, and some pest and disease target sizes can be tiny and difficult to detect. To solve these problems, we propose TeaViTNet, a multi-scale attention-based tea pest and disease detection model that combines CNNs and Transformers. First, MobileViT is used as the feature extraction backbone network. MobileViT captures and analyzes the tiny pest and disease features in the image via a self-attention mechanism and global feature extraction. Second, the EMA-PANet network is introduced to optimize the model’s learning and attention to the Apolygus lucorum and leaf blight regions via an efficient multi-scale attention module with cross-space learning, which improves the model’s ability to understand multi-scale information. In addition, RFBNet is embedded in the module to further expand the perceptual range and effectively capture the information of tiny features in tea leaf images. Finally, the ODCSPLayer convolutional block is introduced, aiming to focus on acquiring richer gradient flow information. The experimental results show that the TeaViTNet model proposed in this paper has an average accuracy of 89.1%, which is a significant improvement over the baseline network MobileViT and is capable of accurately detecting Apolygus lucorum and leaf blight of different scales and complexities.