Insect pests inflict significant losses in agriculture, substantially increasing the demand for automated pest detection and early pest management during cultivation. However, multi-class pest detection, which involves both localization and classification, is exceptionally challenging due to the small size of pests, their high inter-class similarity, and the variability of field environments. This paper presents an enhanced version of our previous work, Pest-YOLO, aimed at improving accuracy while maintaining real-time detection speed. The improved Pest-YOLO incorporates two key advancements: an efficient channel attention (ECA) mechanism for improved feature extraction and a transformer encoder for capturing global features. We replace the original squeeze-and-excitation (SE) attention mechanism with the ECA mechanism, effectively improving the model's ability to extract essential features from pest images. Additionally, we introduce the transformer encoder into the convolutional neural network (CNN) architecture to enhance its capability to capture global contextual information. To further enhance the expressiveness of features for small targets such as agricultural pests, we propose a feature fusion method called cross-stage feature fusion (CSFF), which significantly improves the representation of small targets during the feature fusion stage. In experiments on the Pest24 dataset, our method achieves a mean average precision of 73.4%, surpassing state-of-the-art methods. These results demonstrate the effectiveness of the improved Pest-YOLO model for accurate pest detection in real-time scenarios.
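The abstract does not include implementation details, but the ECA mechanism it references is a published design (global average pooling followed by a shared 1D convolution across channels and a sigmoid gate, in place of SE's two fully connected layers). The following is a minimal NumPy sketch of that idea, not the authors' code; the function names and the single-image `(C, H, W)` layout are illustrative assumptions.

```python
import numpy as np

def eca_kernel_size(channels, gamma=2, b=1):
    # Adaptive kernel size from ECA-Net: k = |(log2(C) + b) / gamma|, forced odd
    t = int(abs((np.log2(channels) + b) / gamma))
    return t if t % 2 == 1 else t + 1

def eca(x, conv_weight):
    """Efficient channel attention on a feature map x of shape (C, H, W).

    conv_weight: 1D kernel of odd length k, shared across all channels
    (illustrative stand-in for a learned Conv1d weight).
    """
    c, h, w = x.shape
    # Squeeze: global average pooling over the spatial dimensions -> (C,)
    y = x.mean(axis=(1, 2))
    # Excite: 1D convolution along the channel axis captures local
    # cross-channel interaction without SE's dimensionality reduction
    k = conv_weight.shape[0]
    pad = k // 2
    y_padded = np.pad(y, pad, mode="edge")
    attn = np.array([np.dot(y_padded[i:i + k], conv_weight) for i in range(c)])
    # Sigmoid gate, then rescale each channel of the input
    attn = 1.0 / (1.0 + np.exp(-attn))
    return x * attn[:, None, None]
```

With zero convolution weights the sigmoid outputs 0.5 for every channel, so the block simply halves the input, which is a convenient sanity check; in practice `conv_weight` would be learned end-to-end inside the detector's backbone.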