Improving the accuracy of endoscopic image classification helps endoscopists diagnose disease and choose suitable treatment for patients. Existing CNN-based methods for endoscopic image classification tend to use only the deepest, most abstract features and ignore the contribution of low-level features, even though low-level features are of great significance in the actual diagnosis of intestinal diseases. To make full use of both high-level and low-level features, we propose a novel two-stream network for endoscopic image classification. Specifically, the backbone stream extracts high-level features. In the fusion stream, low-level features are generated by a bottom-up multi-scale gradual integration (BMGI) method, and the input of BMGI is refined by top-down attention learning modules. In addition, a novel correction loss is proposed to clarify the relationship between high-level and low-level features. Experiments on the KVASIR dataset demonstrate that the proposed framework obtains an overall classification accuracy of 97.33% with a Kappa coefficient of 95.25%. Compared with existing models, these two metrics improve by at least 2% and 2.25%, respectively. In this study, we proposed a two-stream network that fuses high-level and low-level features for endoscopic image classification. The experimental results show that the fused high-to-low-level features represent endoscopic images better and enable our model to outperform several state-of-the-art classification approaches. Moreover, the proposed correction loss can regularize the consistency between the backbone stream and the fusion stream. As a result, the fused features reduce intra-class distances and support accurate label prediction.
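The two reported metrics, overall accuracy and the Kappa coefficient, can both be derived from a confusion matrix. The following is a minimal sketch, assuming the Kappa coefficient refers to Cohen's kappa (the abstract does not state which variant is used). The function name `accuracy_and_kappa` and the toy two-class matrix are illustrative only and are not taken from the study.

```python
def accuracy_and_kappa(confusion):
    """Return (overall accuracy, Cohen's kappa) for a square confusion matrix.

    confusion[i][j] is the number of samples whose true class is i
    and whose predicted class is j.
    """
    n_classes = len(confusion)
    total = sum(sum(row) for row in confusion)
    if total == 0:
        raise ValueError("confusion matrix contains no samples")

    # Observed agreement: fraction of samples on the diagonal.
    p_o = sum(confusion[i][i] for i in range(n_classes)) / total

    # Chance agreement: product of row and column marginals, summed over classes.
    row_totals = [sum(row) for row in confusion]
    col_totals = [sum(confusion[i][j] for i in range(n_classes))
                  for j in range(n_classes)]
    p_e = sum(r * c for r, c in zip(row_totals, col_totals)) / (total * total)

    if p_e == 1.0:
        raise ValueError("kappa is undefined when chance agreement equals 1")

    kappa = (p_o - p_e) / (1.0 - p_e)
    return p_o, kappa


# Hypothetical toy data (not results from the paper).
toy_confusion = [[45, 5],
                 [10, 40]]
acc, kappa = accuracy_and_kappa(toy_confusion)
print(f"accuracy={acc:.4f}, kappa={kappa:.4f}")
```

Because kappa subtracts the agreement expected by chance, it is always at most the overall accuracy, which is consistent with the Kappa coefficient being reported below the accuracy figure.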