The fusion of infrared and visible images aims to synthesize a single image that combines the advantageous characteristics of both sources. Nevertheless, existing image fusion algorithms often struggle to extract relevant feature information from the source images and to balance the significance of the different modality information they contain, which leads to suboptimal fusion results that inadequately serve advanced downstream visual tasks. To address this challenge, a novel image fusion algorithm, the Multiple Information Supervised Progressive Fusion Network (MISP-Fuse), is proposed. Specifically, MISP-Fuse employs an innovative and robust multi-stage encoder-decoder network, the Full Scale Feature Residual network (FSFR), to extract spatial context and detailed feature information from the infrared and visible images. Within this encoder-decoder network, source image features are progressively extracted at different stages, recombined at multiple scales, and distilled for crucial information. Finally, the fused image is generated through a spatial localization module, the Spatial Localization Network (SLNet). MISP-Fuse incorporates a multi-information supervision mechanism that links the different modality information in the infrared and visible source images, ensuring that the resulting fused image not only aligns with human visual perception but also effectively serves advanced downstream visual tasks. Comparative experiments were conducted on diverse image fusion benchmark datasets. Compared with other algorithms, MISP-Fuse demonstrates significant improvements in comprehensive image fusion metrics, including Average Gradient (AG), Sum of Correlated Differences (SCD), and Correlation Coefficient (CC).
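
As a concrete reference for the evaluation metrics named above (this sketch is not part of the paper itself), the following NumPy code implements AG, CC, and SCD as they are commonly defined in the image fusion literature; the function names and minor normalization details are illustrative assumptions rather than the authors' exact implementation.

```python
import numpy as np

def average_gradient(img):
    """Average Gradient (AG): mean magnitude of local intensity
    gradients; higher values indicate richer detail and texture."""
    img = img.astype(np.float64)
    gx = np.diff(img, axis=1)[:-1, :]  # horizontal differences, cropped to common shape
    gy = np.diff(img, axis=0)[:, :-1]  # vertical differences, cropped to common shape
    return float(np.mean(np.sqrt((gx ** 2 + gy ** 2) / 2.0)))

def pearson(a, b):
    """Pearson correlation coefficient between two images."""
    a = a.ravel().astype(np.float64) - a.mean()
    b = b.ravel().astype(np.float64) - b.mean()
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

def cc(fused, ir, vis):
    """Correlation Coefficient (CC): mean correlation between the
    fused image and each of the two source images."""
    return 0.5 * (pearson(fused, ir) + pearson(fused, vis))

def scd(fused, ir, vis):
    """Sum of Correlated Differences (SCD): correlation between the
    (fused minus one source) difference and the other source, summed
    over both pairings."""
    f, a, b = (x.astype(np.float64) for x in (fused, ir, vis))
    return pearson(f - b, a) + pearson(f - a, b)
```

For all three metrics, a higher score indicates a better fusion result: AG rewards preserved gradient detail, while CC and SCD reward consistency between the fused image and the complementary information in the two source modalities.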