Infrared and visible image fusion (IVIF) integrates the complementary features of images from different modalities, and the fused image provides a more comprehensive and objective interpretation of the scene than either source image, which has attracted extensive attention in computer vision in recent years. However, current fusion methods usually focus on extracting prominent features and fail to adequately preserve fine and small-scale structures. To address this problem, we propose ESFuse, an end-to-end unsupervised IVIF method that effectively enhances fine edges and small structures. In particular, we introduce a two-branch head interpreter to extract features from source images of different modalities. These features are then fed into an edge refinement module with a detail injection module (DIM) to obtain edge detection results for the source images, improving the network's ability to capture and retain fine details as well as global information. Finally, a multiscale feature reconstruction module combines the output of the DIM with that of the head interpreter to produce the final fusion result. Extensive experiments on publicly available datasets show that the proposed ESFuse outperforms state-of-the-art (SOTA) methods in both subjective visual quality and objective evaluation, and our fusion results also perform well on semantic segmentation, target detection, pose estimation, and depth estimation tasks. The source code is publicly available.
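The pipeline described above (two-branch head interpreter, detail injection, then reconstruction) can be sketched as follows. This is a minimal illustrative sketch, not the authors' implementation: all module internals, channel counts, and the single-channel input assumption are placeholders of our own choosing.

```python
# Hypothetical sketch of the ESFuse data flow: two-branch head interpreter
# -> detail injection module (DIM) -> feature reconstruction. All layer
# choices here are assumptions, not the paper's actual architecture.
import torch
import torch.nn as nn

class HeadInterpreter(nn.Module):
    """One branch per modality; extracts shallow features from each source."""
    def __init__(self, ch=16):
        super().__init__()
        self.ir = nn.Sequential(nn.Conv2d(1, ch, 3, padding=1), nn.ReLU())
        self.vis = nn.Sequential(nn.Conv2d(1, ch, 3, padding=1), nn.ReLU())

    def forward(self, ir, vis):
        return self.ir(ir), self.vis(vis)

class DetailInjection(nn.Module):
    """Fuses detail/edge cues from both branches (placeholder logic)."""
    def __init__(self, ch=16):
        super().__init__()
        self.fuse = nn.Conv2d(2 * ch, ch, 3, padding=1)

    def forward(self, f_ir, f_vis):
        return torch.relu(self.fuse(torch.cat([f_ir, f_vis], dim=1)))

class Reconstructor(nn.Module):
    """Combines DIM output with head features to produce the fused image."""
    def __init__(self, ch=16):
        super().__init__()
        self.out = nn.Conv2d(3 * ch, 1, 3, padding=1)

    def forward(self, detail, f_ir, f_vis):
        return torch.sigmoid(self.out(torch.cat([detail, f_ir, f_vis], dim=1)))

# Forward pass on dummy single-channel inputs.
ir = torch.rand(1, 1, 64, 64)
vis = torch.rand(1, 1, 64, 64)
head, dim_mod, rec = HeadInterpreter(), DetailInjection(), Reconstructor()
f_ir, f_vis = head(ir, vis)
fused = rec(dim_mod(f_ir, f_vis), f_ir, f_vis)
print(fused.shape)  # torch.Size([1, 1, 64, 64])
```

The fused output keeps the spatial size of the inputs; in the actual method, the reconstruction stage is multiscale and the DIM is trained with edge supervision.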