Infrared and visible image fusion has drawn increasing attention from researchers in recent years; it extracts the complementary information between two source images to synthesize a new fused image with richer content. Deep neural networks are the latest technique for tackling various image fusion problems. However, the natural absence of ground truth makes it very challenging to optimize deep learning models. Current methods mainly use the two source images themselves or their visual features to provide supervision for learning, which easily results in an imbalance between detail preservation and brightness distribution. To boost the performance of fusion models, it is meaningful and necessary to establish a flexible framework that can combine the advantages of existing models to produce reliable supervision for model training. In this work, we propose a novel reference-then-supervision framework, which aims to fully exploit the favorable reference information available from existing methods and then construct high-quality, reliable supervision to assist in model building. To this end, we design an automatic filter to produce favorable references and devise an adaptive enhancement method to construct reliable supervision, which aggregates the advantages of various existing fusion methods to yield visually pleasing results that adapt to different complex scenarios. Extensive experiments on two commonly used datasets and our newly built challenging test set demonstrate that our framework can greatly improve the performance of existing fusion methods. An ablation study and empirical analysis also confirm the efficacy of our framework design. Furthermore, applications to downstream pedestrian detection and object tracking tasks indicate the great potential of our framework. Our code and data are publicly available at https://github.com/zhenglab/ReferenceSupervisionIVIF.