In low-light conditions, visible (VIS) images suffer from a low dynamic range (low contrast), severe noise, and distorted color, whereas near-infrared (NIR) images contain clear textures free of noise but lack color. Multispectral fusion of VIS and NIR images exploits the complementary advantages of the two modalities to produce color images of high quality with rich textures and little noise. In this article, we propose deep selective fusion of VIS and NIR images using an unsupervised U-Net. Existing image fusion methods are hampered by the low contrast of VIS images and the flash-like effect in NIR images; we therefore adopt an unsupervised U-Net to selectively fuse features at multiple scales. Because no ground truth is available, we train the network in an unsupervised manner by formulating an energy function as the loss function. To cope with insufficient training data, we augment the data by rotating images and adjusting their intensity, and we synthesize training pairs by degrading clean VIS images and masking clean NIR images with a circular mask. First, we extract features from VIS images with a pretrained visual geometry group (VGG) network. Second, we build an encoding network to obtain edge information from NIR images. Finally, we combine all features and feed them into a decoding network for fusion. Experimental results demonstrate that the proposed fusion network produces visually pleasing results with fine details, little noise, and natural color, and that it outperforms state-of-the-art methods in both visual quality and quantitative measurements.
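To make the three-stage pipeline concrete (VGG features from VIS, an encoding network for NIR edges, a decoding network for fusion), the following is a minimal PyTorch sketch. The abstract does not specify layer counts, channel widths, the VGG variant, or the decoder layout, so all of those, along with the module names (`VGGFeatures`, `NIREncoder`, `FusionNet`), are illustrative assumptions rather than the authors' actual architecture; the energy-function loss is likewise omitted since its form is not given here.

```python
# Minimal sketch of the described fusion pipeline, under assumed architectural choices.
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision.models import vgg16


class VGGFeatures(nn.Module):
    """Extracts multiscale features from the VIS image with a frozen, pretrained VGG-16."""
    def __init__(self):
        super().__init__()
        features = vgg16(weights="IMAGENET1K_V1").features
        # Slice at relu1_2, relu2_2, relu3_3 (indices follow torchvision's vgg16 layout).
        self.stages = nn.ModuleList([features[:4], features[4:9], features[9:16]])
        for p in self.parameters():
            p.requires_grad = False  # VGG is used as a fixed feature extractor

    def forward(self, x):
        feats = []
        for stage in self.stages:
            x = stage(x)
            feats.append(x)
        return feats  # 64, 128, 256 channels at full, 1/2, and 1/4 scale


class NIREncoder(nn.Module):
    """Small convolutional encoder capturing edge/texture cues from the (grayscale) NIR image."""
    def __init__(self):
        super().__init__()
        def block(cin, cout, stride):
            return nn.Sequential(nn.Conv2d(cin, cout, 3, stride, 1), nn.ReLU(inplace=True))
        self.enc1 = block(1, 64, 1)     # full scale
        self.enc2 = block(64, 128, 2)   # 1/2 scale
        self.enc3 = block(128, 256, 2)  # 1/4 scale

    def forward(self, x):
        f1 = self.enc1(x)
        f2 = self.enc2(f1)
        f3 = self.enc3(f2)
        return [f1, f2, f3]


class FusionNet(nn.Module):
    """U-Net-style decoder that fuses VIS (VGG) and NIR (encoder) features at each scale."""
    def __init__(self):
        super().__init__()
        self.vis = VGGFeatures()
        self.nir = NIREncoder()
        self.dec3 = nn.Sequential(nn.Conv2d(512, 128, 3, 1, 1), nn.ReLU(inplace=True))
        self.dec2 = nn.Sequential(nn.Conv2d(128 + 256, 64, 3, 1, 1), nn.ReLU(inplace=True))
        self.dec1 = nn.Sequential(nn.Conv2d(64 + 128, 64, 3, 1, 1), nn.ReLU(inplace=True),
                                  nn.Conv2d(64, 3, 3, 1, 1), nn.Sigmoid())

    def forward(self, vis, nir):
        v1, v2, v3 = self.vis(vis)
        n1, n2, n3 = self.nir(nir)
        x = self.dec3(torch.cat([v3, n3], dim=1))          # fuse at 1/4 scale
        x = F.interpolate(x, scale_factor=2, mode="bilinear", align_corners=False)
        x = self.dec2(torch.cat([x, v2, n2], dim=1))       # fuse at 1/2 scale
        x = F.interpolate(x, scale_factor=2, mode="bilinear", align_corners=False)
        x = self.dec1(torch.cat([x, v1, n1], dim=1))       # fuse at full scale
        return x                                           # fused RGB image in [0, 1]


# Usage: a 3-channel low-light VIS image and a 1-channel NIR image of the same size.
net = FusionNet()
vis = torch.rand(1, 3, 256, 256)
nir = torch.rand(1, 1, 256, 256)
fused = net(vis, nir)  # shape: (1, 3, 256, 256)
```

In this sketch the skip connections concatenate VIS and NIR features at matching resolutions before each decoding step, which is one natural reading of "combine all features and feed them into a decoding network"; in the unsupervised setting, training would minimize the paper's energy function over such outputs rather than a supervised reconstruction loss.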