Accurate semantic segmentation of traffic scenes relies on multi-spectral image fusion to obtain high-quality images. Many existing fusion methods aim to maximize the similarity between the inputs and the fusion result in terms of pixel intensity and texture detail. However, this can introduce smoothness artifacts that limit semantic segmentation performance. To address these issues, we present a smooth representation learning optimization mechanism (SFLM) that performs image fusion at two levels: inter-image and intra-image. The former mitigates over- and under-smoothing by maximizing the mutual information between the fusion result and image samples (i.e., positive and negative samples). The latter balances under- and over-smoothing in the fusion result by minimizing total variation in pixel space while maximizing total variation in gradient space, based on contrastive learning. In this way, the proposed method effectively overcomes fusion-quality issues and provides better feature representations for semantic segmentation in autonomous vehicles. Experimental results on four public datasets validate our method’s effectiveness, robustness, and overall superiority.
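The intra-image objective described above can be illustrated with a minimal sketch. The abstract does not give the exact formulation, so the function names (`total_variation`, `smoothness_loss`), the anisotropic TV definition, and the weighting `lam` are all illustrative assumptions, not the authors' implementation:

```python
import numpy as np

def total_variation(img: np.ndarray) -> float:
    """Anisotropic total variation: sum of absolute differences
    between vertically and horizontally adjacent pixels."""
    dh = np.abs(np.diff(img, axis=0)).sum()
    dw = np.abs(np.diff(img, axis=1)).sum()
    return float(dh + dw)

def smoothness_loss(fused: np.ndarray, lam: float = 1.0) -> float:
    """Hypothetical intra-image objective per the abstract: minimize
    TV in pixel space (discourages noisy, over-textured output) while
    maximizing TV in gradient space (preserves sharp edges)."""
    tv_pixel = total_variation(fused)
    # Gradient-magnitude map via first-order finite differences.
    gy, gx = np.gradient(fused)
    grad_mag = np.hypot(gx, gy)
    tv_grad = total_variation(grad_mag)
    # Lower is "better" under this sketch: pixel TV penalized,
    # gradient-space TV rewarded.
    return tv_pixel - lam * tv_grad
```

A flat image scores zero on both terms, while a sharp step edge yields nonzero pixel-space TV; in a full training loop such a scalar would be one term of the fusion loss, balanced against the inter-image mutual-information term.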