Multimodal fusion is essential for robots to fully perceive their external environment, as single-modal information limits their ability to recognize and grasp objects. Moreover, traditional cross-modal data generation methods produce low-quality images, which degrades multimodal fusion. To address the poor quality of cross-modally generated images and the shortage of data for multimodal fusion, this study proposes a variational Bayesian Gaussian mixture conditional generative adversarial network (BGM-CGAN) for generating diverse cross-modal noise data. First, the variational Bayesian Gaussian mixture algorithm transforms a uniformly distributed random noise group into a single mixed variable. This mixed variable is then passed through a Gaussian mixture model to produce a series of Gaussian mixed noise groups. A single Gaussian noise is randomly selected from these groups, imported into a modal image, and fused with it to generate a high-resolution heterogeneous modal image. The method recovers heterogeneous modal information and addresses the insufficient information and poor quality of images generated from a single modality. Finally, we use several evaluation metrics (IS, FID, SSIM, and PSNR) to compare the cross-modal image generation capability of the proposed BGM-CGAN with that of other algorithms; the results demonstrate its effectiveness and feasibility. Furthermore, BGM-CGAN has broad application prospects and can be extended to cross-modal material retrieval, cross-modal texture recognition, and other fields.
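The noise pipeline summarized above (mixture weights with a Dirichlet-style prior, Gaussian mixed noise groups, random selection of a single Gaussian component, and fusion with a conditional embedding) might be sketched as follows. This is a minimal NumPy illustration, not the paper's implementation; all dimensions, priors, and the conditional embedding are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions (not specified in the abstract)
K, D = 5, 100  # number of mixture components, noise dimension

# Variational-Bayesian-flavoured mixture: weights drawn from a Dirichlet
# prior, component means initialised from uniform random noise.
weights = rng.dirichlet(np.ones(K))
means = rng.uniform(-1.0, 1.0, size=(K, D))
stds = rng.uniform(0.5, 1.0, size=(K, D))

def sample_mixed_noise(n):
    """Draw n noise vectors: pick one Gaussian component per sample
    according to the mixture weights, then sample from that component."""
    comps = rng.choice(K, size=n, p=weights)
    return means[comps] + stds[comps] * rng.standard_normal((n, D))

z = sample_mixed_noise(8)              # Gaussian mixed noise batch
cond = rng.standard_normal((8, 32))    # stand-in embedding of the modal image
gen_input = np.concatenate([z, cond], axis=1)  # fused input to the generator
```

In a full BGM-CGAN, `gen_input` would feed a conditional generator network whose output is the heterogeneous modal image; here the fusion is simply shown as concatenation of the selected noise with a conditioning vector.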