Food image generation plays a crucial role in evaluating food ingredients, predicting dietary preferences, recommending food, and computing dietary nutrition. However, the task is challenging due to the large variation in the appearance of recipe components, the difficulty of aligning multi-modal features, and the limited diversity of generated data. To address these challenges, we propose a novel RecipeCLIP-Diffusion Food Generation Model (RD-FGM) that generates high-quality, diverse food images while aligning multi-modal features. Specifically, the RecipeCLIP model learns a multi-ingredient embedding of image-text pairs to align contextual features. In addition, we devise a multi-conditional guided diffusion model that learns the data distribution and provides control over generation. We evaluate RD-FGM on the large-scale Recipe1M dataset and the Chinese VIREO Food-172 dataset. Extensive experiments, including quantitative analysis, qualitative analysis, and ablation studies, demonstrate the effectiveness of RD-FGM. We further conduct model-migration experiments to evaluate its scalability to other downstream tasks such as ingredient classification. The ability to generate realistic food images from textual recipes opens new avenues for exploring culinary creations and for food and ingredient classification, with promising applications in the food industry and beyond.