Predicting air quality from multimodal data is crucial for comprehensively capturing the diverse factors that influence atmospheric conditions. This study therefore introduces a multimodal learning framework that integrates outdoor images with traditional ground-based observations to improve the accuracy and reliability of air quality predictions. However, aligning and fusing these heterogeneous data sources poses a formidable challenge, further exacerbated by pervasive data incompleteness in practice. In this paper, we propose a novel incomplete multimodal learning approach (iMMAir) that recovers missing data for robust air quality prediction. Specifically, we first design a shallow feature extractor to capture modality-specific features in the latent representation space. We then develop a conditional diffusion-driven recovery module to mitigate the distribution gap between the recovered and true data. This module further incorporates two conditional constraints, temporal correlation and semantic consistency, for effective modality completion. Finally, we reconstruct the incomplete modalities and fuse the available data with a multimodal transformer network to predict air quality. To alleviate the modality imbalance problem, we employ an adaptive gradient modulation strategy that adjusts the optimization of each modality. Experimental results demonstrate that iMMAir significantly reduces prediction errors, outperforming baseline models by an average of 5.6% and 2.5% on air quality regression and classification tasks, respectively. Our source code and data are available at https://github.com/pestasu/IMMAir.
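To make the diffusion-driven recovery idea concrete, the sketch below shows a generic conditional denoising setup in PyTorch: a small network predicts the noise added to a missing-modality embedding, conditioned on the embedding of an observed modality. This is a minimal illustration under assumed shapes and hyperparameters, not the authors' implementation; all names (`ConditionalDenoiser`, `diffusion_training_step`) and the linear beta schedule are assumptions for exposition.

```python
import torch
import torch.nn as nn

class ConditionalDenoiser(nn.Module):
    """Hypothetical denoiser: predicts the noise injected into a
    missing-modality embedding, conditioned on an observed modality."""
    def __init__(self, dim=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim * 2 + 1, 256), nn.SiLU(),
            nn.Linear(256, 256), nn.SiLU(),
            nn.Linear(256, dim),
        )

    def forward(self, x_t, t, cond):
        # x_t: noisy target embedding, t: normalized timestep in [0, 1],
        # cond: embedding of the observed (conditioning) modality
        return self.net(torch.cat([x_t, cond, t[:, None]], dim=-1))

def diffusion_training_step(model, x0, cond, n_steps=1000):
    """One DDPM-style training step (assumed setup): sample a timestep,
    corrupt the clean embedding x0 with Gaussian noise under a linear
    beta schedule, and regress the noise with an MSE objective."""
    betas = torch.linspace(1e-4, 0.02, n_steps)
    alpha_bar = torch.cumprod(1.0 - betas, dim=0)
    t = torch.randint(0, n_steps, (x0.size(0),))
    a = alpha_bar[t][:, None]
    noise = torch.randn_like(x0)
    x_t = a.sqrt() * x0 + (1 - a).sqrt() * noise
    pred = model(x_t, t.float() / n_steps, cond)
    return nn.functional.mse_loss(pred, noise)

# Example usage with dummy latents (shapes are assumptions):
model = ConditionalDenoiser()
x_img = torch.randn(32, 128)  # latent of the missing image modality
x_obs = torch.randn(32, 128)  # latent of the observed ground-based data
loss = diffusion_training_step(model, x_img, x_obs)
loss.backward()
```

In iMMAir the conditioning further enforces temporal correlation and semantic consistency; here a single observed-modality embedding stands in for those constraints to keep the sketch self-contained.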