Automatic poetry generation is a representative showcase of artificial intelligence creativity, and cross-modal generation methods point to a promising direction for further progress. Although previous methods have made some progress, they still face three challenges: (1) the lack of annotated multimodal Chinese poetry datasets; (2) insufficient diversity in the generated poetry; and (3) inadequate semantic consistency between images and poems. In this paper, we propose a novel Unsupervised Image-to-Poetry model (UI2P) with a newly designed generative adversarial network to address these issues. Specifically, the unsupervised learning framework eliminates the dependence on annotated multimodal poetry datasets; a contrastive learning approach improves the diversity of the generated poems; and a consistency strategy, including a modern-classical concept dictionary, ensures semantic coherence between poems and images. Extensive experiments on the CCPC dataset, with both automatic and manual evaluations, demonstrate the superiority of our model over several state-of-the-art baselines.