Hashing technology improves the search efficiency and reduces the storage space of data. However, building an effective modal with unsupervised cross modal retrieval and generating efficient binary code is still a challenging task, considering of some issues needed to be further discussed and researched for unsupervised multimodal hashing. Most of the existing methods ignore the discrete restriction, and manually or experientially determine the weights of each modality. These limitations may significantly reduce the retrieval accuracy of unsupervised cross-modal hashing methods. To solve these problems, we propose a robust hash modal that can efficiently learn binary code by employing a flexible and noise-resistant ℓ2,1-loss with nonlinear kernel embedding. In addition, we introduce an intermediate state mapping that facilitate later modal optimization to measure the loss between the hash codes and the intermediate states. Experiments on several public multimedia retrieval datasets validate the superiority of the proposed method from various aspects.