Fusing cross-modal features is significant for image understanding, which aims to describe objects inside an image by optimally combining multiple visual channels. In the literature, fusion methods built on low-level multimodal features have achieved impressive performance. However, they are limited by the semantic gap, i.e., they cannot reflect how humans perceive semantic objects in an image. Supervised learning-based methods, in turn, require prohibitively expensive manual labeling, which is impractical. To alleviate these limitations, we present an image understanding method that learns a weakly-supervised cross-modal semantic translation. More specifically, we design a manifold embedding algorithm to automatically translate image-level textual semantic labels into pixel-level image regions. Subsequently, we leverage a three-level spatial pyramid model to extract both local and global object features from the training images. These cross-modal features are then seamlessly concatenated into a multi-feature matrix, which is used to learn a kernel SVM for image classification and a ranking SVM for image retrieval. Comprehensive experiments on image recognition, classification, and retrieval demonstrate the effectiveness of our method.
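The following is a minimal sketch of the three-level spatial pyramid pooling and cross-modal concatenation steps described above, assuming each visual channel yields a per-pixel feature map; the function names, the uniform 1x1/2x2/4x4 grid, and the RBF-kernel classifier are illustrative assumptions rather than the paper's exact implementation.

```python
import numpy as np
from sklearn.svm import SVC

def spatial_pyramid_features(feature_map, levels=3):
    """Pool an (H, W, D) per-pixel feature map over a 3-level spatial
    pyramid (1x1, 2x2, 4x4 grids) and concatenate the cell means."""
    h, w, d = feature_map.shape
    pooled = []
    for level in range(levels):
        cells = 2 ** level                      # 1, 2, 4 cells per side
        for i in range(cells):
            for j in range(cells):
                block = feature_map[i * h // cells:(i + 1) * h // cells,
                                    j * w // cells:(j + 1) * w // cells]
                pooled.append(block.mean(axis=(0, 1)))
    return np.concatenate(pooled)               # (1 + 4 + 16) * D dims

def build_feature_matrix(images, channel_extractors):
    """Concatenate the pyramid descriptors of every visual channel
    into one row per image, forming the multi-feature matrix."""
    rows = []
    for img in images:
        per_channel = [spatial_pyramid_features(extract(img))
                       for extract in channel_extractors]
        rows.append(np.concatenate(per_channel))
    return np.vstack(rows)

# Hypothetical usage with assumed channel extractors and labels:
# X = build_feature_matrix(train_images, [color_map, texture_map, shape_map])
# clf = SVC(kernel="rbf", C=1.0).fit(X, train_labels)   # classification branch
```

A ranking SVM trained on the same feature matrix would serve the retrieval branch; it is omitted here since scikit-learn does not provide one out of the box.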