While cross-media data, such as text, images, audio, video, and 3D models, has become the dominant form of big data, research on cross-media retrieval remains limited. In this paper, we focus on learning a common representation of heterogeneous data, which is a key challenge for cross-media retrieval. Most existing approaches linearly project the original low-level features into a joint feature space to obtain an isomorphic data representation. However, linear projections cannot capture the complex, highly nonlinear correlations across modalities. We therefore propose a novel feature learning algorithm, semi-supervised cross-modal vector-valued manifold regularization (SCVM), to learn a common representation of heterogeneous data. SCVM jointly exploits low-level feature correlations and semantic information in a unified framework. Building on manifold regularization, it learns cross-media features in a vector-valued reproducing kernel Hilbert space (RKHS) via kernel transformations on both labeled and unlabeled samples, and imposes smoothness constraints on candidate solutions to improve retrieval accuracy. Comprehensive experiments on two public datasets show that SCVM outperforms current state-of-the-art approaches. Moreover, the method remains robust and stable when extended from two media types to five, which is attractive in practical applications.
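The abstract only names the ingredients of SCVM; the full formulation is in the paper. As an illustrative sketch of the semi-supervised manifold-regularization machinery it builds on, the snippet below implements classic scalar-valued Laplacian regularized least squares (LapRLS): a kernel expansion fit on labeled samples, with a graph-Laplacian smoothness penalty over both labeled and unlabeled samples. All function names and hyperparameter values here are our own assumptions for illustration, not SCVM's actual vector-valued RKHS construction.

```python
import numpy as np

def rbf_kernel(X, sigma=1.0):
    # Gaussian (RBF) kernel matrix from pairwise squared distances.
    sq = np.sum(X**2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2 * X @ X.T
    return np.exp(-d2 / (2 * sigma**2))

def graph_laplacian(K, k=5):
    # kNN affinity graph from kernel similarities; unnormalized L = D - W.
    n = K.shape[0]
    W = np.zeros((n, n))
    for i in range(n):
        idx = np.argsort(-K[i])[1:k + 1]  # k nearest neighbours, skipping self
        W[i, idx] = K[i, idx]
    W = np.maximum(W, W.T)                # symmetrize the graph
    return np.diag(W.sum(axis=1)) - W

def lap_rls(X, Y_labeled, n_labeled, gamma_a=1e-2, gamma_i=1e-1, sigma=1.0):
    # Semi-supervised LapRLS (Belkin et al.): solve for alpha in
    #   (J K + gamma_a * l * I + gamma_i * l/(l+u)^2 * L K) alpha = J Y,
    # where J masks the loss to the first l (labeled) samples and L is the
    # graph Laplacian over all l+u samples (the unlabeled smoothness term).
    n, l = X.shape[0], n_labeled
    c = Y_labeled.shape[1]
    K = rbf_kernel(X, sigma)
    L = graph_laplacian(K)
    J = np.zeros((n, n)); J[:l, :l] = np.eye(l)
    Y = np.zeros((n, c)); Y[:l] = Y_labeled
    A = J @ K + gamma_a * l * np.eye(n) + gamma_i * (l / n**2) * (L @ K)
    alpha = np.linalg.solve(A, Y)
    return K @ alpha  # label-space representation for all l+u samples
```

In a cross-media setting, a representation of this form would be computed per modality and matched in the shared label space; SCVM instead couples the modalities through a single vector-valued RKHS, which this scalar sketch does not capture.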