The emerging research line of cross-modal learning addresses the problem of transferring feature representations learned from limited labeled multimodal data to a testing phase in which only partial modalities are available. This setting is common and practically relevant in the remote sensing community, where users often hold modality-incomplete data due to unavoidable imaging or access restrictions in large-scale observation scenarios. However, most existing cross-modal learning methods rely exclusively on labels, which can be scarce or noisy because they are costly to produce. To address this issue, we explore in this paper the possibility of learning cross-modal feature representations in an unsupervised fashion. By integrating the multimodal data into a fully recombined matrix form, we propose 1) using a common subspace representation as the regression target instead of the conventionally adopted binary labels, and 2) orthogonality and manifold alignment regularization terms that shrink the solution space while preserving pairwise manifold correlations. In this way, the modality-specific and mutual latent representations in the common subspace, as well as their corresponding projections, can be learned simultaneously, and their optima can be reached efficiently in a nearly one-step computation via eigendecomposition. Finally, we demonstrate the superiority of our method through extensive image classification experiments on three multimodal datasets covering four remotely sensed modalities (i.e., hyperspectral, multispectral, synthetic aperture radar, and light detection and ranging data). The code and dataset will be made freely available at https://github.com/jingyao16/UCSL upon possible publication to encourage reproduction and further use of our method.
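The following is a minimal, illustrative sketch of the general idea described above: projecting paired modalities into a shared subspace with a graph-Laplacian manifold alignment term and an orthogonality-style scale constraint, solved in a single shot by a generalized eigendecomposition. It is not the paper's exact UCSL formulation; all symbols and parameters (X1, X2, lam, dim, the k-NN graph) are assumptions made for illustration.

```python
# Illustrative sketch (not the exact UCSL objective): unsupervised learning of
# linear projections of two modalities into a common subspace, combining a
# manifold-alignment (graph Laplacian) regularizer with an orthogonality-style
# constraint, solved via one generalized eigendecomposition.
import numpy as np
from scipy.linalg import eigh
from scipy.spatial.distance import cdist


def knn_laplacian(X, k=10):
    """Unnormalized graph Laplacian of a k-NN similarity graph over samples."""
    d = cdist(X, X)
    W = np.zeros_like(d)
    idx = np.argsort(d, axis=1)[:, 1:k + 1]      # k nearest neighbors per sample (skip self)
    rows = np.repeat(np.arange(X.shape[0]), k)
    W[rows, idx.ravel()] = 1.0
    W = np.maximum(W, W.T)                        # symmetrize the adjacency
    return np.diag(W.sum(axis=1)) - W


def unsupervised_subspace(X1, X2, dim=16, lam=1.0):
    """X1: (n, d1), X2: (n, d2) paired samples; returns per-modality projections W1, W2."""
    X = np.hstack([X1, X2])                       # recombined multimodal matrix, shape (n, d1 + d2)
    L = knn_laplacian(X1) + knn_laplacian(X2)     # manifold-alignment regularizer from both modalities
    # Minimize tr(W^T X^T L X W) s.t. W^T X^T X W = I  ->  generalized eigenproblem A w = lambda B w
    A = X.T @ L @ X + lam * np.eye(X.shape[1])    # smoothness term plus a small ridge
    B = X.T @ X + 1e-6 * np.eye(X.shape[1])       # scale / orthogonality constraint matrix
    vals, vecs = eigh(A, B)                       # eigenvalues returned in ascending order
    W = vecs[:, :dim]                             # smallest eigenvectors span the common subspace
    return W[:X1.shape[1]], W[X1.shape[1]:]       # split rows back into modality-specific projections


# Usage: Z1 = X1 @ W1 and Z2 = X2 @ W2 live in the shared subspace, so a model
# trained on one modality's embedding can be applied when only the other is available.
```

The closed-form eigendecomposition is what makes such subspace methods a "nearly one-step" computation, in contrast to iterative deep cross-modal training.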