Modeling multiple views by unsupervised representation learning is a challenging problem, since features from different views are heterogeneous and it is unclear how to perform representation learning that fully exploits the discriminative information across views. Recent works focus on learning consistent information by contrasting high-dimensional features, but much useful view-consistent information carried by the low-dimensional representations is ignored, even though such information can benefit the learned representation on downstream tasks. In this paper, we propose a novel method called Regularized and Hybrid Multiview Coding (RHMC) for comprehensively modeling the consistent information between multiple views. We improve the discriminativeness of the learned multi-view representations by constraining them in the latent space with diversified self-supervised learning tasks. Specifically, RHMC introduces a hybrid mutual information (MI) estimation that jointly considers the MI between the aggregated high-dimensional feature and the low-dimensional representations of the views. This enables the learned representation to capture the consistent information hidden in the low-dimensional representations under the guidance of the aggregated high-dimensional feature. In the latent space, multi-view domain shift degrades the performance of MI estimation. To tackle this issue, RHMC imposes a globally aligned structure on the learned representations by aligning the probability distributions of different views with a Wasserstein distance-based view alignment regularization. Empirically, RHMC outperforms state-of-the-art self-supervised methods by a significant margin, demonstrating that it can indeed model consistent and discriminative information from multi-view data.
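To make the two ingredients concrete, below is a minimal PyTorch sketch of how such an objective could be assembled: an InfoNCE-style lower bound standing in for the hybrid MI estimation between the aggregated feature and each view's representation, and a sliced approximation standing in for the Wasserstein view alignment regularizer. All names, the choice of InfoNCE as the MI estimator, and the sliced-Wasserstein approximation are illustrative assumptions on my part, not the paper's actual implementation.

```python
# Illustrative sketch only: InfoNCE as an MI lower bound and a sliced
# Wasserstein distance as the view-alignment regularizer are assumptions,
# not RHMC's published formulation.
import torch
import torch.nn.functional as F

def info_nce(anchor, positive, temperature=0.1):
    """InfoNCE lower bound on MI between paired batches of shape (B, d)."""
    a = F.normalize(anchor, dim=1)
    p = F.normalize(positive, dim=1)
    logits = a @ p.t() / temperature               # (B, B) similarity matrix
    labels = torch.arange(a.size(0), device=a.device)
    return F.cross_entropy(logits, labels)         # matched pairs lie on the diagonal

def sliced_wasserstein(x, y, n_proj=64):
    """Sliced Wasserstein-2 distance between two batches of shape (B, d)."""
    proj = F.normalize(torch.randn(x.size(1), n_proj, device=x.device), dim=0)
    xs, _ = torch.sort(x @ proj, dim=0)            # sort each 1-D projection
    ys, _ = torch.sort(y @ proj, dim=0)
    return ((xs - ys) ** 2).mean()

def rhmc_style_loss(high_feat, view_reps, lam=1.0):
    """Hybrid MI term plus pairwise view-alignment regularization.

    high_feat: aggregated high-dimensional feature, shape (B, d)
    view_reps: list of per-view low-dimensional representations, each (B, d)
    """
    mi = sum(info_nce(high_feat, z) for z in view_reps) / len(view_reps)
    align = sum(sliced_wasserstein(view_reps[i], view_reps[j])
                for i in range(len(view_reps))
                for j in range(i + 1, len(view_reps)))
    return mi + lam * align
```

The design intuition matches the abstract: the MI term ties each view's low-dimensional representation to the shared high-dimensional feature, while the alignment term penalizes distributional gaps between views so that the domain shift does not corrupt the MI estimate.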