Zero-shot learning (ZSL) has become increasingly popular in computer vision due to its ability to recognize categories unobserved in the training data. To date, most existing ZSL approaches adopt visual representations that are either derived from pretrained networks or learned with an end-to-end architecture. However, a single group of visual representations can hardly capture all the features hidden in the images, yielding incomplete visual information. In many real-life scenarios, multi-view visual representations are available that describe the instances more comprehensively and thus have the potential to improve learning performance. In this paper, we introduce an instance-wise multi-view visual fusion (IMVF) model for ZSL. In accordance with the consensus principle, a multi-view visual-semantic mapping is learned by minimizing the disparities among the seen-class semantic projections of different views. Meanwhile, a straightforward linear constraint is imposed on each seen-class instance to adhere to the complementary principle, so that cross-view information exchange is encouraged. To mitigate the domain shift problem, the predicted unseen-class semantic projections are further refined through a multi-view manifold alignment under the consensus principle. The proposed IMVFZSL is compared with state-of-the-art ZSL methods on the AwA2, CUB and SUN datasets, and the experimental results validate the effectiveness of the IMVF mechanism. To the best of our knowledge, this is an initial attempt to fuse multi-view visual representations in ZSL, and we hope it will stimulate further research in this field.
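To make the consensus principle concrete, the following is a minimal sketch of what a consensus-style multi-view visual-semantic mapping could look like: each view learns a linear projection onto a shared semantic (attribute) space, and the quantity the consensus term penalizes is the pairwise disparity between the views' projections. The function names, the closed-form ridge solution, and the toy data are illustrative assumptions for exposition, not the paper's exact formulation.

```python
import numpy as np

def fit_view_projections(X_views, S, lam=1.0):
    """Learn one linear map W_v per view so that X_v @ W_v approximates
    the seen-class semantic embeddings S (hypothetical per-view ridge fit)."""
    Ws = []
    for X in X_views:
        d = X.shape[1]
        # Closed-form ridge solution: W = (X^T X + lam I)^{-1} X^T S
        W = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ S)
        Ws.append(W)
    return Ws

def consensus_disparity(X_views, Ws):
    """Sum of squared differences between each pair of view-wise semantic
    projections -- the disparity a consensus objective would minimize."""
    P = [X @ W for X, W in zip(X_views, Ws)]
    total = 0.0
    for i in range(len(P)):
        for j in range(i + 1, len(P)):
            total += np.sum((P[i] - P[j]) ** 2)
    return total

# Toy usage: two visual views of 100 seen-class instances, 85-dim attributes
# (dimensions chosen arbitrarily for illustration).
rng = np.random.default_rng(0)
X_views = [rng.normal(size=(100, 512)), rng.normal(size=(100, 256))]
S = rng.normal(size=(100, 85))
Ws = fit_view_projections(X_views, S)
print("cross-view disparity:", consensus_disparity(X_views, Ws))
```

In practice the per-view fits and the disparity term would be optimized jointly, together with the instance-wise complementary constraint described above; the sketch only separates them to show the two ingredients clearly.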