Abstract

Analyzing complex systems with multimodal data, such as images and text, has recently received tremendous attention. Modeling the relationship between different modalities is the key to addressing this problem. Motivated by recent successful applications of deep neural learning to unimodal data, in this paper we propose a computational deep neural architecture, the bimodal deep architecture (BDA), for measuring the similarity between different modalities. The proposed BDA consists of three closely related consecutive components. The first component extracts features for the image and text modalities using popular methods from each individual modality. The second component consists of two types of stacked restricted Boltzmann machines (RBMs): for the image modality, a binary-binary RBM is stacked over a Gaussian-binary RBM; for the text modality, a binary-binary RBM is stacked over a replicated softmax RBM. In the third component, we introduce a variant autoencoder with a predefined loss function for discriminatively learning the regularity between different modalities. We experimentally show the effectiveness of our approach on the task of classifying image tags on publicly available datasets.
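To make the second component concrete, the following is a minimal sketch of greedy layer-wise stacking of RBMs, assuming binary-binary units and one-step contrastive divergence (CD-1) training; the paper's Gaussian-binary and replicated softmax variants, hyperparameters, and the cross-modal autoencoder are not shown, and all names here are illustrative.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class BinaryRBM:
    """Binary-binary RBM trained with CD-1 (a hypothetical minimal sketch)."""

    def __init__(self, n_visible, n_hidden, lr=0.1, seed=0):
        rng = np.random.default_rng(seed)
        self.W = 0.01 * rng.standard_normal((n_visible, n_hidden))
        self.b_v = np.zeros(n_visible)  # visible biases
        self.b_h = np.zeros(n_hidden)   # hidden biases
        self.lr = lr
        self.rng = rng

    def hidden_probs(self, v):
        return sigmoid(v @ self.W + self.b_h)

    def visible_probs(self, h):
        return sigmoid(h @ self.W.T + self.b_v)

    def cd1_step(self, v0):
        # Positive phase: hidden probabilities and a sample given the data.
        ph0 = self.hidden_probs(v0)
        h0 = (self.rng.random(ph0.shape) < ph0).astype(float)
        # Negative phase: one Gibbs step back to the visible layer (CD-1).
        pv1 = self.visible_probs(h0)
        ph1 = self.hidden_probs(pv1)
        # Approximate gradient of the log-likelihood, averaged over the batch.
        n = v0.shape[0]
        self.W += self.lr * (v0.T @ ph0 - pv1.T @ ph1) / n
        self.b_v += self.lr * (v0 - pv1).mean(axis=0)
        self.b_h += self.lr * (ph0 - ph1).mean(axis=0)

# Greedy stacking: train the first RBM on raw features, then train a
# second RBM on the first layer's hidden activations.
rng = np.random.default_rng(1)
data = (rng.random((64, 20)) < 0.5).astype(float)  # toy binary features
rbm1 = BinaryRBM(n_visible=20, n_hidden=12)
for _ in range(50):
    rbm1.cd1_step(data)
layer1 = rbm1.hidden_probs(data)
rbm2 = BinaryRBM(n_visible=12, n_hidden=6)
for _ in range(50):
    rbm2.cd1_step(layer1)
codes = rbm2.hidden_probs(layer1)
print(codes.shape)  # (64, 6)
```

In the paper's setting, the bottom RBM of each stack would instead match the input statistics of its modality (Gaussian-binary for real-valued image features, replicated softmax for word counts), with the top-level binary codes feeding the cross-modal component.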

Highlights

  • There is a growing demand for analyzing complex systems with a great number of variables [1, 2], such as multimodal data with images and text, due to the availability of computational power and massive storage

  • Motivated by recent successful applications of deep neural learning to unimodal data, in this paper we propose a computational deep neural architecture, the bimodal deep architecture (BDA), for measuring the similarity between different modalities

  • We experimentally show the effectiveness of our approach on the task of classifying image tags on publicly available datasets

Introduction

There is a growing demand for analyzing complex systems with a great number of variables [1, 2], such as multimodal data with images and text, due to the availability of computational power and massive storage. A travel photo shared on a website, for example, is usually tagged with some meaningful words. Analyzing such heterogeneous data with a great number of variables from multiple sources could benefit the individual modalities. Motivated by the biological propagation phenomena in the distributed structure of the human brain, deep neural learning has received considerable attention since 2006. These deep neural learning methods are designed to learn hierarchical and effective representations that facilitate various recognition and analysis tasks in complex artificial systems. Even with such a short history, deep neural learning has achieved great success in tasks modeling single-modal data, such as speech recognition systems [3,4,5,6] and computer vision systems [7,8,9,10,11,12], to name a few
