Abstract

Deep learning-based methods have achieved strong performance on various recognition benchmarks, mostly by utilizing a single modality. Since different modalities contain complementary information, multi-modal methods have been proposed to exploit them, typically in an implicit way. In this paper, we propose a simple technique, called correspondence learning (CL), which explicitly learns the relationship among multiple modalities. The modalities of the data samples are randomly mixed across different samples. If the modalities come from the same sample (not mixed), they have positive correspondence; otherwise, they have negative correspondence. CL is an auxiliary task in which the model predicts the correspondence among modalities. The model is expected to extract information from each modality to check correspondence and thereby achieve better representations for multi-modal recognition tasks. We first validate the proposed method on multi-modal benchmarks, including the CMU Multimodal Opinion-Level Sentiment Intensity (CMU-MOSI) and CMU Multimodal Opinion Sentiment and Emotion Intensity (CMU-MOSEI) sentiment analysis datasets. In addition, we propose a fraud detection method that uses the learned correspondence among modalities. To validate this additional usage, we collect a multi-modal fraud detection dataset of real-world samples from reverse vending machines.
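To make the procedure concrete, below is a minimal sketch of correspondence learning as described above: one modality is randomly mixed across samples, a binary label records whether the modalities still come from the same sample, and an auxiliary head predicts that label alongside the main task. The module names, feature dimensions, choice of which modality to shuffle, and the 50% mixing rate are illustrative assumptions, not the authors' exact implementation.

```python
# Minimal sketch of correspondence learning (CL); shapes and names are assumptions.
import torch
import torch.nn as nn


class CorrespondenceModel(nn.Module):
    """Two-modality model with a main recognition head and an auxiliary CL head."""

    def __init__(self, dim_a, dim_b, hidden=128, num_classes=2):
        super().__init__()
        self.enc_a = nn.Sequential(nn.Linear(dim_a, hidden), nn.ReLU())  # modality A encoder
        self.enc_b = nn.Sequential(nn.Linear(dim_b, hidden), nn.ReLU())  # modality B encoder
        self.classifier = nn.Linear(2 * hidden, num_classes)             # main recognition head
        self.corr_head = nn.Linear(2 * hidden, 1)                        # auxiliary correspondence head

    def forward(self, x_a, x_b):
        h = torch.cat([self.enc_a(x_a), self.enc_b(x_b)], dim=-1)
        return self.classifier(h), self.corr_head(h).squeeze(-1)


def cl_batch(x_a, x_b, mix_ratio=0.5):
    """Randomly mix modality B across samples; label 1 if both modalities still
    come from the same sample (positive correspondence), else 0."""
    n = x_a.size(0)
    mixed = torch.rand(n) < mix_ratio                     # samples that receive a shuffled modality B
    perm = torch.randperm(n)
    x_b_mixed = torch.where(mixed.unsqueeze(-1), x_b[perm], x_b)
    corr_label = (~mixed | (perm == torch.arange(n))).float()
    return x_a, x_b_mixed, corr_label


# One training step: main task loss plus the auxiliary correspondence loss.
# (In practice the main loss might be restricted to unmixed samples.)
model = CorrespondenceModel(dim_a=64, dim_b=32)
x_a, x_b = torch.randn(8, 64), torch.randn(8, 32)
y = torch.randint(0, 2, (8,))
x_a, x_b_mixed, corr = cl_batch(x_a, x_b)
logits, corr_logit = model(x_a, x_b_mixed)
loss = nn.functional.cross_entropy(logits, y) \
     + nn.functional.binary_cross_entropy_with_logits(corr_logit, corr)
loss.backward()
```

The auxiliary head only shapes the shared representation during training; for ordinary recognition at test time its output can simply be ignored.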

Highlights

  • Advances in deep learning [1,2] have shown state-of-the-art performance in various recognition tasks [3,4,5]

  • In the garbage classification task, we show that single-modality models are vulnerable to fraud inputs and unseen-class objects, and that the learned correspondence can be used for fraud detection with high detection rates (see the sketch after this list)

  • We propose correspondence learning (CL) for multi-modal object recognition tasks
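The last highlight notes that the learned correspondence can flag fraud inputs. Below is a hedged sketch of how that might look at test time, reusing the CorrespondenceModel from the previous sketch; the decision threshold of 0.5 is an illustrative assumption and would normally be tuned on validation data for a target detection rate.

```python
# Hedged sketch: flag inputs whose modalities do not correspond.
import torch


@torch.no_grad()
def detect_fraud(model, x_a, x_b, threshold=0.5):
    """Return a boolean mask: True where the two modalities do not correspond,
    e.g. a fraudulent or unseen object presented to a reverse vending machine."""
    model.eval()
    _, corr_logit = model(x_a, x_b)          # auxiliary correspondence head
    corr_score = torch.sigmoid(corr_logit)   # estimated probability of a match
    return corr_score < threshold            # low correspondence -> flag as fraud
```

Samples flagged in this way can be rejected or routed for manual inspection, while the main recognition head is trusted only when the correspondence score is high.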


Summary

Introduction

Advances in deep learning [1,2] have shown state-of-the-art performance in various recognition tasks [3,4,5]. Individual sensors provide limited information, and different sensors offer complementary information. Multi-modal systems with multiple sensors have been developed to exploit this complementary information [11,12,13,14]. In action recognition tasks [15,16,17], initial approaches use RGB image sequences and optical flow sequences as model inputs, since RGB images provide contextual information and optical flow images provide motion information. Although the final performance is weaker than the state of the art for each task, this serves as a proof of concept for utilizing multi-modal inputs.


