Multimodal deep representation learning for protein interaction identification and protein family classification

Abstract

Background: Protein-protein interactions (PPIs) are constantly involved in dynamic pathological and biological processes. A thorough understanding of PPIs is therefore crucial for explaining disease occurrence, achieving optimal drug-target therapeutic effects, and describing protein complex structures. However, compared with the protein sequences available from various species and organisms, the number of known protein-protein interactions is relatively limited. To address this gap, considerable research effort has gone into facilitating the discovery of novel PPIs. Among these methods, PPI prediction techniques that rely solely on protein sequence data are more widespread than methods that require extensive biological domain knowledge.

Results: In this paper, we propose a multi-modal deep representation learning structure that combines protein physicochemical features with graph topological features from the PPI networks. Specifically, our method not only takes the protein sequence information into account but also learns a topological representation for each protein node in the PPI networks. We construct a stacked auto-encoder architecture together with a continuous bag-of-words (CBOW) model based on generated metapaths to study PPI prediction. We then use supervised deep neural networks to identify the PPIs and classify the protein families. The PPI prediction accuracy for eight species ranged from 96.76% to 99.77%, which indicates that our multi-modal deep representation learning framework achieves superior performance compared to other computational methods.

Conclusion: To the best of our knowledge, this is the first multi-modal deep representation learning framework for examining the PPI networks.
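
The graph-topology branch described in the abstract (paths generated over the PPI network and fed to a CBOW model) can be illustrated with a minimal sketch. The abstract does not specify the metapath scheme, so uniform random walks are used here as a stand-in; the toy edge list, the function names (`build_graph`, `random_walk`, `cbow_pairs`), the walk length, and the window size are all assumptions made for illustration, not the authors' implementation.

```python
import random
from collections import defaultdict


def build_graph(edges):
    """Build an undirected adjacency map from (protein, protein) pairs."""
    graph = defaultdict(set)
    for a, b in edges:
        graph[a].add(b)
        graph[b].add(a)
    return graph


def random_walk(graph, start, length, rng):
    """Uniform random walk; a stand-in for the paper's metapath generation."""
    walk = [start]
    while len(walk) < length:
        neighbors = sorted(graph[walk[-1]])  # sorted for reproducibility
        if not neighbors:
            break
        walk.append(rng.choice(neighbors))
    return walk


def cbow_pairs(walk, window):
    """Turn one walk into (context nodes, target node) CBOW training pairs."""
    pairs = []
    for i, target in enumerate(walk):
        context = walk[max(0, i - window):i] + walk[i + 1:i + 1 + window]
        if context:
            pairs.append((context, target))
    return pairs


# Hypothetical toy PPI network; real input would be a curated interaction set.
edges = [("P1", "P2"), ("P2", "P3"), ("P3", "P4"), ("P1", "P4")]
graph = build_graph(edges)
rng = random.Random(0)

walks = [random_walk(graph, node, 5, rng) for node in sorted(graph)]
pairs = [pair for walk in walks for pair in cbow_pairs(walk, 2)]
```

In a full pipeline, pairs like these would train a CBOW embedding for each protein node, and that embedding would then be combined with the sequence-derived representation from the stacked auto-encoder before the supervised classifier.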

Similar Papers
  • Research Article
  • Cited by 506
  • 10.1109/access.2019.2916887
Deep Multimodal Representation Learning: A Survey
  • Jan 1, 2019
  • IEEE Access
  • Wenzhong Guo + 2 more

Multimodal representation learning, which aims to narrow the heterogeneity gap among different modalities, plays an indispensable role in the utilization of ubiquitous multimodal data. Owing to its powerful representation ability with multiple levels of abstraction, deep learning-based multimodal representation learning has attracted much attention in recent years. In this paper, we provide a comprehensive survey of deep multimodal representation learning, a topic that has not previously been the sole focus of a survey. To facilitate the discussion of how the heterogeneity gap is narrowed, we categorize deep multimodal representation learning methods into three frameworks according to the underlying structures in which different modalities are integrated: joint representation, coordinated representation, and encoder-decoder. Additionally, we review some typical models in this area, ranging from conventional models to newly developed technologies. This paper highlights the key issues of newly developed technologies, such as the encoder-decoder model, generative adversarial networks, and the attention mechanism, from a multimodal representation learning perspective, which, to the best of our knowledge, have never been reviewed previously, even though they have become the major focuses of much contemporary research. For each framework or model, we discuss its basic structure, learning objective, application scenes, key issues, advantages, and disadvantages, such that both novice and experienced researchers can benefit from this survey. Finally, we suggest some important directions for future work.

  • Research Article
  • Cited by 77
  • 10.1007/s11280-018-0548-3
Multimodal deep representation learning for video classification
  • May 3, 2018
  • World Wide Web
  • Haiman Tian + 4 more

Real-world applications usually encounter data with various modalities, each containing valuable information. To enhance these applications, it is essential to effectively analyze all information extracted from different data modalities, while most existing learning models ignore some data types and only focus on a single modality. This paper presents a new multimodal deep learning framework for event detection from videos by leveraging recent advances in deep neural networks. First, several deep learning models are utilized to extract useful information from multiple modalities. Among these are pre-trained Convolutional Neural Networks (CNNs) for visual and audio feature extraction and a word embedding model for textual analysis. Then, a novel fusion technique is proposed that integrates different data representations in two levels, namely frame-level and video-level. Different from the existing multimodal learning algorithms, the proposed framework can reason about a missing data type using other available data modalities. The proposed framework is applied to a new video dataset containing natural disaster classes. The experimental results illustrate the effectiveness of the proposed framework compared to some single modal deep learning models as well as conventional fusion techniques. Specifically, the final accuracy is improved more than 16% and 7% compared to the best results from single modality and fusion models, respectively.

  • Conference Article
  • Cited by 44
  • 10.1109/icmlc48188.2019.8949228
Multimodal Representation Learning: Advances, Trends and Challenges
  • Jul 1, 2019
  • Su-Fang Zhang + 4 more

Representation learning is the foundation of, and crucial for, downstream tasks such as classification, regression, and recognition. The goal of representation learning is to automatically learn good features with deep models. Multimodal representation learning is a special case of representation learning that automatically learns good features from multiple modalities; these modalities are not independent, and there are correlations and associations among them. Furthermore, multimodal data are usually heterogeneous. Owing to these characteristics, multimodal representation learning poses many difficulties: how to combine multimodal data from heterogeneous sources; how to jointly learn features from multimodal data; how to effectively describe the correlations and associations; and so on. These difficulties, along with the upsurge of deep learning, have triggered great interest among researchers, and many deep multimodal learning methods have been proposed. In this paper, we present an overview of deep multimodal learning, especially the approaches proposed within the last decades. We provide potential readers with advances, trends and challenges, which can be very helpful to researchers in the field of machine learning, especially those engaged in the study of multimodal deep machine learning.

  • Research Article
  • Cited by 1299
  • 10.1109/tgrs.2020.3016820
More Diverse Means Better: Multimodal Deep Learning Meets Remote-Sensing Imagery Classification
  • Aug 16, 2020
  • IEEE Transactions on Geoscience and Remote Sensing
  • Danfeng Hong + 6 more

Classification and identification of the materials lying over or beneath the Earth's surface have long been a fundamental but challenging research topic in geoscience and remote sensing (RS) and have attracted growing attention owing to the recent advancements of deep learning techniques. Although deep networks have been successfully applied in single-modality-dominated classification tasks, their performance inevitably meets a bottleneck in complex scenes that need to be finely classified, due to the limitation of information diversity. In this work, we provide a baseline solution to the aforementioned difficulty by developing a general multimodal deep learning (MDL) framework. In particular, we also investigate a special case of multi-modality learning (MML) -- cross-modality learning (CML) that exists widely in RS image classification applications. By focusing on "what", "where", and "how" to fuse, we show different fusion strategies as well as how to train deep networks and build the network architecture. Specifically, five fusion architectures are introduced and developed, and are further unified in our MDL framework. More significantly, our framework is not limited to pixel-wise classification tasks but is also applicable to spatial information modeling with convolutional neural networks (CNNs). To validate the effectiveness and superiority of the MDL framework, extensive experiments related to the settings of MML and CML are conducted on two different multimodal RS datasets. Furthermore, the codes and datasets will be available at https://github.com/danfenghong/IEEE_TGRS_MDL-RS, contributing to the RS community.

  • Dissertation
  • 10.25148/etd.fidc007772
Multimodal Data Analytics and Fusion for Data Science
  • Jun 6, 2019
  • Haiman Tian

Advances in technologies have rapidly accumulated a zettabyte of “new” data every two years. The huge amount of data has a powerful impact on various areas in science and engineering and generates enormous research opportunities, which calls for the design and development of advanced approaches in data analytics. Given such demands, data science has become an emerging hot topic in both industry and academia, ranging from basic business solutions, technological innovations, and multidisciplinary research to political decisions, urban planning, and policymaking. Within the scope of this dissertation, a multimodal data analytics and fusion framework is proposed for data-driven knowledge discovery and cross-modality semantic concept detection. The proposed framework can explore useful knowledge hidden in different formats of data and incorporate representation learning from data in multiple modalities, especially for disaster information management. First, a Feature Affinity-based Multiple Correspondence Analysis (FA-MCA) method is presented to analyze the correlations between low-level features from different features, and an MCA-based Neural Network (MCA-NN) is proposed to capture the high-level features from individual FA-MCA models and seamlessly integrate the semantic data representations for video concept detection. Next, a genetic algorithm-based approach is presented for deep neural network selection. Furthermore, the improved genetic algorithm is integrated with deep neural networks to generate populations for producing optimal deep representation learning models. Then, the multimodal deep representation learning framework is proposed to incorporate the semantic representations from data in multiple modalities efficiently. Finally, fusion strategies are applied to accommodate multiple modalities. In this framework, cross-modal mapping strategies are also proposed to organize the features in a better structure to improve the overall performance.

  • Research Article
  • 10.52783/jisem.v10i16s.2593
Stress Detection using Multimodal Representation Learning, Fusion Techniques, and Applications
  • Mar 6, 2025
  • Journal of Information Systems Engineering and Management
  • Yogesh J Gaikwad

The fields of speech recognition, image identification, and natural language processing have undergone a paradigm shift with the advent of machine learning and deep learning approaches. Although these tasks rely primarily on a single modality for input signals, the artificial intelligence field has various applications that necessitate the use of several modalities. In recent years, academics have placed a growing emphasis on the intricate topic of modelling and learning across various modalities. This has attracted the interest of the scientific community. This technical article provides a comprehensive analysis of the models and learning methods available for multimodal intelligence. Specifically, this work concentrates on the fusion of video and language processing modalities, which has become a crucial area in both computer vision and natural language research. In this article, we explore recent research on multimodal deep learning from three different perspectives: learning multimodal representations, combining multimodal inputs at different levels, and multimodal applications. Regarding the learning of multimodal representations, the article delves into the concept of embedding, which involves the combination of different types of signals into a unified vector space. This enables cross-modal signal processing, which has significant implications for various applications. Moreover, several forms of embedding created and trained for common downstream tasks are examined. Regarding multimodal fusion, the research focuses on specific designs that merge representations of unimodal inputs for a specific purpose.

  • Research Article
  • 10.1007/s13735-025-00382-8
A Comprehensive Review of Multimodal Visual Representation Learning: Tracing the Evolution from CNNs to Transformers and Beyond
  • Sep 30, 2025
  • International Journal of Multimedia Information Retrieval
  • Dong Zhang + 2 more

The primary goal of multimodal visual representation learning is to generate implicit information that effectively represents multimodal information by exploring the commonalities and characteristics between different modalities. This research report will discuss currently widely used advanced methods in the field of multimodal visual representation learning. This article will discuss these methods in the following order, culminating in multimodal visual learning: (1) pre-trained visual representation learning, (2) generative visual representation learning, (3) contrastive multimodal visual representation learning, and (4) image-text multimodal visual representation learning methods. Each element provides useful clues that ultimately lead to multimodal visual learning. Pre-trained visual representation learning refers to the application of supervised pre-training models in visual representation learning, while generative visual representation learning uses generative models to learn feature representations that can integrate multimodal information. Contrastive multimodal visual representation learning uses contrastive learning methods to compare similar and dissimilar sample pairs, learning feature representations in a self-supervised manner. Image-text multimodal visual representation learning methods, on the other hand, attempt to enhance the capabilities of visual representation learning by fusing visual information (such as images) with textual information. This review report will explain the above research background, the classification of different research methods, commonly used evaluation methods, and future development trends.

  • Research Article
  • 10.1007/s10278-025-01788-w
A Timeseries-based Multimodal Deep Learning Approach for Lung Nodule Growth Prediction.
  • Dec 16, 2025
  • Journal of Imaging Informatics in Medicine
  • Duc-Khanh Nguyen + 4 more

Lung nodules, while often benign, can become significant health concerns if their growth is not monitored accurately. Predicting lung nodule growth is critical for improving patient outcomes and guiding clinical decision-making. This study aims to develop a Multimodal Deep Learning Approach to enhance the accuracy of lung nodule growth prediction by integrating time-series CT image data with demographics and nodule-specific features. Data were collected from the Far Eastern Memorial Hospital, Taiwan, including CT image sequences of lung nodules and patient demographics and nodule-specific features. Using this dataset, a Multimodal Deep Learning framework was developed and optimized. The model's performance was assessed using metrics such as Accuracy, Precision, Sensitivity, F1-score, and AUC. The proposed Multimodal Deep Learning framework substantially outperformed traditional machine learning and unimodal models. Among all configurations, the repeat frame strategy achieved the best overall performance, with an accuracy of 0.929, precision of 0.878, sensitivity of 0.908, F1-score of 0.878, and AUC of 0.977. Paired t-test analysis confirmed that these improvements were statistically significant (p < 0.05) compared to other multimodal variants and baseline models. These results highlight the model's ability to effectively integrate image, demographics, and nodule-specific features, leading to superior predictive accuracy and robust clinical decision-support potential. By using the time-series of CT image data, along with demographics and nodule-specific features, the proposed Multimodal Deep Learning provides a reliable tool for predicting lung nodule growth. This advancement has significant implications for lung nodule management, offering clinicians a robust and dependable resource to support medical decision-making and improve patient care. The findings highlight the transformative potential of deep learning techniques in critical healthcare domains.

  • Research Article
  • Cited by 57
  • 10.1038/s41422-024-00989-2
Zero-shot prediction of mutation effects with multimodal deep representation learning guides protein engineering
  • Jul 5, 2024
  • Cell Research
  • Peng Cheng + 19 more

Mutations in amino acid sequences can provoke changes in protein function. Accurate and unsupervised prediction of mutation effects is critical in biotechnology and biomedicine, but remains a fundamental challenge. To resolve this challenge, here we present Protein Mutational Effect Predictor (ProMEP), a general and multiple sequence alignment-free method that enables zero-shot prediction of mutation effects. A multimodal deep representation learning model embedded in ProMEP was developed to comprehensively learn both sequence and structure contexts from ~160 million proteins. ProMEP achieves state-of-the-art performance in mutational effect prediction and accomplishes a tremendous improvement in speed, enabling efficient and intelligent protein engineering. Specifically, ProMEP accurately forecasts mutational consequences on the gene-editing enzymes TnpB and TadA, and successfully guides the development of high-performance gene-editing tools with their engineered variants. The gene-editing efficiency of a 5-site mutant of TnpB reaches up to 74.04% (vs 24.66% for the wild type); and the base editing tool developed on the basis of a TadA 15-site mutant (in addition to the A106V/D108N double mutation that renders deoxyadenosine deaminase activity to TadA) exhibits an A-to-G conversion frequency of up to 77.27% (vs 69.80% for ABE8e, a previous TadA-based adenine base editor) with significantly reduced bystander and off-target effects compared to ABE8e. ProMEP not only showcases superior performance in predicting mutational effects on proteins but also demonstrates a great capability to guide protein engineering. Therefore, ProMEP enables efficient exploration of the gigantic protein space and facilitates practical design of proteins, thereby advancing studies in biomedicine and synthetic biology.

  • Research Article
  • Cited by 400
  • 10.1007/s00371-021-02166-7
A survey on deep multimodal learning for computer vision: advances, trends, applications, and datasets
  • Jun 10, 2021
  • The Visual Computer
  • Khaled Bayoudh + 3 more

The research progress in multimodal learning has grown rapidly over the last decade in several areas, especially in computer vision. The growing potential of multimodal data streams and deep learning algorithms has contributed to the increasing universality of deep multimodal learning. This involves the development of models capable of processing and analyzing the multimodal information uniformly. Unstructured real-world data can inherently take many forms, also known as modalities, often including visual and textual content. Extracting relevant patterns from this kind of data is still a motivating goal for researchers in deep learning. In this paper, we seek to improve the understanding of key concepts and algorithms of deep multimodal learning for the computer vision community by exploring how to generate deep models that consider the integration and combination of heterogeneous visual cues across sensory modalities. In particular, we summarize six perspectives from the current literature on deep multimodal learning, namely: multimodal data representation, multimodal fusion (i.e., both traditional and deep learning-based schemes), multitask learning, multimodal alignment, multimodal transfer learning, and zero-shot learning. We also survey current multimodal applications and present a collection of benchmark datasets for solving problems in various vision domains. Finally, we highlight the limitations and challenges of deep multimodal learning and provide insights and directions for future research.

  • Research Article
  • Cited by 457
  • 10.1109/jstsp.2020.2987728
Multimodal Intelligence: Representation Learning, Information Fusion, and Applications
  • Mar 1, 2020
  • IEEE Journal of Selected Topics in Signal Processing
  • Chao Zhang + 3 more

Deep learning methods have revolutionized speech recognition, image recognition, and natural language processing since 2010. Each of these tasks involves a single modality in their input signals. However, many applications in the artificial intelligence field involve multiple modalities. Therefore, it is of broad interest to study the more difficult and complex problem of modeling and learning across multiple modalities. In this paper, we provide a technical review of available models and learning methods for multimodal intelligence. The main focus of this review is the combination of vision and natural language modalities, which has become an important topic in both the computer vision and natural language processing research communities. This review provides a comprehensive analysis of recent works on multimodal deep learning from three perspectives: learning multimodal representations, fusing multimodal signals at various levels, and multimodal applications. Regarding multimodal representation learning, we review the key concepts of embedding, which unify multimodal signals into a single vector space and thereby enable cross-modality signal processing. We also review the properties of many types of embeddings that are constructed and learned for general downstream tasks. Regarding multimodal fusion, this review focuses on special architectures for the integration of representations of unimodal signals for a particular task. Regarding applications, selected areas of a broad interest in the current literature are covered, including image-to-text caption generation, text-to-image generation, and visual question answering. We believe that this review will facilitate future studies in the emerging field of multimodal intelligence for related communities.

  • Research Article
  • 10.37648/ijrst.v12i03.009
An In-Depth Analysis of the Multimodal Representation Learning with Respect to the Applications and Linked Challenges in Multiple Sectors
  • Jan 1, 2022
  • International Journal of Research in Science and Technology
  • Arnav Goenka

Representation learning is a machine learning type wherein a system automatically uses deep models to extract features from raw data. It is essential for tasks like classifications, regression, and identification. Multimodal representation learning is a subset of representation learning that focuses on feature extraction from several heterogeneous, interconnected modalities. Although these modalities are frequently heterogeneous, they show correlations and relationships. These modalities include text, images, audio, or videos. Several difficulties arise from this intrinsic complexity, including combining multimodal data from various sources by precisely characterizing the relationships and correlations between modalities and jointly deriving features from multimodal data. Researchers are becoming increasingly interested in these problems, particularly as deep learning gains momentum. In recent years, many deep multimodal learning techniques have been developed. We present an overview of deep multimodal learning in this study, focusing on techniques that have been proposed in the past decade. We aim to provide readers with valuable insights for researchers, especially those working on multimodal deep machine learning, by educating them on the latest developments, trends, and difficulties in this field.

  • Research Article
  • Cited by 7
  • 10.1016/j.knosys.2024.111990
A Dynamic Multi-modal deep Reinforcement Learning framework for 3D Bin Packing Problem
  • May 24, 2024
  • Knowledge-Based Systems
  • Anhao Zhao + 2 more


  • Supplementary Content
  • Cited by 28
  • 10.1093/genetics/iyae161
A review of multimodal deep learning methods for genomic-enabled prediction in plant breeding
  • Nov 5, 2024
  • Genetics
  • Osval A Montesinos-López + 9 more

Deep learning methods have been applied when working to enhance the prediction accuracy of traditional statistical methods in the field of plant breeding. Although deep learning seems to be a promising approach for genomic prediction, it has proven to have some limitations, since its conventional methods fail to leverage all available information. Multimodal deep learning methods aim to improve the predictive power of their unimodal counterparts by introducing several modalities (sources) of input information. In this review, we introduce some theoretical basic concepts of multimodal deep learning and provide a list of the most widely used neural network architectures in deep learning, as well as the available strategies to fuse data from different modalities. We mention some of the available computational resources for the practical implementation of multimodal deep learning problems. We finally performed a review of applications of multimodal deep learning to genomic selection in plant breeding and other related fields. We present a meta-picture of the practical performance of multimodal deep learning methods to highlight how these tools can help address complex problems in the field of plant breeding. We discussed some relevant considerations that researchers should keep in mind when applying multimodal deep learning methods. Multimodal deep learning holds significant potential for various fields, including genomic selection. While multimodal deep learning displays enhanced prediction capabilities over unimodal deep learning and other machine learning methods, it demands more computational resources. Multimodal deep learning effectively captures intermodal interactions, especially when integrating data from different sources. To apply multimodal deep learning in genomic selection, suitable architectures and fusion strategies must be chosen. It is relevant to keep in mind that multimodal deep learning, like unimodal deep learning, is a powerful tool but should be carefully applied. Given its predictive edge over traditional methods, multimodal deep learning is valuable in addressing challenges in plant breeding and food security amid a growing global population.

  • Abstract
  • 10.1136/annrheumdis-2024-eular.1222
AB0875 ARTIFICIAL INTELLIGENCE TO PREDICT DISEASE ACTIVITY USING A MULTIMODAL MODEL WITH MAGNETIC RESONANCE IMAGING AND LABORATORY RESULTS IN PATIENTS WITH AXIAL SPONDYLOARTHRITIS
  • Jun 1, 2024
  • Annals of the Rheumatic Diseases
  • H S Cha + 5 more

Background: Sacral magnetic resonance imaging (MRI) helps determine whether patients with axial spondyloarthritis (axSpA) have active disease by detecting sacroiliitis. However, there is no consensus on the extent to which sacroiliitis...
