Abstract

Effective embedding of biomolecular information is being actively studied by applying deep learning. Obtaining better embeddings enhances the quality of downstream analyses, such as DNA sequence motif detection and protein function prediction. In this study, we adopt a pre-training algorithm for the effective embedding of RNA bases to acquire semantically rich representations and apply this algorithm to two fundamental RNA sequence problems: structural alignment and clustering. By using the pre-training algorithm to embed the four bases of RNA in a position-dependent manner using a large number of RNA sequences from various RNA families, a context-sensitive embedding representation is obtained. As a result, not only base information but also the secondary structure and context information of each RNA sequence are embedded for each base. We call this ‘informative base embedding’ and use it to achieve accuracies superior to those of existing state-of-the-art methods on RNA structural alignment and RNA family clustering tasks. Furthermore, by combining this informative base embedding with a simple Needleman–Wunsch alignment algorithm, we succeed in calculating structural alignments with a time complexity of O(n²) instead of the O(n⁶) time complexity of a naive implementation of the Sankoff-style algorithm for input RNA sequences of length n.
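The abstract describes combining per-base embeddings with a simple Needleman–Wunsch algorithm to obtain structural alignments in O(n²) time. The following is a minimal illustrative sketch of that idea, not the authors' implementation: it assumes per-base embedding vectors are already available (random stand-ins are used here) and uses cosine similarity between embeddings as the match score with an arbitrary linear gap penalty.

```python
# Minimal sketch: Needleman-Wunsch global alignment where the match score
# between two positions is the cosine similarity of their (hypothetical)
# per-base embedding vectors. Gap penalty and scoring are illustrative.
import numpy as np

def needleman_wunsch_score(emb_a, emb_b, gap=-1.0):
    """Return the optimal global alignment score for two sequences
    given their per-base embeddings of shape (length, dim). Runs in O(n*m)."""
    n, m = len(emb_a), len(emb_b)
    a = emb_a / np.linalg.norm(emb_a, axis=1, keepdims=True)
    b = emb_b / np.linalg.norm(emb_b, axis=1, keepdims=True)
    sim = a @ b.T                                  # (n, m) match scores
    dp = np.zeros((n + 1, m + 1))
    dp[:, 0] = gap * np.arange(n + 1)              # leading gaps
    dp[0, :] = gap * np.arange(m + 1)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            dp[i, j] = max(dp[i - 1, j - 1] + sim[i - 1, j - 1],  # (mis)match
                           dp[i - 1, j] + gap,                    # gap in B
                           dp[i, j - 1] + gap)                    # gap in A
    return dp[n, m]

# Usage with random stand-in embeddings; in practice one vector per base
# would be taken from the pre-trained model's transformer output.
rng = np.random.default_rng(0)
print(needleman_wunsch_score(rng.normal(size=(30, 120)),
                             rng.normal(size=(28, 120))))
```

Because both the pairwise similarity matrix and the dynamic-programming table are quadratic in sequence length, the whole procedure stays at O(n²), in contrast to the O(n⁶) Sankoff-style recursion that aligns sequence and structure simultaneously.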

Highlights

  • Unstructured data, such as biological sequences and networks, require an embedding operation that encodes them into a high-dimensional numerical vector space

  • To investigate whether RNABERT acquired an informative base embedding encoding the four RNA bases and secondary structure information, the embedded representations output from the transformer layer for a set of RNA sequences were projected into two-dimensional space using t-distributed stochastic neighbour embedding (t-SNE) [37], a dimension-reduction algorithm that maps high-dimensional data to low dimensions (see the sketch after this list)

  • This result clearly shows that RNABERT embedding using pretraining with structural alignment learning (SAL) and masked language modelling (MLM) tasks succeeded in encoding base information and secondary structure information
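The projection described in the second highlight can be sketched as follows. This is not the authors' code: the `embeddings` and `labels` arrays are random stand-ins for the transformer outputs and per-base annotations, and the t-SNE hyperparameters are arbitrary.

```python
# Illustrative t-SNE projection of per-base embeddings into 2D,
# coloured by base identity. All inputs are random stand-ins.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

embeddings = np.random.normal(size=(2000, 120))      # stand-in transformer outputs
labels = np.random.choice(list("ACGU"), size=2000)   # stand-in base labels

coords = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(embeddings)

for base in "ACGU":
    mask = labels == base
    plt.scatter(coords[mask, 0], coords[mask, 1], s=3, label=base)
plt.legend()
plt.title("t-SNE of per-base embeddings (illustrative)")
plt.show()
```

If the embedding were informative, points would be expected to cluster by base identity and, within each base, by secondary-structure context.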


Introduction

Unstructured data, such as biological sequences and networks, require an embedding operation that encodes them into a high-dimensional numerical vector space. Attempts have been made to embed DNA, RNA and amino acid sequences effectively using deep representation learning, in particular techniques developed in the field of natural language processing [1,2,3]. These studies are based on the idea that nucleotide composition and sequence structure determine the motif and function of a gene sequence, just as the complex grammatical structure of natural language determines the meaning of a sentence. Dna2vec adopts the word2vec technique by defining a k-mer as a word in the DNA sequence; however, because dna2vec assumes a sufficiently large vocabulary of distinct words for embedding, the four nucleotides (four words) are too few to obtain an effective embedding when dna2vec is applied to base-by-base DNA sequence embedding.
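As a rough illustration of the k-mer-as-word idea behind dna2vec, the sketch below tokenises DNA sequences into overlapping k-mers and trains a word2vec model on them. This is not the dna2vec implementation; the choice of k, the toy sequences and the gensim (≥4.0) hyperparameters are assumptions made for illustration only.

```python
# Sketch of the dna2vec-style idea: treat overlapping k-mers as "words"
# and DNA sequences as "sentences", then train word2vec on them.
from gensim.models import Word2Vec

def to_kmers(seq, k=6):
    """Tokenise a DNA sequence into overlapping k-mer 'words'."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

sequences = ["ACGTACGTGGCATTGACC", "TTGACGTACGTTACGGCA"]   # toy sequences
sentences = [to_kmers(s) for s in sequences]

model = Word2Vec(sentences, vector_size=100, window=5, min_count=1, sg=1)
print(model.wv["ACGTAC"][:5])   # embedding vector of one k-mer

# With k = 1 the vocabulary collapses to only four words (A, C, G, T/U),
# which is the limitation the paragraph above points out for
# base-by-base embedding.
```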
