Abstract
Deep learning is being actively applied to obtain effective embeddings of biomolecular information. Better embeddings enhance the quality of downstream analyses, such as DNA sequence motif detection and protein function prediction. In this study, we adopt a pre-training algorithm for the effective embedding of RNA bases to acquire semantically rich representations, and we apply this algorithm to two fundamental RNA sequence problems: structural alignment and clustering. By using the pre-training algorithm to embed the four bases of RNA in a position-dependent manner, trained on a large number of RNA sequences from various RNA families, we obtain a context-sensitive embedding representation. As a result, not only base information but also secondary structure and context information of RNA sequences are embedded for each base. We call this ‘informative base embedding’ and use it to achieve accuracies superior to those of existing state-of-the-art methods on RNA structural alignment and RNA family clustering tasks. Furthermore, by combining this informative base embedding with a simple Needleman–Wunsch alignment algorithm, we succeed in calculating structural alignments with a time complexity of O(n^2), instead of the O(n^6) time complexity of a naive implementation of the Sankoff-style algorithm, for input RNA sequences of length n.
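To make the O(n^2) claim concrete, the following is a minimal sketch of Needleman–Wunsch global alignment in which the match score comes from per-base embedding vectors rather than from a fixed substitution matrix. It is not the authors' implementation: the use of cosine similarity as the match score, the linear gap penalty of -0.5, and the context-free one-hot `TOY_EMBEDDING` table are all assumptions chosen for illustration. In the setting described above, each base would instead carry a context-sensitive vector produced by the pre-trained model.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def needleman_wunsch(emb_x, emb_y, gap=-0.5):
    """Globally align two lists of per-base embedding vectors.

    Returns (score, pairs). Each pair is (i, j); None marks a gap.
    Filling the (n+1) x (m+1) table is the quadratic-time step.
    The similarity function and gap penalty are illustrative assumptions.
    """
    n, m = len(emb_x), len(emb_y)
    sim = [[cosine(x, y) for y in emb_y] for x in emb_x]
    score = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(
                score[i - 1][j - 1] + sim[i - 1][j - 1],  # align base i with base j
                score[i - 1][j] + gap,                    # gap in y
                score[i][j - 1] + gap,                    # gap in x
            )

    # Trace back from the bottom-right cell to recover one optimal alignment.
    pairs = []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and math.isclose(
            score[i][j], score[i - 1][j - 1] + sim[i - 1][j - 1]
        ):
            pairs.append((i - 1, j - 1))
            i, j = i - 1, j - 1
        elif i > 0 and (j == 0 or math.isclose(score[i][j], score[i - 1][j] + gap)):
            pairs.append((i - 1, None))
            i -= 1
        else:
            pairs.append((None, j - 1))
            j -= 1
    pairs.reverse()
    return score[n][m], pairs


# Hypothetical, context-free stand-in for learned embeddings.
TOY_EMBEDDING = {
    "A": [1, 0, 0, 0],
    "C": [0, 1, 0, 0],
    "G": [0, 0, 1, 0],
    "U": [0, 0, 0, 1],
}


def embed(seq):
    """Look up the toy vector for each base of an RNA string."""
    return [TOY_EMBEDDING[base] for base in seq]


alignment_score, alignment = needleman_wunsch(embed("GAUC"), embed("GAC"))
print(alignment_score, alignment)
```

With the toy one-hot vectors this reduces to ordinary sequence alignment; any structural signal would have to come from the embeddings themselves, which is why the dynamic programme can stay quadratic.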
Highlights
Unstructured data, such as biological sequences and networks, require an embedding operation that encodes the unstructured data into a high-dimensional numerical vector space
To investigate whether RNABERT acquired an informative base embedding to encode four RNA bases and secondary structure information, the embedded representations output from the transformer layer for a set of RNA sequences were projected into two-dimensional space using t-distributed stochastic neighbour embedding (t-SNE) [37], which is a dimension reduction algorithm for mapping high-dimensional data to low dimensions
This result clearly shows that RNABERT embedding using pretraining with structural alignment learning (SAL) and masked language modelling (MLM) tasks succeeded in encoding base information and secondary structure information
Summary
Unstructured data, such as biological sequences and networks, require an embedding operation that encodes the unstructured data into a high-dimensional numerical vector space. Researchers have attempted to embed DNA, RNA and amino acid sequences effectively using deep representation learning, especially techniques developed in the field of natural language processing [1,2,3]. These studies are based on the idea that nucleotide composition and sequence structure determine the motif and function of a gene sequence, just as the complex grammatical structure of natural language determines the meaning of a sentence. Dna2vec adopts the word2vec technique by defining a k-mer as a word in the DNA sequence. Because dna2vec assumes a sufficiently large vocabulary of distinct words, the four nucleotides (four words) are too few to yield an effective embedding when dna2vec is applied to base-by-base DNA sequence embedding.
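The vocabulary-size argument can be illustrated with a short sketch. The helper names `kmer_tokens` and `vocabulary_size` are hypothetical, and the overlapping, stride-1 tokenisation is an assumption made for illustration rather than a description of how dna2vec itself segments sequences.

```python
def kmer_tokens(seq, k):
    """Split a sequence into overlapping k-mers (stride 1, an assumption)."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def vocabulary_size(k, alphabet="ACGT"):
    """Number of distinct k-mers that can be formed from the alphabet."""
    return len(alphabet) ** k


print(kmer_tokens("ACGTAC", 3))
for k in (1, 3, 6):
    print(k, vocabulary_size(k))
```

With k = 1 the vocabulary is just the four nucleotides, whereas k = 6 already gives 4096 possible words, which is the kind of vocabulary a word2vec-style method expects.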