Abstract

In genomics, a wide range of machine learning methodologies has been investigated for annotating biological sequences at positions of interest such as transcription start sites, translation initiation sites, methylation sites, splice sites and promoter start sites. In recent years, this area has been dominated by convolutional neural networks, which typically outperform earlier methods thanks to their automated scanning for influential sequence motifs. However, these architectures do not allow efficient processing of the full genomic sequence. As an improvement, we introduce transformer architectures for whole-genome sequence labeling tasks. We show that these architectures, recently introduced for natural language processing, are better suited to processing and annotating long DNA sequences. We apply existing networks and introduce an optimized method for calculating attention from input nucleotides. To demonstrate this, we evaluate our architecture on several sequence labeling tasks and find that it achieves state-of-the-art performance when compared with specialized models for the annotation of transcription start sites, translation initiation sites and 4mC methylation in E. coli.
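To make the labeling setup concrete, the following is a minimal sketch of per-nucleotide sequence labeling with a transformer encoder, written in PyTorch. It is not the paper's implementation: the layer sizes, the binary labeling head and the omission of positional encodings are simplifying assumptions for illustration only.

```python
# Minimal sketch (not the paper's implementation) of per-nucleotide
# sequence labeling with a transformer encoder.
import torch
import torch.nn as nn

VOCAB = {"A": 0, "C": 1, "G": 2, "T": 3}

class NucleotideLabeler(nn.Module):
    def __init__(self, d_model=32, n_heads=4, n_layers=2, n_classes=2):
        super().__init__()
        # Positional encodings are omitted here for brevity; a real model
        # needs them (or relative attention) to distinguish positions.
        self.embed = nn.Embedding(len(VOCAB), d_model)
        layer = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
        # One label per input position (e.g. TSS / non-TSS).
        self.head = nn.Linear(d_model, n_classes)

    def forward(self, tokens):             # tokens: (batch, seq_len)
        h = self.encoder(self.embed(tokens))
        return self.head(h)                # (batch, seq_len, n_classes)

seq = torch.tensor([[VOCAB[c] for c in "ACGTACGTAC"]])
logits = NucleotideLabeler()(seq)          # per-position class logits
```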

Highlights

  • In the last 30 years, a major effort has been invested in uncovering the relation between the genome and the biological processes it interacts with

  • We introduced a novel framework for full-genome annotation tasks by applying the Transformer-XL network architecture

  • An improvement in predictive performance was obtained, indicating that the technique enhances the detection of nucleotide motifs relevant to the prediction task


Introduction

In the last 30 years, a major effort has been invested in uncovering the relation between the genome and the biological processes it interacts with. Machine learning methodologies play an increasingly important role in the construction of predictive tools. These tasks include the annotation of genomic positions of relevance, such as transcription start sites, translation initiation sites, methylation sites, splice sites and promoter start sites. To create a feasible sample input, only a short fragment of the genome sequence is used to predict the occurrence of these sites. The boundaries of this region with respect to the position of interest are denoted as the receptive field. Existing studies do not apply the full genome for training or evaluation, as this is too resource-heavy for the many machine learning methodologies that were not designed to handle millions of samples. We show that the transformer network attains state-of-the-art performance while retaining fast training times.
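The resource problem described above is what segment-level recurrence, the core idea of Transformer-XL, addresses: the genome is processed in fixed-length segments, and hidden states from earlier segments are cached so that attention can reach across segment boundaries without paying quadratic cost over the whole sequence. The sketch below illustrates this idea with a simplified single-head attention; the segment length, memory length l_mem and tensor shapes are illustrative assumptions, not the authors' settings.

```python
# Hedged sketch of segment-level recurrence (Transformer-XL style):
# process a long sequence in fixed-length segments, caching hidden
# states from the previous segment as extra attention context.
import torch

l_seg, l_mem, d_model = 512, 512, 32     # illustrative sizes, not the paper's

def process_genome(hidden):              # hidden: (total_len, d_model) embeddings
    memory = torch.zeros(0, d_model)     # cached states from earlier segments
    outputs = []
    for start in range(0, hidden.size(0), l_seg):
        segment = hidden[start:start + l_seg]
        # Attend to the cached memory plus the current segment.
        context = torch.cat([memory, segment], dim=0)
        attn = torch.softmax(segment @ context.T / d_model ** 0.5, dim=-1)
        outputs.append(attn @ context)   # simplified single-head attention
        # Cache the segment (gradient-stopped) for the next iteration.
        memory = segment.detach()[-l_mem:]
    return torch.cat(outputs, dim=0)

genome = torch.randn(2048, d_model)      # stand-in for embedded nucleotides
out = process_genome(genome)             # (2048, d_model), ready for a labeling head
```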

Related work
Transformer Network
Basic model
Multi-head attention
Recurrence
Extension
Experimental setup
Selection of l_mem
Benchmarking
Findings
Conclusions and Future Work