Abstract
In genomics, a wide range of machine learning methodologies have been investigated to annotate biological sequences for positions of interest such as transcription start sites, translation initiation sites, methylation sites, splice sites and promoter start sites. In recent years, this area has been dominated by convolutional neural networks, which typically outperform previously designed methods as a result of automated scanning for influential sequence motifs. However, those architectures do not allow for the efficient processing of the full genomic sequence. As an improvement, we introduce transformer architectures for whole-genome sequence labeling tasks. We show that these architectures, recently introduced for natural language processing, are better suited for processing and annotating long DNA sequences. We apply existing networks and introduce an optimized method for the calculation of attention from input nucleotides. To demonstrate this, we evaluate our architecture on several sequence labeling tasks and find that it achieves state-of-the-art performance compared with specialized models for the annotation of transcription start sites, translation initiation sites and 4mC methylation in E. coli.
Highlights
In the last 30 years, a major effort has been invested into uncovering the relation between the genome and the biological processes it interacts with
We introduced a novel framework for full genome annotation tasks by applying the transformer-XL network architecture
An improvement in predictive performance was obtained, which indicates that the technique enhances the detection of nucleotide motifs that are relevant to the prediction task
Summary
In the last 30 years, a major effort has been invested into uncovering the relation between the genome and the biological processes it interacts with. Machine learning methodologies play an increasingly important role in the construction of predictive tools. The tasks these tools address include the annotation of genomic positions of relevance, such as transcription start sites, translation initiation sites, methylation sites, splice sites and promoter start sites. In order to create a feasible sample input, only a short fragment of the genome sequence is used to predict the occurrence of these sites. The boundaries of this region with respect to the position of interest are denoted as the receptive field. Existing studies do not use the full genome for training or evaluation. Doing so is too resource-heavy for many machine learning methodologies that were not designed to handle millions of samples. We show that the novel transformer network attains state-of-the-art performance while retaining fast training times.
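The receptive field described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names (extract_window, one_hot), the padding symbol "N" and the A/C/G/T channel order are assumptions made for illustration only. The sketch cuts a fixed-size fragment around a position of interest and encodes it as a one-hot matrix, the kind of short input sample that fragment-based models typically consume.

```python
# Illustrative sketch only; names and conventions are hypothetical,
# not taken from the paper.
NUCLEOTIDES = "ACGT"  # assumed channel order for the one-hot encoding


def extract_window(genome, position, upstream, downstream, pad="N"):
    """Return the fragment spanning `upstream` bases before and
    `downstream` bases after `position` (0-based, inclusive).

    Positions that fall outside the genome are filled with `pad`,
    so every window has length upstream + downstream + 1.
    """
    start = position - upstream
    end = position + downstream + 1  # exclusive end index
    left_pad = pad * max(0, -start)
    right_pad = pad * max(0, end - len(genome))
    core = genome[max(0, start):min(len(genome), end)]
    return left_pad + core + right_pad


def one_hot(fragment):
    """Encode a fragment as one row per base with four 0/1 columns.

    Any symbol outside A/C/G/T (such as the padding symbol)
    becomes an all-zero row.
    """
    return [[int(base == n) for n in NUCLEOTIDES] for base in fragment]


if __name__ == "__main__":
    toy_genome = "ACGTACGTAC"  # toy sequence, not real data
    window = extract_window(toy_genome, position=4, upstream=2, downstream=3)
    print(window)
    print(one_hot(window))
```

Under this framing, every candidate position yields its own fragment, so labeling a whole genome means building and scoring millions of overlapping windows. That cost is what motivates processing the full sequence directly.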
More From: IEEE/ACM Transactions on Computational Biology and Bioinformatics