Novel Transformer Networks for Improved Sequence Labeling in genomics.

Jim Clauwaert, Willem Waegeman

doi:10.1109/tcbb.2020.3035021

Abstract

In genomics, a wide range of machine learning methodologies have been investigated to annotate biological sequences for positions of interest such as transcription start sites, translation initiation sites, methylation sites, splice sites and promoter start sites. In recent years, this area has been dominated by convolutional neural networks, which typically outperform previously-designed methods as a result of automated scanning for influential sequence motifs. However, those architectures do not allow for the efficient processing of the full genomic sequence. As an improvement, we introduce transformer architectures for whole genome sequence labeling tasks. We show that these architectures, recently introduced for natural language processing, are better suited for processing and annotating long DNA sequences. We apply existing networks and introduce an optimized method for the calculation of attention from input nucleotides. To demonstrate this, we evaluate our architecture on several sequence labeling tasks, and find it to achieve state-of-the-art performances when comparing it to specialized models for the annotation of transcription start sites, translation initiation sites and 4mC methylation in E. coli.

Highlights

In the last 30 years, a major effort has been invested into uncovering the relation between the genome and the biological processes it interacts with
We introduced a novel framework for full genome annotation tasks by applying the transformer-XL network architecture
An improvement in predictive performance was obtained, which indicates that the technique enhances the detection of nucleotide motifs that are relevant to the prediction task

Summary

Introduction

In the last 30 years, a major effort has been invested into uncovering the relation between the genome and the biological processes it interacts with. Machine learning methodologies play an increasingly important role in the construction of predictive tools. The tasks these tools address include the annotation of genomic positions of relevance, such as transcription start sites, translation initiation sites, methylation sites, splice sites and promoter start sites. In order to create a feasible sample input, only a short fragment of the genome sequence is used to predict the occurrence of these sites. The boundaries of this region with respect to the position of interest are denoted as the receptive field. Existing studies do not apply the full genome for training or evaluation. This task is too resource-heavy for a multitude of machine learning methodologies that have not been created to handle millions of samples. We show that the novel transformer network attains state-of-the-art performance while retaining fast training times.
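The notion of a receptive field can be illustrated with a minimal sketch. The function name `extract_window`, the `upstream` and `downstream` parameters, the `N` padding symbol and the toy sequence are all assumptions made for illustration; none of them come from the paper, and this is not the authors' preprocessing code.

```python
def extract_window(genome: str, position: int, upstream: int,
                   downstream: int, pad: str = "N") -> str:
    """Return the fragment spanning position - upstream .. position + downstream.

    Positions that fall outside the genome are filled with `pad`
    (a hypothetical choice), so every sample has the same length.
    """
    if not 0 <= position < len(genome):
        raise ValueError("position lies outside the genome")
    start = position - upstream
    end = position + downstream + 1  # exclusive end index
    left_pad = pad * max(0, -start)
    right_pad = pad * max(0, end - len(genome))
    return left_pad + genome[max(0, start):min(len(genome), end)] + right_pad


# Toy example: a receptive field of 3 nucleotides upstream and 2 downstream.
genome = "ATGCGTACGTTAGC"
window = extract_window(genome, position=5, upstream=3, downstream=2)
print(window)  # GCGTAC
```

Each labeled position yields one such fixed-length sample, which is why annotating every position of a full genome produces millions of inputs.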

Related work

Transformer Network

Basic model

Multi-head attention

Recurrence

Extension

Experimental setup

Selection of l_mem

Benchmarking

Findings

Conclusions and Future Work

Journal: IEEE/ACM Transactions on Computational Biology and Bioinformatics
Publication Date: Oct 30, 2020
Citations: 18
License type: CC BY 4.0
