Modeling positional effects of regulatory sequences with spline transformations increases prediction accuracy of deep neural networks.

Žiga Avsec, Julien Gagneur, Jun Cheng, Mohammadamin Barekatain, Inanc Birol

doi:10.1093/bioinformatics/btx727

Abstract

Motivation: Regulatory sequences are not solely defined by their nucleic acid sequence but also by their relative distances to genomic landmarks such as transcription start site, exon boundaries or polyadenylation site. Deep learning has become the approach of choice for modeling regulatory sequences because of its strength to learn complex sequence features. However, modeling relative distances to genomic landmarks in deep neural networks has not been addressed.

Results: Here we developed spline transformation, a neural network module based on splines to flexibly and robustly model distances. Modeling distances to various genomic landmarks with spline transformations significantly increased state-of-the-art prediction accuracy of in vivo RNA-binding protein binding sites for 120 out of 123 proteins. We also developed a deep neural network for human splice branchpoint based on spline transformations that outperformed the current best, already distance-based, machine learning model. Compared to piecewise linear transformation, as obtained by composition of rectified linear units, spline transformation yields higher prediction accuracy as well as faster and more robust training. As spline transformation can be applied to further quantities beyond distances, such as methylation or conservation, we foresee it as a versatile component in the genomics deep learning toolbox.

Availability and implementation: Spline transformation is implemented as a Keras layer in the CONCISE python package: https://github.com/gagneurlab/concise. Analysis code is available at https://github.com/gagneurlab/Manuscript_Avsec_Bioinformatics_2017.

Supplementary information: Supplementary data are available at Bioinformatics online.

Highlights

In recent years, deep learning has proven to be powerful for modeling gene regulatory sequences
We show that across our applications, spline transformation leads to better predictive performance, trains faster and is more robust to initialization than piecewise linear transformations (PLTs), an alternative class of functions based on the popular rectified linear units (ReLUs)
Relative distance to genomic landmarks improves in vivo prediction of RNA-binding protein (RBP) binding sites
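
To make the contrast between the two function classes named in these highlights concrete, here is a minimal sketch in plain Python. It is not the CONCISE implementation. It assumes that a spline transformation can be written as a weighted sum of B-spline basis functions, f(x) = sum_b w_b * B_b(x), and that a piecewise linear transformation can be written as a one-hidden-layer ReLU network. The function names, the clamped uniform knot placement and the default cubic degree are illustrative choices, not details taken from the source.

```python
def clamped_knots(lo, hi, n_basis, degree=3):
    """Uniform knot vector on [lo, hi] with repeated boundary knots (assumed layout)."""
    n_interior = n_basis - degree - 1
    if n_interior < 0:
        raise ValueError("n_basis must be at least degree + 1")
    step = (hi - lo) / (n_interior + 1)
    interior = [lo + step * (i + 1) for i in range(n_interior)]
    return [lo] * (degree + 1) + interior + [hi] * (degree + 1)


def bspline_basis(x, knots, degree=3):
    """Evaluate every B-spline basis function at scalar x (Cox-de Boor recursion)."""
    n_intervals = len(knots) - 1
    # Degree 0: indicator of the half-open knot interval that contains x.
    basis = [1.0 if knots[i] <= x < knots[i + 1] else 0.0
             for i in range(n_intervals)]
    if x == knots[-1]:
        # Close the last non-empty interval so the right boundary is covered.
        last = max(i for i in range(n_intervals) if knots[i] < knots[i + 1])
        basis[last] = 1.0
    # Raise the degree one step at a time; zero-width spans contribute nothing.
    for d in range(1, degree + 1):
        nxt = []
        for i in range(n_intervals - d):
            left_den = knots[i + d] - knots[i]
            right_den = knots[i + d + 1] - knots[i + 1]
            left = (x - knots[i]) / left_den * basis[i] if left_den > 0 else 0.0
            right = ((knots[i + d + 1] - x) / right_den * basis[i + 1]
                     if right_den > 0 else 0.0)
            nxt.append(left + right)
        basis = nxt
    return basis  # length: len(knots) - degree - 1


def spline_transform(x, weights, knots, degree=3):
    """Smooth scalar function: weighted sum of B-spline basis functions."""
    return sum(w * b for w, b in zip(weights, bspline_basis(x, knots, degree)))


def relu_plt(x, w, b, v, c):
    """Piecewise linear alternative: c + sum_k v_k * relu(w_k * x + b_k)."""
    return c + sum(vk * max(0.0, wk * x + bk) for wk, bk, vk in zip(w, b, v))


# Example: 10 cubic basis functions over a distance rescaled to [0, 1].
knots = clamped_knots(0.0, 1.0, n_basis=10, degree=3)
weights = [0.0, 0.2, 0.9, 1.0, 0.6, 0.3, 0.1, 0.0, 0.0, 0.0]  # hypothetical values
effect_at_quarter = spline_transform(0.25, weights, knots)
```

In this sketch each input value activates at most degree + 1 neighbouring basis functions, so every weight controls the curve only locally, whereas each ReLU unit in `relu_plt` changes the slope for all inputs beyond its kink. In a network, `weights` would be trainable parameters.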

Introduction

Deep learning has proven to be powerful for modeling gene regulatory sequences. Improved predictive accuracies have been obtained for a wide variety of applications spanning the modeling of sequences affecting chromatin states (Kelley et al., 2016; Zhou and Troyanskaya, 2015), transcription factor binding (Alipanahi et al., 2015), DNA methylation (Angermueller et al., 2017) and RNA splicing (Leung et al., 2014; Xiong et al., 2015), among others. Using multiple layers of non-linear transformations, deep learning models learn abstract representations of the raw data.

Methods

Results

Conclusion