Abstract

Transcription factors (TFs) play an essential role in molecular biology by regulating gene expression. TF binding sites vary widely, and the large number of possible binding locations makes their detection challenging. Recently, several machine learning approaches using nucleotide sequence data have been applied to classify DNA sequences with respect to Transcription Factor Binding Sites (TFBS). We propose a novel training strategy that replaces the traditional 1D nucleotide-based DNA sequence representation with a 2D topological matrix of sub-nucleotide chemical functional groups, which substantially define the protein-binding ability of DNA fragments. We train convolutional neural networks (CNNs) on this novel Functional Group DNA Representation (FGDR) to solve a TFBS classification task. We compare our results with those of previous nucleotide-based training approaches and show that learning from FGDR data has several benefits for TFBS classification. Moreover, we argue that training deep neural networks on the FGDR representation produces competitive results while requiring only an additional pre-processing conversion step. Finally, we show that an ensemble of models trained on the nucleotide and FGDR representations achieves higher classification performance than any single-input approach.

Highlights

  • Transcription factors (TFs) are proteins that regulate gene expression and play an important role in almost all physiological processes of the cell and in the related molecular mechanisms

  • In the last few years, earlier bioinformatics methods for identifying DNA recognition motifs, based on position weight matrices and other interpretable statistical techniques, have been surpassed by machine learning approaches trained on nucleotide sequence data

  • Because the Functional Group DNA Representation (FGDR) spans a larger input space than nucleotide data, we found that a sufficiently complex or deep CNN is necessary for accurate model performance


Summary

Introduction

Transcription factors (TFs) are proteins that regulate gene expression and play an important role in almost all physiological processes of the cell and in the related molecular mechanisms. TFs detect and bind DNA double-helix strands at TF-specific positions called DNA recognition motifs. Motifs are represented as sequences of the nucleotides A, C, G and T and are typically 4-18 base pairs long. Finding and classifying these motifs is a long-standing problem in molecular and computational biology. In the last few years, earlier bioinformatics methods for identifying DNA recognition motifs, based on position weight matrices and other interpretable statistical techniques, have been surpassed by machine learning approaches trained on nucleotide sequence data. CNNs trained on the novel Functional Group DNA Representation (FGDR) for TFBS classification surpass the performance of other, nucleotide sequence-based methods.
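To make the nucleotide-based baseline concrete, the following is a minimal sketch of the conventional one-hot encoding of an A-C-G-T sequence, together with a sliding-window scan against a position weight matrix. It is an illustration only, not the authors' pipeline: the function names `one_hot` and `pwm_scan`, the toy motif, and the row order A, C, G, T are assumptions made here, and the FGDR conversion itself is not shown because its functional-group layout is not specified in this summary.

```python
import numpy as np

ALPHABET = "ACGT"  # assumed row order; not taken from the paper


def one_hot(seq: str) -> np.ndarray:
    """Encode a DNA string as a 4 x len(seq) matrix, one row per nucleotide."""
    index = {base: i for i, base in enumerate(ALPHABET)}
    encoded = np.zeros((len(ALPHABET), len(seq)), dtype=np.float32)
    for pos, base in enumerate(seq.upper()):
        if base in index:  # ambiguous bases such as N stay all-zero
            encoded[index[base], pos] = 1.0
    return encoded


def pwm_scan(seq: str, pwm: np.ndarray) -> np.ndarray:
    """Score every window of seq against a 4 x w position weight matrix."""
    x = one_hot(seq)
    width = pwm.shape[1]
    n_windows = x.shape[1] - width + 1
    return np.array([np.sum(x[:, i:i + width] * pwm) for i in range(n_windows)])


# Hypothetical toy motif "ACG": weight 1 for the matching base, 0 otherwise.
toy_pwm = one_hot("ACG")
scores = pwm_scan("TACGT", toy_pwm)
best_start = int(np.argmax(scores))  # index of the highest-scoring window
```

A CNN trained on nucleotide data typically consumes the same 4 x L matrix as input, with its first-layer filters playing a role loosely analogous to learned position weight matrices.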

