The folded k-spectrum kernel: A machine learning approach to detecting transcription factor binding sites with gapped nucleotide dependencies.

Bin Liu, Abdulkadir Elmas, Jacqueline M Dresch, Xiaodong Wang

doi:10.1371/journal.pone.0185570

Abstract

Understanding the molecular machinery involved in transcriptional regulation is central to improving our knowledge of an organism’s development, disease, and evolution. The building blocks of this complex molecular machinery are an organism’s genomic DNA sequence and transcription factor proteins. Despite the vast amount of sequence data now available for many model organisms, predicting where transcription factors bind, often referred to as ‘motif detection’, is still incredibly challenging. In this study, we develop a novel bioinformatic approach to binding site prediction. We do this by extending pre-existing SVM approaches in an unbiased way to include all possible gapped k-mers, representing different combinations of complex nucleotide dependencies within binding sites. We show the advantages of this new approach over existing SVM approaches through a rigorous set of cross-validation experiments. We also demonstrate the effectiveness of our new approach by reporting on its improved performance on a set of 127 genomic regions known to regulate gene expression along the anterior-posterior axis in early Drosophila embryos.
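To make the notion of a gapped k-mer concrete, the following is a minimal Python sketch of counting gapped k-mer features and comparing two sequences by the inner product of their count vectors. It is a generic illustration, not the authors' folded k-spectrum kernel: the function names `gapped_kmer_counts` and `spectrum_kernel`, the `-` wildcard symbol, and the choice to keep the first and last positions of each window informative are all assumptions made for this example.

```python
from collections import Counter
from itertools import combinations


def gapped_kmer_counts(seq, k, n_gaps):
    """Count length-k windows of seq with n_gaps interior positions masked.

    Hypothetical helper for illustration only. Masked positions are written
    as '-'. With n_gaps=0 this reduces to ordinary contiguous k-mer counting.
    """
    seq = seq.upper()
    counts = Counter()
    # Assumption: the first and last positions stay informative, so every
    # feature really spans k nucleotides.
    masks = list(combinations(range(1, k - 1), n_gaps))
    for start in range(len(seq) - k + 1):
        window = seq[start:start + k]
        if set(window) - set("ACGT"):
            continue  # skip windows containing ambiguous bases such as N
        for mask in masks:
            chars = list(window)
            for pos in mask:
                chars[pos] = "-"
            counts["".join(chars)] += 1
    return counts


def spectrum_kernel(counts_a, counts_b):
    """Inner product of two feature-count vectors (a spectrum-style kernel)."""
    return sum(n * counts_b[feature] for feature, n in counts_a.items())


# Two sequences that differ at one interior position share no contiguous
# 3-mer, but they do share the gapped 3-mer 'A-G'.
a, b = "ACGT", "AGGT"
contiguous = spectrum_kernel(gapped_kmer_counts(a, 3, 0), gapped_kmer_counts(b, 3, 0))
gapped = spectrum_kernel(gapped_kmer_counts(a, 3, 1), gapped_kmer_counts(b, 3, 1))
```

In this toy case the contiguous kernel value is 0 while the gapped one is 1, which is the kind of similarity between near-identical sites that gapped features are meant to capture. A kernel value computed this way could in principle be supplied to an SVM as a precomputed kernel matrix, though how the paper does this is not described in the text above.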

Highlights

When studying the complex control of gene expression, often the first step is to locate an enhancer, or cis-regulatory module (CRM), within the genome.
We extend the concept of k-spectrum given in [29] by incorporating all types of nucleotide interdependency in k-mers to improve the predictive power of the feature set.
Based on the cross-validation tests described in Fig 2, our algorithm identified three very interesting subsets of these CRMs: Group 1, those that are correctly detected as belonging to the ‘positive sequence set’ by gapped k-mers, but not detected by contiguous k-mers; Group 2, those that are incorrectly identified as belonging to the ‘negative sequence set’ by both gapped k-mers and contiguous k-mers; and

Summary

Introduction

When studying the complex control of gene expression, often the first step is to locate an enhancer, or cis-regulatory module (CRM), within the genome. An enhancer is a non-coding region of DNA, typically located upstream of the promoter region, which binds transcription factor (TF) proteins and subsequently regulates the expression of its target gene. This regulation is extremely important for many of the key processes involved in embryonic development [1, 2]. The discovery of clusters of key TF binding sites has been critical to identifying potential enhancers in the genome, and mapping out the organization of these binding sites has been crucial in guiding our understanding of enhancer function and evolution [2, 3, 4].

Methods

Results

Discussion

Conclusion