Discovery and prediction of protein binding sites in DNA and RNA sequences using Bayesian Markov models

Wanwan Ge

doi:10.53846/goediss-8555

Abstract

Transcription factors control an essential step of gene expression by recognizing over-represented binding sites (or motifs) on the genome. Accurately predicting these binding sites is crucial for understanding the regulatory mechanisms. This thesis approaches this task in three parts. In the first part, I introduce BaMMmotif2, a tool that I have developed to identify motifs de novo from DNA sequencing data. Compared to existing position weight matrix (PWM)-based motif discovery tools, higher-order Bayesian Markov models (BaMMs) have the advantage of learning the interdependence of the nucleotides in transcription factor binding while being fast and having high predictive accuracy. At the core of BaMMs, each higher-order probability is learned by combining the k-mer counts with the probability from the next-lower order, with a pseudo-factor α tuning the weight between the two. I optimize a position- and order-specific pseudo-factor α for higher-order BaMMs. I also introduce a method to learn the positional preferences of the transcription factors. In addition, I apply a masking step to the input sequences so that the model is trained only on the most relevant positions, which helps to distinguish weak motifs when multiple binding motifs are present in the data. In the second part, I introduce a new, improved motif performance score, the average recall (AvRec) score, to guide users in evaluating motif quality. To benchmark existing motif discovery tools, I also develop a validation scheme comprising (I) N-fold cross-validation, (II) cross-platform validation, and (III) cross-cell-line validation. In 5-fold cross-validation, BaMMmotif2 outperforms the selected state-of-the-art tools in this field, with median increases in the AvRec score of at least 13.6% on in vivo data and 12.2% on in vitro data. 
In the cross-cell-line validation on 238 datasets, BaMMmotif2 achieves a median increase of more than 11% in the AvRec score. BaMMs also perform best in the cross-platform validation on 16 datasets. By applying BaMMs for the CTCF motif to scan the whole human genome, I discover 1.5 million CTCF binding sites with high accuracy. This result could lead to a better understanding of the 3D structure of the genome and its biological functions. In the third part, we offer the community an interactive web server with the tool and database: bammmotif.soedinglab.org. It provides four main functionalities: (I) de novo motif discovery from DNA/RNA sequences, (II) finding motif occurrences given a sequence and a motif model, (III) searching the database for known motifs similar to a novel motif model, and (IV) offering databases of higher-order BaMMs for different organisms.
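
The interpolation between k-mer counts and the next-lower order described above can be illustrated with a minimal Python sketch. It is not the BaMMmotif2 implementation: it assumes a position-independent model, one fixed pseudo-factor per order (the thesis optimizes position- and order-specific values), and a uniform prior at order 0. The names `count_kmers`, `interpolated_prob`, and `alphas` are hypothetical and chosen for illustration only.

```python
from collections import Counter

ALPHABET = "ACGT"


def count_kmers(seqs, max_order):
    """Count every k-mer of length 1 .. max_order + 1 in the sequences."""
    counts = Counter()
    for s in seqs:
        for k in range(1, max_order + 2):
            for i in range(len(s) - k + 1):
                counts[s[i:i + k]] += 1
    return counts


def interpolated_prob(counts, context, x, alphas):
    """Estimate P(x | context); the model order is len(context).

    alphas[k] is the (assumed fixed, positive) pseudo-factor for order k.
    The order-k estimate adds alphas[k] pseudo-counts that are distributed
    according to the order-(k-1) estimate, so sparse higher-order counts
    fall back smoothly to the lower order.
    """
    k = len(context)
    if k == 0:
        # Order 0: smooth single-nucleotide counts towards a uniform prior.
        total = sum(counts[a] for a in ALPHABET)
        return (counts[x] + alphas[0] / len(ALPHABET)) / (total + alphas[0])
    # Lower order: drop the nucleotide farthest from x.
    lower = interpolated_prob(counts, context[1:], x, alphas)
    n_context = sum(counts[context + a] for a in ALPHABET)
    return (counts[context + x] + alphas[k] * lower) / (n_context + alphas[k])


seqs = ["ACGTACGT", "ACGAACGT"]   # toy data, not from the thesis
alphas = [1.0, 2.0, 4.0]          # illustrative values for orders 0, 1, 2
counts = count_kmers(seqs, max_order=2)
p_g_given_ac = interpolated_prob(counts, "AC", "G", alphas)
```

With this construction the conditional probabilities for any context sum to one, and a context that never occurs in the data (for example `"TT"` in the toy sequences) returns exactly the lower-order estimate. A larger α shifts weight from the observed k-mer counts towards the lower-order model.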

Full Text