A study on the application of topic models to motif finding algorithms.

Josep Basha Gutierrez, Kenta Nakai

doi:10.1186/s12859-016-1364-3

Abstract

Background: Topic models are statistical algorithms that try to discover the structure of a set of documents according to the abstract topics contained in them. Here we apply this approach to discovering the structure of the transcription factor binding sites (TFBS) contained in a set of biological sequences, a fundamental problem in molecular biology research for understanding transcriptional regulation. We present two methods that use topic models for motif finding. First, we developed an algorithm in which a set of biological sequences is treated as text documents, and the k-mers contained in them as words, in order to build a correlated topic model (CTM) and iteratively reduce its perplexity. Second, we used the perplexity measurement of CTMs to improve our previous algorithm, which is based on a genetic algorithm and several statistical coefficients.

Results: The algorithms were tested with 56 data sets from four different species and compared to 14 other methods using several coefficients at both nucleotide and site level. Our first approach performed comparably to the other methods studied, especially at site level and in sensitivity scores, in which it scored better than any of the 14 existing tools. For our previous algorithm, the new version with the added perplexity measurement clearly outperformed all of the other methods in sensitivity, at both nucleotide and site level, and in overall performance at site level.

Conclusions: The statistics obtained show that the performance of a motif finding method based on a CTM is satisfactory enough to conclude that the application of topic models is a valid approach for developing motif finding algorithms. Moreover, the addition of topic models to a previously developed method dramatically increased its performance, suggesting that this combined algorithm can be a useful tool for predicting motifs in different kinds of sets of DNA sequences.

Electronic supplementary material: The online version of this article (doi:10.1186/s12859-016-1364-3) contains supplementary material, which is available to authorized users.

Highlights

Topic models are statistical algorithms which try to discover the structure of a set of documents according to the abstract topics contained in them
Addition of topic models to a previously developed algorithm (statistical genetic algorithm (GA)): prior to this study of topic models applied to the motif finding problem, we developed another algorithm with the structure of a GA, which used statistical coefficients as a fitness measurement [5]
The correlated topic model (CTM) algorithm was run with the following parameters: [...]; the statistical GA algorithm was run with the same parameters as in the original study [5]

Summary

Introduction

Topic models are statistical algorithms that try to discover the structure of a set of documents according to the abstract topics contained in them. We apply this approach to the discovery of the structure of the transcription factor binding sites (TFBS) contained in a set of biological sequences, which is a fundamental problem in molecular biology research for the understanding of transcriptional regulation.

Topic models

Based on natural language processing and machine learning, topic models uncover this structure by hierarchical Bayesian analysis [4]. These algorithms examine a set of documents and determine the existing topics and their distribution among the documents, based on the statistical properties of the words of a specific vocabulary in each document.

Application of topic models to the motif finding problem
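The idea of treating sequences as documents and k-mers as words can be illustrated with a minimal Python sketch. It only builds the document-term count matrix that a topic model would take as input; it does not fit a CTM and it is not the authors' implementation. The function names `kmers` and `document_term_counts`, the use of overlapping k-mers, and the choice of k = 3 in the usage note are assumptions made for illustration.

```python
from collections import Counter


def kmers(sequence, k):
    """Return all overlapping k-mers of a sequence, in order of appearance."""
    seq = sequence.upper()
    # A sequence shorter than k yields an empty list.
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def document_term_counts(sequences, k):
    """Treat each sequence as a document and each k-mer as a word.

    Returns the sorted k-mer vocabulary and a matrix with one row per
    sequence and one column per vocabulary entry, holding k-mer counts.
    """
    docs = [Counter(kmers(s, k)) for s in sequences]
    # The vocabulary is the set of k-mers observed in any sequence.
    vocabulary = sorted(set().union(*docs))
    matrix = [[doc[word] for word in vocabulary] for doc in docs]
    return vocabulary, matrix
```

For example, `document_term_counts(["ACGTAC", "CGTA"], 3)` yields a four-word vocabulary (`ACG`, `CGT`, `GTA`, `TAC`) and one count row per sequence. A matrix of this shape is the usual input to topic-modelling libraries, which would then estimate the topics and their per-document distribution.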

Methods

Results

Discussion

Conclusion