Unifying generative and discriminative learning principles.

Jens Keilwagen, Jan Grau, Ivo Grosse, Stefan Posch, Marc Strickert

doi:10.1186/1471-2105-11-98

Abstract

Background: The recognition of functional binding sites in genomic DNA remains one of the fundamental challenges of genome research. During the last decades, a plethora of different and well-adapted models has been developed, but little attention has been paid to the development of different and similarly well-adapted learning principles. Only recently has it been noticed that discriminative learning principles can be superior to generative ones in diverse bioinformatics applications as well.

Results: Here, we propose a generalization of generative and discriminative learning principles containing the maximum likelihood, maximum a posteriori, maximum conditional likelihood, maximum supervised posterior, generative-discriminative trade-off, and penalized generative-discriminative trade-off learning principles as special cases, and we illustrate its efficacy for the recognition of vertebrate transcription factor binding sites.

Conclusions: We find that the proposed learning principle helps to improve the recognition of transcription factor binding sites, enabling better computational approaches for extracting as much information as possible from valuable wet-lab data. We make all implementations available in the open-source library Jstacs so that this learning principle can be easily applied to other classification problems in the field of genome and epigenome analysis.

Highlights

The recognition of functional binding sites in genomic DNA remains one of the fundamental challenges of genome research
We present six learning principles that have been proposed in the machine-learning community and that are nowadays used in bioinformatics
We provide a mathematical interpretation of the proposed learning principle and present four case studies illustrating its utility

Summary

Introduction

The recognition of functional binding sites in genomic DNA remains one of the fundamental challenges of genome research. Classification of unlabeled data is one of the main tasks in bioinformatics. For DNA sequence analysis, this classification task is synonymous with the computational recognition of short signal sequences in genomic DNA. Many of the employed algorithms use statistical models for representing the distribution of sequences. These models range from simple models such as the position weight matrix (PWM) model [1,13,14], the weight array matrix (WAM) model [6,8,15], or Markov models of higher order [16,17] to complex models such as Bayesian networks [2,18,19] or Markov random fields [7,20,21]. A wealth of different models has been proposed for different data sets and different biological questions, and it is advisable to carefully choose an appropriate model for

Methods

Results

Conclusion