Abstract

Computational biology is replete with high-dimensional (high-D) discrete prediction and inference problems, including sequence alignment, RNA structure prediction, phylogenetic inference, motif finding, prediction of pathways, and model selection problems in statistical genetics. Even though prediction and inference in these settings are uncertain, little attention has been paid to the development of global measures of uncertainty. Regardless of the procedure employed, when a procedure delivers a single answer, that answer is a point estimate selected from the solution ensemble, the set of all possible solutions. For high-D discrete spaces, these ensembles are immense, and thus there is considerable uncertainty. We recommend the use of Bayesian credibility limits to describe this uncertainty, where a 100(1−α)% credibility limit, 0≤α≤1, is the minimum Hamming distance radius of a hyper-sphere containing 100(1−α)% of the posterior distribution. Because sequence alignment is arguably the most extensively used procedure in computational biology, we employ it here to make these general concepts more concrete. The maximum similarity estimator (i.e., the alignment that maximizes the likelihood) and the centroid estimator (i.e., the alignment that minimizes the mean Hamming distance from the posterior weighted ensemble of alignments) are used to demonstrate the application of Bayesian credibility limits to alignment estimators. Application of Bayesian credibility limits to the alignment of 20 human/rodent orthologous sequence pairs and 125 orthologous sequence pairs from six Shewanella species shows that credibility limits of the alignments of promoter sequences of these species vary widely, and that centroid alignments dependably have tighter credibility limits than traditional maximum similarity alignments.
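The credibility limit defined above can be illustrated with a minimal Python sketch. It is not the authors' implementation and rests on several assumptions: each alignment is encoded as a binary vector whose entries indicate whether a given pair of positions is aligned, the posterior is represented by equally weighted samples drawn from it, and the centroid is approximated by a per-coordinate majority vote without checking that the result is a valid alignment. The function names `hamming`, `credibility_limit`, and `centroid` are hypothetical.

```python
import math


def hamming(a, b):
    """Number of coordinates at which two equal-length binary vectors differ."""
    return sum(x != y for x, y in zip(a, b))


def credibility_limit(estimate, samples, alpha):
    """Smallest Hamming radius around `estimate` that contains at least a
    (1 - alpha) fraction of the posterior samples.

    Assumption: `samples` are equally weighted draws from the posterior.
    """
    if not samples:
        raise ValueError("need at least one posterior sample")
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    distances = sorted(hamming(estimate, s) for s in samples)
    # Number of samples the hyper-sphere must contain; the small tolerance
    # guards against floating-point error pushing ceil() one step too high.
    k = math.ceil((1.0 - alpha) * len(distances) - 1e-9)
    if k <= 0:
        return 0
    return distances[k - 1]


def centroid(samples):
    """Per-coordinate majority vote, which minimizes the mean Hamming
    distance to the samples over all binary vectors.

    Assumption: validity of the result as an alignment is not checked.
    """
    n = len(samples)
    return [1 if 2 * sum(column) > n else 0 for column in zip(*samples)]
```

Given a list of sampled vectors `samples`, calling `credibility_limit(centroid(samples), samples, 0.05)` would give the radius of the 95% credibility limit around the majority-vote centroid under these assumptions; passing a different point estimate as the first argument allows a comparison between estimators on the same samples.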

Highlights

  • The study of genomics, and much of computational molecular biology, is about the inference or prediction of discrete, high-dimensional unobserved variables, based on observed data

  • To assess the credibility measures and estimators described above, we examine the local alignments of sequences from (1) 20 orthologous genes between human and rodent, and (2) 24 orthologous genes between six species of Shewanella

  • Credibility limits for human/rodent pairs: the 20 orthologous genes for human/rodent are upregulated in human skeletal muscle tissue, and their upstream sequences have been used in previous studies to locate cis-regulatory modules [36]

Summary

Introduction

The study of genomics, and much of computational molecular biology, is about the inference or prediction of discrete, high-dimensional (high-D) unobserved variables, based on observed data. In RNA secondary structure prediction, the challenge is to select a specific set of base pairs from a combinatorially large collection, as a prediction of the secondary structure of an RNA polymer, given its sequence. In pathway inference, the challenge is to select a set of graph edges to connect genes or their products (nodes) from a combinatorially large collection of possible edge sets, based on gene expression or other data. Model selection problems for studying diseases stemming from multifactorial inheritance are becoming increasingly common in the post-genome era. In these studies, the ultimate goal is to identify the combinations of genes responsible for inherited components of disease etiology based on genetic and/or other post-genome data. Procedures that select the single best scoring solution, such as maximum similarity, maximum likelihood, maximum a posteriori (MAP), or minimum free energy, dominate most of these problems.

