Abstract

E-values have been the dominant statistic for protein sequence analysis for the past two decades, from identifying statistically significant local sequence alignments to evaluating matches to hidden Markov models describing protein domain families. Here we formally show that for “stratified” multiple hypothesis testing problems (that is, those in which statistical tests can be partitioned naturally), controlling the local False Discovery Rate (lFDR) per stratum, or partition, yields the most predictions across the data at any given threshold on the FDR or E-value over all strata combined. For the important problem of protein domain prediction, a key step in characterizing protein structure, function and evolution, we show that stratifying statistical tests by domain family yields excellent results. We develop the first FDR-estimating algorithms for domain prediction, and evaluate how well thresholds based on q-values, E-values and lFDRs perform in domain prediction using five complementary approaches for estimating empirical FDRs in this context. We show that stratified q-value thresholds substantially outperform E-values. Contradicting our theoretical results, q-values also outperform lFDRs; however, our tests reveal a small but coherent subset of domain families, biased towards models for specific repetitive patterns, for which weaknesses in random sequence models yield notably inaccurate statistical significance measures. For the remaining families, which have as-expected noise, lFDR thresholds outperform q-values, suggesting that further improvements in domain predictions can be achieved with improved modeling of random sequences. Overall, our theoretical and empirical findings suggest that the use of stratified q-values and lFDRs could result in improvements in a host of structured multiple hypothesis testing problems arising in bioinformatics, including genome-wide association studies, orthology prediction, and motif scanning.
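For orientation, the three statistics compared above can be written in their standard textbook forms. This is a reference sketch, not the notation of the paper itself: the symbols N (number of tests), s (score), V and R (false and total discoveries), π0 (null proportion), and f0, f1 (null and alternative densities) are all assumptions introduced here.

```latex
% E-value: expected number of null hits scoring at least s among N tests
E(s) = N \cdot \Pr(S \ge s \mid H_0)

% FDR: expected fraction of false discoveries V among all discoveries R
\mathrm{FDR} = \mathbb{E}\!\left[ \frac{V}{\max(R, 1)} \right]

% q-value of a test with p-value p_i: smallest FDR at which it is called significant
q(p_i) = \min_{t \ge p_i} \mathrm{FDR}(t)

% local FDR: posterior probability that a test with statistic t is null
\mathrm{lFDR}(t) = \frac{\pi_0 f_0(t)}{\pi_0 f_0(t) + (1 - \pi_0) f_1(t)}
```

The q-value describes the set of all tests at least as significant as a given one, whereas the lFDR describes that single test, which is why the two can rank predictions differently across strata.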

Highlights

  • For the important problem of protein domain prediction, a key step in characterizing protein structure, function and evolution, we show that stratifying statistical tests by domain family yields excellent results

  • We develop the first False Discovery Rate (FDR)-estimating algorithms for domain prediction, and evaluate how well thresholds based on q-values, E-values and local FDR (lFDR) perform in domain prediction using five complementary approaches for estimating empirical FDRs in this context

  • For the task of domain prediction, an important component of efforts to annotate protein sequences and identify remote sequence homologs, we empirically show that our stratified analysis of statistical significance greatly improves upon a combined analysis
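The difference between a stratified and a combined analysis can be illustrated with a minimal Python sketch. It is not the algorithm developed in the paper: it assumes per-test p-values are already available, uses the Benjamini-Hochberg adjustment as a conservative q-value estimate (null proportion taken to be 1), and the function names `bh_qvalues` and `stratified_qvalues` are hypothetical.

```python
from collections import defaultdict


def bh_qvalues(pvalues):
    """Benjamini-Hochberg q-value estimates (assumes null proportion pi0 = 1)."""
    m = len(pvalues)
    # Indices of the tests, ordered from smallest to largest p-value.
    order = sorted(range(m), key=lambda i: pvalues[i])
    qvalues = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so q-values are monotone in p.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        qvalues[i] = running_min
    return qvalues


def stratified_qvalues(pvalues, strata):
    """Estimate q-values separately within each stratum (e.g. domain family)."""
    groups = defaultdict(list)
    for idx, label in enumerate(strata):
        groups[label].append(idx)
    qvalues = [0.0] * len(pvalues)
    for members in groups.values():
        stratum_q = bh_qvalues([pvalues[i] for i in members])
        for i, q in zip(members, stratum_q):
            qvalues[i] = q
    return qvalues
```

As a toy usage example (made-up numbers, not data from the paper), take p-values `[0.001, 0.002, 0.03, 0.5, 0.04, 0.6, 0.7, 0.9]` with the first four tests in stratum `"A"` and the last four in stratum `"B"`. At a q-value threshold of 0.05, `bh_qvalues` on the combined list accepts two tests, while `stratified_qvalues` accepts three, because the signal-rich stratum is no longer penalized for the tests in the noisy one.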


Summary

Introduction

The evaluation of statistical significance is crucial in genome-wide studies, such as detecting differentially expressed genes in microarray or proteomic studies, performing genome-wide association studies, and uncovering homologous sequences. Accurate statistics for pairwise alignments and their use in database search [1,2,3] were introduced with the use of random sequence models and E-values two decades ago [4,5]. While different approaches to detect sequence similarity have relied on a variety of statistics, including bit scores [13,14] and Z-scores [3], most modern approaches are based on E-values. Detecting sequence similarity in order to uncover homologous relationships between proteins remains the single most powerful tool for function prediction. Many modern sequence similarity approaches are based on identifying domains, which are fundamental units of protein structure, function, and evolution. Homologous domains are grouped into “families” that may be associated with specific functions and structures, and these domain families organize protein space. Although HMM-based software, such as the state-of-the-art HMMER program [18], has features that make it superior to its predecessors, accurate significance measures arose only recently [19].

