Abstract

Background: The use of current high-throughput genetic, genomic and post-genomic data leads to the simultaneous evaluation of a large number of statistical hypotheses and, at the same time, to the multiple-testing problem. As an alternative to the overly conservative Family-Wise Error Rate (FWER), the False Discovery Rate (FDR) has emerged over the last decade as a more appropriate way to handle this problem. One drawback of the FDR, however, is that it is tied to a given rejection region: it attributes the same value to statistics that lie close to the rejection boundary and to those that lie far from it. The local FDR has therefore been proposed recently to quantify, for each individual null hypothesis, the probability that it is true.

Results: In this context we present a semi-parametric approach based on kernel estimators, which we apply to several types of high-throughput biological data: patterns in DNA sequences, gene expression and genome-wide association studies.

Conclusion: The proposed method has several practical advantages over existing approaches: it accommodates complex heterogeneities in the alternative hypothesis, it can take prior information into account (from expert judgment or previous studies) through a semi-supervised mode, and it handles truncated distributions such as those obtained in Monte-Carlo simulations. The method is implemented in the R package kerfdr, available through CRAN.
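The local FDR idea summarized above can be illustrated with a minimal sketch. The assumptions here are mine, not the paper's: a two-component mixture in which the null density f0 is a known standard normal and the null proportion pi0 is known, with the marginal density f estimated non-parametrically by a Gaussian kernel. The local FDR at a point t is then pi0 * f0(t) / f(t). This is only an illustration of the quantity being estimated, not the kerfdr package's actual implementation, which estimates these components from the data.

```python
import numpy as np
from scipy.stats import norm, gaussian_kde

rng = np.random.default_rng(0)

# Simulated test statistics: 80% null N(0,1), 20% alternative N(3,1).
# pi0 is treated as known here for illustration only.
pi0 = 0.8
n = 5000
is_null = rng.random(n) < pi0
x = np.where(is_null, rng.normal(0.0, 1.0, n), rng.normal(3.0, 1.0, n))

# Marginal density f estimated by a Gaussian kernel (the non-parametric part).
f_hat = gaussian_kde(x)

def local_fdr(t):
    """Local FDR at t: estimated probability that a statistic at t is null."""
    return np.clip(pi0 * norm.pdf(t) / f_hat(t), 0.0, 1.0)

# Statistics near 0 are almost surely null; far from 0 they are almost surely not.
print(local_fdr(np.array([0.0, 2.0, 4.0])))
```

Unlike a global FDR attached to a whole rejection region, this quantity differs for a statistic at 2.0 and one at 4.0, which is exactly the distinction motivating the local FDR.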

Highlights

  • The use of current high-throughput genetic, genomic and post-genomic data leads to the simultaneous evaluation of a large number of statistical hypotheses and, at the same time, to the multiple-testing problem

  • Multiple-testing problems occur in many bioinformatic studies where we consider a large set of biological objects and want to test a null hypothesis H for each object

  • In the last decade the False Discovery Rate (FDR) criterion introduced in [2] has received the most attention, as it is less conservative than the Family-Wise Error Rate (FWER)



Introduction

The use of current high-throughput genetic, genomic and post-genomic data leads to the simultaneous evaluation of a large number of statistical hypotheses and, at the same time, to the multiple-testing problem. Multiple-testing problems occur in many bioinformatic studies where we consider a large set of biological objects (genes, SNPs, DNA patterns, etc.) and want to test a null hypothesis H for each object. The control of the number of false positives, i.e. falsely rejected hypotheses, is the crucial issue in multiple testing. To this end, several error rates, such as the Family-Wise Error Rate (FWER) or the False Discovery Rate (FDR), have emerged and various strategies to control these criteria have been developed (see [1] for a review). However, the FDR is a global criterion that cannot be used to assess the reliability of a specific hypothesis, i.e. that of a given gene, SNP or pattern.
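The contrast between FWER and FDR control can be made concrete with the two classical procedures: Bonferroni (FWER) rejects only p-values below alpha/m, while the Benjamini-Hochberg step-up procedure (FDR, introduced in [2]) compares the sorted p-values to an increasing sequence of thresholds. A minimal sketch, with a made-up set of p-values chosen for illustration:

```python
import numpy as np

def bonferroni(pvals, alpha=0.05):
    """FWER control: reject only p-values below alpha / m."""
    pvals = np.asarray(pvals)
    return pvals < alpha / len(pvals)

def benjamini_hochberg(pvals, alpha=0.05):
    """FDR control: Benjamini-Hochberg step-up procedure."""
    pvals = np.asarray(pvals)
    m = len(pvals)
    order = np.argsort(pvals)
    # Compare the k-th smallest p-value to alpha * k / m.
    thresholds = alpha * np.arange(1, m + 1) / m
    passed = pvals[order] <= thresholds
    k = np.max(np.nonzero(passed)[0]) + 1 if passed.any() else 0
    reject = np.zeros(m, dtype=bool)
    reject[order[:k]] = True  # reject the k smallest p-values
    return reject

pvals = [0.001, 0.008, 0.012, 0.041, 0.042, 0.06, 0.3, 0.9]
# Bonferroni rejects 1 hypothesis, Benjamini-Hochberg rejects 3.
print(bonferroni(pvals).sum(), benjamini_hochberg(pvals).sum())
```

On these p-values the FDR procedure rejects three hypotheses where the FWER procedure rejects only one, which is what "less conservative" means in practice. Note, though, that both return a single global decision rule; neither assigns an individual reliability to each hypothesis, which is the gap the local FDR fills.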

