Abstract

A general theory is described for deciding which of two modelled hypotheses (each of which may depend on unknown parameters) best fits each of a set of data sets, such that the average power is maximized. Statistical independence between the large number of data sets of the same type is assumed, so error rates can be expressed as proportions and the continuous approach to the data model is used. The framework of decision theory is used and the equivalence between different criteria for optimization is demonstrated. General procedures are shown to satisfy this criterion in the cases when each hypothesis has a finite number of unknown parameters, and when the alternative hypothesis is vacuous. If the null hypothesis is determined by a known distribution of a test statistic, this reduces to using the density of the $p$-values of this test statistic as the final test statistic to rank the data into significance order. For two scenarios, one of three density estimation methods based on the kernel density estimate gave a result almost equivalent in power to the likelihood ratio test, which uses full knowledge of the null and alternative models, and compared favourably with the optimal discovery procedure (ODP) and its iterated variant. For gene expression data from microarrays and, more recently, RNA-Seq experiments, where the data for different genes are not generally independent, it is suggested that this technique be used with the $p$-values from methods such as Surrogate Variable Analysis, which removes much of the effect of dependence.
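The central step in the independent-test case can be illustrated with a short sketch. The code below is only an illustration of the general idea, not the paper's own estimators: it assumes the per-data-set $p$-values from the known null distribution of the test statistic are already available, estimates their marginal density with an off-the-shelf Gaussian kernel density estimate (with reflection at the [0, 1] boundaries), and ranks the data sets by decreasing estimated density, since under the null the $p$-value density is uniform and an excess marks likely alternatives.

```python
# Minimal sketch of ranking data sets by the estimated density of their p-values.
# The kernel estimator used here (Gaussian KDE with reflection at the [0, 1]
# boundaries) is an illustrative assumption, not one of the estimators in the paper.
import numpy as np
from scipy.stats import gaussian_kde

def rank_by_p_value_density(p_values):
    """Return indices of the data sets ordered from most to least significant."""
    p = np.asarray(p_values, dtype=float)
    # Reflect the sample about 0 and 1 so the KDE is not biased down at the edges.
    augmented = np.concatenate([p, -p, 2.0 - p])
    kde = gaussian_kde(augmented)
    # The factor 3 compensates for the two reflected copies, so the estimate
    # integrates to roughly 1 over [0, 1].
    f = 3.0 * kde(p)
    # Under the null the p-value density is uniform (f = 1); larger estimated
    # density marks regions enriched with alternatives, so rank by decreasing f.
    return np.argsort(-f)

# Example: 10000 independent tests, 20% of which are true alternatives.
rng = np.random.default_rng(0)
p_null = rng.uniform(size=8000)
p_alt = rng.beta(0.3, 4.0, size=2000)      # alternatives pile up near zero
order = rank_by_p_value_density(np.concatenate([p_null, p_alt]))
```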

Highlights

  • There is continuing interest in multiple hypothesis testing procedures, resulting from the desire to maximize efficiency in the simultaneous testing of large numbers of data sets such as microarray data and, more recently, RNA-Seq data (Wang, Gerstein, & Snyder, 2009), which after extensive pre-processing (Givan, Bottoms, & Spollen, 2012) yield the expression levels of a very large number of genes simultaneously from RNA samples.

  • To assess the practicality of the p-value density method for multiple hypothesis testing, three versions of this method, corresponding to the three density estimates (46, 47 and 48) and based on the binomial test, were compared with using the p-values from the underlying binomial test, with the likelihood ratio test, and with the optimal discovery procedure (ODP) (Storey, 2007) and its iterative extension (Nixon, 2012), the latter two both using the actual and estimated values of r in the binomial test on which these methods are based.

  • It is clear from these results that the improvement due to iteration of the ODP (Nixon, 2012) over the binomial test was mainly due to r being re-estimated, as a value close to 0.3, from the least significant data only, i.e. those most likely to be nulls, rather than being fitted from the entire data set as a value close to r = 0.4, which was the value used in the binomial test reported; a sketch of this re-estimation step is given after this list.
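As a concrete illustration of this last point, the following hypothetical sketch re-estimates r from the least significant half of the data sets only, as opposed to fitting it from all data sets pooled together. The data layout (a count of successes out of n trials per data set), the 50% cut-off, and the particular parameter values are illustrative assumptions, not the procedure or the values reported in the paper.

```python
# Hypothetical sketch: refit the null success probability r using only the data
# sets whose binomial-test p-values are largest (those most likely to be nulls),
# instead of pooling all data sets, which is biased by the alternatives.
import numpy as np
from scipy.stats import binomtest

def estimate_r_from_least_significant(k, n, r_initial, keep_fraction=0.5):
    """Refit r from the data sets with the largest binomial-test p-values."""
    k = np.asarray(k)
    pvals = np.array([binomtest(int(ki), n=n, p=r_initial).pvalue for ki in k])
    least_significant = pvals >= np.quantile(pvals, 1.0 - keep_fraction)
    # Maximum-likelihood estimate of r restricted to the (mostly null) subset.
    return k[least_significant].sum() / (n * least_significant.sum())

# The pooled fit of r is pulled upwards by the alternatives; refitting from the
# least significant half moves it back towards the true null value.
rng = np.random.default_rng(1)
n = 20
k_null = rng.binomial(n, 0.3, size=800)    # true nulls with r = 0.3
k_alt = rng.binomial(n, 0.6, size=200)     # alternatives inflate the pooled fit
k = np.concatenate([k_null, k_alt])
r_pooled = k.sum() / (n * k.size)
r_refit = estimate_r_from_least_significant(k, n, r_pooled)
```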



Introduction

There is continuing interest in multiple hypothesis testing procedures, resulting from the desire to maximize efficiency in the simultaneous testing of large numbers of data sets such as microarray data and, more recently, RNA-Seq data (Wang, Gerstein, & Snyder, 2009), which after extensive pre-processing (Givan, Bottoms, & Spollen, 2012) yield the expression levels of a very large number of genes simultaneously from RNA samples. Much recent theoretical work involves correction for effects not explicitly modelled that cause correlation among the data for individual tests (Leek & Storey, 2007; Leek & Storey, 2008; Lunceford et al., 2011; Chakraborty, S. Datta, & S. Datta, 2012), while the simpler problem of handling independent tests (Storey, 2007; Hwang & Liu, 2010) does not seem to have been fully explored in its practical implementation when one or both hypotheses have unknown (hyper)parameters (Nixon, 2012). This may be because the perceived need to deal with dependence makes such a study seem almost irrelevant. However, it has been shown theoretically that data sets are independent after the extraction of the surrogate variables (Leek & Storey, 2008), undermining this perception and demonstrating that a thorough analysis of the simpler case, where the separate data sets are independent, is of fundamental importance.

