Controlling the effect of multiple testing in Big Data

Sabina Denkowska

doi:10.15611/me.2014.10.01

Abstract

Big Data poses a new challenge to statistical data analysis. The enormous growth of available data and their multidimensionality call into question the usefulness of classical methods of analysis. One of the most important stages in Big Data analysis is the verification of hypotheses and conclusions. As the number of hypotheses grows, each tested at significance level α, the risk of erroneously rejecting true null hypotheses increases. Big Data analysts often deal with families consisting of thousands, or even hundreds of thousands, of inferences. Procedures controlling the FWER (family-wise error rate), recommended by Tukey (1953), are effective only for small families of inferences. For the large families of inferences found in Big Data analyses it is better to control the FDR (false discovery rate), that is, the expected value of the fraction of erroneous rejections among all rejections. The paper presents marginal multiple testing procedures that control FDR, as well as an interesting alternative: the joint multiple testing procedure MTP based on resampling (Dudoit, van der Laan 2008). The obvious advantages of the MTP procedure are a wide range of applications, the possibility of choosing the Type I error rate, and easily accessible software (the procedure is implemented in the R multtest package). Unfortunately, the analysis of the MTP procedure by Werft and Benner (2009) revealed problems with controlling FDR in the case of large sets of hypotheses and small samples. The paper presents a simulation experiment conducted to investigate potential limitations of the MTP procedure with large numbers of inferences and large sample sizes, a setting typical of Big Data analyses. The experiment revealed that, regardless of the sample size, problems with controlling FDR occur when multiple testing procedures based on minima of unadjusted p-values are applied. Moreover, the experiment indicated serious instability of the MTP results (dependent on the number of bootstrap samples) when procedures based on minima of unadjusted p-values are used. The experiment described in the paper and the results obtained by Werft and Benner (2009) and Denkowska (2013) indicate the need for further research on the MTP procedure.
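The two ideas the abstract contrasts can be illustrated with a short sketch. The first function shows how quickly the probability of at least one false rejection grows when m true null hypotheses are each tested at level α, under the simplifying assumption that the tests are independent. The second is the Benjamini-Hochberg step-up procedure, used here only as a well-known example of a marginal FDR-controlling procedure; the abstract does not say which marginal procedures the paper covers, and this is not the resampling-based MTP procedure from the multtest package. The function names and the sample p-values are hypothetical.

```python
from typing import List


def fwer_independent(m: int, alpha: float) -> float:
    """Probability of at least one false rejection when m true null
    hypotheses are each tested at level alpha (assumes independent tests)."""
    return 1.0 - (1.0 - alpha) ** m


def benjamini_hochberg(p_values: List[float], q: float) -> List[bool]:
    """Rejection decisions from the Benjamini-Hochberg step-up procedure.

    Rejects the hypotheses with the k smallest p-values, where k is the
    largest rank such that p_(k) <= k * q / m.
    """
    m = len(p_values)
    # Indices of the hypotheses ordered by increasing p-value.
    order = sorted(range(m), key=lambda i: p_values[i])

    # Find the largest rank whose p-value lies under the step-up threshold.
    k_max = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank * q / m:
            k_max = rank

    # Reject every hypothesis ranked at or below k_max.
    reject = [False] * m
    for idx in order[:k_max]:
        reject[idx] = True
    return reject


if __name__ == "__main__":
    # With 100 independent tests at alpha = 0.05, a false rejection is
    # almost certain if all nulls are true.
    print(round(fwer_independent(100, 0.05), 4))

    # Illustrative p-values only; not data from the paper.
    p = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]
    print(benjamini_hochberg(p, q=0.05))
```

In this toy example only the two smallest p-values fall under their step-up thresholds (0.005 and 0.010), so only those two hypotheses are rejected. The Benjamini-Hochberg guarantee is usually stated for independent or positively dependent test statistics, which is one motivation for the joint, resampling-based approach discussed in the abstract.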

Journal: Mathematical Economics
Publication Date: Jan 1, 2014
Citations: 1
