Empirical Error Rate Research Articles

Identifying regional effects of interest in MRI datasets usually entails testing a priori hypotheses across many thousands of brain voxels, which requires controlling for false positive findings across these multiple hypothesis tests. Recent studies have suggested that parametric statistical methods may have incorrectly modeled functional MRI data, thereby leading to false positive rates higher than their nominal rates. Nonparametric methods for statistical inference across multiple tests, in contrast, are thought to produce false positives at the nominal rate, which has led to the suggestion that previously reported studies should reanalyze their fMRI data using nonparametric tools.

To understand better why parametric methods may yield excessive false positives, we assessed their performance when applied both to simulated datasets of 1D, 2D, and 3D Gaussian Random Fields (GRFs) and to 710 real-world, resting-state fMRI datasets. We showed that both the simulated 2D and 3D GRFs and the real-world data contain a small percentage (<6%) of very large clusters (on average 60 times larger than the average cluster size), which were not present in 1D GRFs. These unexpectedly large clusters were deemed statistically significant using parametric methods, leading to empirical familywise error rates (FWERs) as high as 65%. The high empirical FWERs were not a consequence of parametric methods failing to model spatial smoothness accurately, but rather of these very large clusters, which are inherently present in smooth, high-dimensional random fields. In fact, when discounting these very large clusters, the empirical FWER for parametric methods was 3.24%. Furthermore, even an empirical FWER of 65% would yield on average fewer than one of those very large clusters in each brain-wide analysis.

Nonparametric methods, in contrast, estimated distributions from those large clusters and therefore, by construction, rejected the large clusters as false positives at the nominal FWERs. Those rejected clusters were outlying values in the distribution of cluster size but cannot be distinguished from true positive findings without further analyses, including assessing whether the fMRI signal in those regions correlates with other clinical, behavioral, or cognitive measures. Rejecting the large clusters, however, significantly reduced the statistical power of nonparametric methods to detect true findings compared with parametric methods, which would have detected most of the true findings that are essential for making valid biological inferences in MRI data. Parametric analyses, in contrast, detected most true findings while generating relatively few false positives: on average, fewer than one of those very large clusters would be deemed a true finding in each brain-wide analysis. We therefore recommend the continued use of parametric methods that model nonstationary smoothness for cluster-level, familywise control of false positives, particularly when using a Cluster Defining Threshold of 2.5 or higher, followed by rigorous assessment of the biological plausibility of the findings, even for large clusters. Finally, because nonparametric methods yielded a large reduction in statistical power to detect true positive findings, we conclude that the modest reduction in false positive findings that nonparametric analyses afford does not warrant a re-analysis of previously published fMRI studies using nonparametric techniques.
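The notion of an empirical FWER used in the abstract above can be illustrated with a small simulation. The following is a minimal sketch, not the study's pipeline: it smooths white noise into 2D null fields, thresholds each field at a cluster-defining threshold, records the largest cluster per field, and counts the fraction of null fields whose largest cluster exceeds a critical size. All function names, the 24 x 24 grid, the smoothing width, the periodic smoothing boundary, and the 4-neighbour connectivity are assumptions chosen for illustration.

```python
import math
import random


def smooth_noise_field(n, sigma, rng):
    """n x n white Gaussian noise smoothed by a separable Gaussian kernel.

    Smoothing wraps around the edges (periodic boundary, an assumption made
    for brevity). The kernel is scaled so the field keeps unit variance.
    """
    radius = int(3 * sigma)
    kernel = [math.exp(-0.5 * (k / sigma) ** 2) for k in range(-radius, radius + 1)]
    norm = math.sqrt(sum(w * w for w in kernel))
    kernel = [w / norm for w in kernel]
    field = [[rng.gauss(0.0, 1.0) for _ in range(n)] for _ in range(n)]
    for _ in range(2):  # smooth along rows, transpose, repeat for columns
        field = [[sum(w * row[(j + k - radius) % n] for k, w in enumerate(kernel))
                  for j in range(n)] for row in field]
        field = [list(col) for col in zip(*field)]
    return field


def cluster_sizes(field, threshold):
    """Sizes of 4-connected clusters of cells strictly above the threshold."""
    n_rows, n_cols = len(field), len(field[0])
    seen = [[False] * n_cols for _ in range(n_rows)]
    sizes = []
    for i in range(n_rows):
        for j in range(n_cols):
            if seen[i][j] or field[i][j] <= threshold:
                continue
            seen[i][j] = True
            stack, size = [(i, j)], 0
            while stack:  # iterative flood fill
                a, b = stack.pop()
                size += 1
                for da, db in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                    x, y = a + da, b + db
                    if (0 <= x < n_rows and 0 <= y < n_cols
                            and not seen[x][y] and field[x][y] > threshold):
                        seen[x][y] = True
                        stack.append((x, y))
            sizes.append(size)
    return sizes


def simulate_max_cluster_sizes(n_fields, n=24, sigma=1.5, cdt=2.5, seed=0):
    """Largest cluster size in each of n_fields simulated null fields."""
    rng = random.Random(seed)
    return [max(cluster_sizes(smooth_noise_field(n, sigma, rng), cdt), default=0)
            for _ in range(n_fields)]


def empirical_fwer(max_sizes, critical_size):
    """Fraction of null fields with at least one cluster larger than critical_size."""
    return sum(m > critical_size for m in max_sizes) / len(max_sizes)


def critical_size_from_null(max_sizes, alpha=0.05):
    """Critical size taken from the empirical null distribution of the maximum."""
    ordered = sorted(max_sizes)
    return ordered[math.ceil((1 - alpha) * len(ordered)) - 1]


if __name__ == "__main__":
    null_max = simulate_max_cluster_sizes(200)
    crit = critical_size_from_null(null_max)
    print("critical cluster size:", crit)
    print("empirical FWER at that size:", empirical_fwer(null_max, crit))
```

Because `critical_size_from_null` reads the critical size directly from the simulated distribution of maximum cluster sizes, the empirical FWER evaluated on those same fields cannot exceed `alpha`. This mirrors, in a simplified in-sample form, why a threshold estimated from the null data controls false positives "by construction". A critical size derived from a parametric formula carries no such guarantee and could be checked by passing it to `empirical_fwer` instead.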

Background: Next-generation sequencing has been used by investigators to address a diverse range of biological problems through, for example, polymorphism and mutation discovery and microRNA profiling. However, compared to conventional sequencing, the error rates for next-generation sequencing are often higher, which impacts the downstream genomic analysis. Recently, Wang et al. (BMC Bioinformatics 13:185, 2012) proposed a shadow regression approach to estimate the error rates for next-generation sequencing data based on the assumption of a linear relationship between the number of reads sequenced and the number of reads containing errors (denoted as shadows). However, this linear read-shadow relationship may not be appropriate for all types of sequence data. Therefore, it is necessary to estimate the error rates in a more reliable way without assuming linearity. We proposed an empirical error rate estimation approach that employs cubic and robust smoothing splines to model the relationship between the number of reads sequenced and the number of shadows.

Results: We performed simulation studies using a frequency-based approach to generate the read and shadow counts directly, which can mimic the real sequence counts data structure. Using simulation, we investigated the performance of the proposed approach and compared it to that of shadow linear regression. The proposed approach provided more accurate error rate estimations than the shadow linear regression approach for all the scenarios tested. We also applied the proposed approach to assess the error rates for the sequence data from the MicroArray Quality Control project, a mutation screening study, the Encyclopedia of DNA Elements project, and bacteriophage PhiX DNA samples.

Conclusions: The proposed empirical error rate estimation approach does not assume a linear relationship between the error-free read and shadow counts and provides more accurate estimations of error rates for next-generation, short-read sequencing data.

Electronic supplementary material: The online version of this article (doi:10.1186/s12859-016-1052-3) contains supplementary material, which is available to authorized users.

Articles published on Empirical Error Rate

Bayesian Estimation of Allele-Specific Expression in the Presence of Phasing Uncertainty.

Homogeneity Test of Many-to-One Relative Risk Ratios in Unilateral and Bilateral Data with Multiple Groups

Well-calibrated confidence measures for multi-label text classification with a large number of labels

Classification of Gaussian spatio-temporal data with stationary separable covariances

Towards Continuous and Ambulatory Blood Pressure Monitoring: Methods for Efficient Data Acquisition for Pulse Transit Time Estimation.

Statistical Agnostic Mapping: A framework in neuroimaging based on concentration inequalities

A Test for Independence in High-Dimensional Normal Data

Covert Device Association Among Colluding Apps via Edge Processor Workload

Secure Information Fusion using Local Posterior for Distributed Cyber-Physical Systems

Computational Dissociation of Dopaminergic and Cholinergic Effects on Action Selection and Inhibitory Control

Evaluating Equivalence Testing Methods for Measurement Invariance

Error rates in proficiency testing in Australia

Cluster-level statistical inference in fMRI datasets: The unexpected behavior of random fields in high dimensions

Class of novel broadband chaos‐based coherent communication systems

The Analysis of Australian Proficiency Test Data over a Ten-Year Period

Cluster failure: Why fMRI inferences for spatial extent have inflated false-positive rates

Empirical estimation of sequencing error rates using smoothing splines.

Reliable Robust Regression Diagnostics

Chaos‐based BPSK communication system

Definition of loss functions for learning from imbalanced data to minimize evaluation metrics.
