Nonparametric Assessment of Contamination in Multivariate Data Using Generalized Quantile Sets and FDR

Clayton Scott, Eric Kolaczyk

doi:10.1198/jcgs.2010.08092

Abstract

Large, multivariate datasets from high-throughput instrumentation have become ubiquitous in the sciences. Frequently, it is of interest to characterize the measurements in these datasets by the extent to which they represent ‘nominal’ versus ‘contaminated’ instances. However, often the nature of even the nominal patterns in the data is unknown and potentially quite complex, making their explicit parametric modeling a daunting task. In this article, we introduce a nonparametric method for the simultaneous annotation of multivariate data (called MN-SCAnn), by which one may produce an annotated ranking of the observations, indicating the relative extent to which each may or may not be considered nominal, while making minimal assumptions on the nature of the nominal distribution. In our framework each observation is linked to a corresponding generalized quantile set and, implicitly adopting a hypothesis testing perspective, each set is associated with a test, which in turn is accompanied by a certain false discovery rate. The combination of generalized quantile set methods with false discovery rate principles, in the context of contaminated data, is new, and estimation of the key underlying quantities requires that a number of issues be addressed. We illustrate MN-SCAnn through examples in two contexts: the preprocessing of flow cytometry data in bioinformatics, and the detection of anomalous traffic patterns in Internet measurement studies. This article has supplementary material online.
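The idea of linking each observation to a set, a test, and a false discovery rate can be illustrated with a small sketch. This is not MN-SCAnn itself: the abstract does not give the estimators, so the code below swaps in a simplified stand-in and makes several assumptions that are not from the article. It assumes a clean nominal reference sample is available (split into a fitting part and a calibration part), uses a Gaussian kernel density estimate with a hand-picked bandwidth, takes the set for an observation to be the density level set passing through it, and converts the resulting empirical p-values into Benjamini-Hochberg adjusted values. All function names (`kde`, `empirical_p_values`, `bh_q_values`, `annotate`) are hypothetical.

```python
import math
import random


def kde(point, sample, bandwidth):
    """Gaussian kernel density estimate at `point`, up to a constant factor.

    The normalizing constant is omitted because only the ordering of
    density values matters for the ranking below.
    """
    total = 0.0
    for s in sample:
        sq_dist = sum((a - b) ** 2 for a, b in zip(point, s))
        total += math.exp(-sq_dist / (2.0 * bandwidth ** 2))
    return total / len(sample)


def empirical_p_values(test_points, fit_sample, calib_sample, bandwidth=0.5):
    """Estimate, for each test point, the nominal mass outside its level set.

    ASSUMPTION (not from the article): `fit_sample` and `calib_sample`
    are uncontaminated nominal data. The level set through x is
    {z : f(z) >= f(x)}; the p-value is the fraction of calibration
    points whose density is no larger than f(x), with a +1 correction.
    """
    calib_scores = [kde(c, fit_sample, bandwidth) for c in calib_sample]
    p_values = []
    for x in test_points:
        score = kde(x, fit_sample, bandwidth)
        n_low = sum(1 for c in calib_scores if c <= score)
        p_values.append((1 + n_low) / (1 + len(calib_scores)))
    return p_values


def bh_q_values(p_values):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    q = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        q[i] = running_min
    return q


def annotate(test_points, fit_sample, calib_sample, bandwidth=0.5):
    """Return (index, p_value, q_value) triples, least nominal first."""
    p = empirical_p_values(test_points, fit_sample, calib_sample, bandwidth)
    q = bh_q_values(p)
    ranking = sorted(range(len(test_points)), key=lambda i: p[i])
    return [(i, p[i], q[i]) for i in ranking]


if __name__ == "__main__":
    rng = random.Random(0)
    # Synthetic nominal data: a standard bivariate Gaussian (illustrative only).
    nominal = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(200)]
    fit, calib = nominal[:100], nominal[100:]
    test = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(20)]
    test += [(6.0, 6.0), (-7.0, 5.0)]  # two planted contaminated points
    for idx, p, q in annotate(test, fit, calib)[:5]:
        print(idx, round(p, 4), round(q, 4))
```

The output is an annotated ranking in the spirit of the abstract: observations are ordered from least to most nominal, and each carries an adjusted value that can be read as the false discovery rate level at which it would first be flagged. Splitting the reference sample keeps the density fit separate from the calibration step, so a point is not compared against the data used to score it.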
