Large-scale mode identification and data-driven sciences

Subhadeep Mukhopadhyay

doi:10.1214/17-ejs1229

Abstract

Bump-hunting, or mode identification, is a fundamental problem that arises in almost every scientific field of data-driven discovery. Surprisingly, very few data modeling tools are available for automatic (not requiring manual case-by-case investigation), objective (not subjective), and nonparametric (not based on restrictive parametric model assumptions) mode discovery that can scale to large data sets. This article introduces LPMode, an algorithm based on a new theory for detecting multimodality of a probability density. We apply LPMode to answer important research questions arising in fields ranging from environmental science, ecology, econometrics, and analytical chemistry to astronomy and cancer genomics.

Highlights

Many scientific problems seek to identify modes in the true unknown probability density function f(x) of a variable X, given i.i.d. observations X1, ..., Xn
Two classes of bump-hunting methods currently prevail in the literature, providing insights at different levels of granularity and detail: (i) testing multimodality or deviation from unimodality; (ii) determining how many modes are present in a probability density function
The idea of using kernel density estimates for nonparametric mode identification goes back to the seminal work of Parzen (1962). It was further studied by Silverman (1981), based on the concept of “critical bandwidths” and bootstrapping, an approach known to be highly conservative and non-robust, and to give different answers depending on the calibration technique used
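
The kernel-based idea in the last highlight can be illustrated with a small sketch. It is not LPMode and not code from the paper: it is a plain Gaussian kernel density estimate evaluated on a grid, with a bisection search for an approximate critical bandwidth (the smallest bandwidth at which the estimate has at most k modes). The function names, the toy mixture data, the grid, and the bisection settings are all assumptions made for illustration, and the bootstrap calibration step of Silverman's test is left out.

```python
import math
import random


def gaussian_kde(sample, h, grid):
    """Evaluate a Gaussian kernel density estimate with bandwidth h on a grid."""
    norm = 1.0 / (len(sample) * h * math.sqrt(2.0 * math.pi))
    return [
        norm * sum(math.exp(-0.5 * ((x - xi) / h) ** 2) for xi in sample)
        for x in grid
    ]


def count_modes(density):
    """Count strict interior local maxima of a density evaluated on a grid."""
    return sum(
        1
        for i in range(1, len(density) - 1)
        if density[i - 1] < density[i] and density[i] > density[i + 1]
    )


def critical_bandwidth(sample, k, grid, lo=0.01, iters=20):
    """Bisection for (approximately) the smallest h giving at most k modes.

    The upper end starts at the data range, where a Gaussian KDE is unimodal.
    The search relies on the mode count not increasing as h grows, which holds
    for the Gaussian kernel; on a finite grid the answer is only approximate.
    """
    hi = max(sample) - min(sample)
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if count_modes(gaussian_kde(sample, mid, grid)) <= k:
            hi = mid  # mid is smooth enough: shrink from above
        else:
            lo = mid  # too many bumps: need more smoothing
    return hi


# Hypothetical toy data: an equal mixture of N(0, 1) and N(5, 1).
rng = random.Random(0)
sample = [
    rng.gauss(0.0, 1.0) if rng.random() < 0.5 else rng.gauss(5.0, 1.0)
    for _ in range(150)
]
left, right = min(sample) - 3.0, max(sample) + 3.0
grid = [left + (right - left) * i / 199 for i in range(200)]

h1 = critical_bandwidth(sample, 1, grid)  # smoothing needed to look unimodal
h2 = critical_bandwidth(sample, 2, grid)  # smoothing needed to look bimodal
print("critical bandwidth, k=1:", round(h1, 3))
print("critical bandwidth, k=2:", round(h2, 3))
```

A large critical bandwidth for k = 1 relative to k = 2 suggests that a lot of smoothing is needed to erase the second bump. Turning that observation into a test requires a calibration step, which is where the conservativeness and sensitivity noted above come in.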

Summary

Introduction

Many scientific problems seek to identify modes in the true unknown probability density function f(x) of a variable X, given i.i.d. observations X1, ..., Xn. The goal is to learn and compare the multimodality shape of each variable. This problem of finding structures in the form of hidden bumps arises in many data-intensive sciences. We address the intellectual challenge of developing a novel algorithm for ‘large-scale nonparametric mode exploration’, a problem of outstanding interest at the present time. Two classes of bump-hunting methods currently prevail in the literature, providing insights at different levels of granularity and detail: (i) testing multimodality or deviation from unimodality; (ii) determining how many modes are present in a probability density function. The purpose of this paper is to present a new genre of nonparametric mode identification technique for (iii) comprehensive mode identification: determining the number of modes (along with their locations), as well as standard errors or confidence intervals of the associated mode positions to assess significance and uncertainty.
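
The three tasks (i)-(iii) can be stated compactly as follows. The notation is assumed here for illustration and is not taken from the paper: the mode set \mathcal{M}(f), the mode count M, and the ordered mode locations m_j.

```latex
\[
\mathcal{M}(f) = \{\, m : f \text{ has a local maximum at } m \,\},
\qquad M = |\mathcal{M}(f)| .
\]
\begin{align*}
\text{(i)}\quad   & H_0 : M = 1 \quad \text{versus} \quad H_1 : M > 1, \\
\text{(ii)}\quad  & \text{estimate } M, \\
\text{(iii)}\quad & \text{estimate } M \text{ and } m_1 < \dots < m_M,
                    \text{ with a standard error or confidence interval for each } \widehat{m}_j .
\end{align*}
```

Written this way, (iii) contains (ii), and an estimate of M answers the question posed in (i), which is the sense in which the third task is the more comprehensive one.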

Two modeling cultures

Skew-G density representation

Constructing empirical orthogonal rank polynomials

Estimation and properties

Model denoising

Consistency of local mode estimates

LPMode algorithm and inference

Econometrics

Cancer genomics

Asteroid data

Galaxy color data

Analytical chemistry

Biological science

Philately

Ecological science

Simulation studies

Discussion

Findings

Methods

Journal: Electronic Journal of Statistics	Publication Date: Jan 1, 2017
Citations: 19	License type: cc-by
