A Good-Turing estimator for feature allocation models

Fadhel Ayed, Stefano Favaro, Federico Camerlenghi, Marco Battiston

doi:10.1214/19-ejs1614

Open Access

https://doi.org/10.1214/19-ejs1614

Journal: Electronic Journal of Statistics	Publication Date: Jan 1, 2019
Citations: 6	License type: cc-by

Abstract

Feature allocation models generalize classical species sampling models by allowing every observation to belong to more than one species, now called features. Under the popular Bernoulli product model for feature allocation, we assume $n$ observable samples and consider the problem of estimating the expected number $M_{n}$ of hitherto unseen features that would be observed if one additional individual were sampled. The interest in estimating $M_{n}$ is motivated by numerous applied problems where the sampling procedure is expensive, in terms of time and/or financial resources, and further samples can only be justified by the possibility of recording new, previously unobserved features. We consider a nonparametric estimator $\hat{M}_{n}$ of $M_{n}$ which has the same analytic form as the popular Good-Turing estimator of the missing mass in the context of species sampling models. We show that $\hat{M}_{n}$ admits a natural interpretation both as a jackknife estimator and as a nonparametric empirical Bayes estimator. Furthermore, we give provable guarantees for the performance of $\hat{M}_{n}$ in terms of minimax rate optimality, and we provide an interesting connection between $\hat{M}_{n}$ and the Good-Turing estimator for species sampling. Finally, we derive non-asymptotic confidence intervals for $\hat{M}_{n}$, which are easily computable and do not rely on any asymptotic approximation. Our approach is illustrated with synthetic data and SNP data from the ENCODE sequencing genome project.

Highlights

Feature allocation models generalize classical species sampling models by allowing every observation to belong to more than one species, called features. In particular, every observation is endowed with a finite set of features selected from a collection of features $(F_j)_{j\geq 1}$.
We show that $\hat{M}_n$ admits a natural interpretation both as a jackknife estimator (Quenouille [21] and Tukey [25]) and as a nonparametric empirical Bayes estimator in the sense of Efron and Morris [9].
The Good-Turing estimator first appeared in Good [10] as a nonparametric empirical Bayes estimator under the classical multinomial model for species sampling, i.e., $(Y_1, \ldots, Y_n)$ are $n$ random samples from a population of individuals belonging to a collection of species $(S_j)_{j\geq 1}$ with unknown proportions $(p_j)_{j\geq 1}$ such that $\sum_{j\geq 1} p_j = 1$.
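
As a point of reference for the classical species sampling setting described above, the following is a minimal sketch of the Good-Turing missing mass estimate in its standard form: the number of species seen exactly once, divided by the sample size $n$. The function name `good_turing_missing_mass` and the toy sample are hypothetical and chosen only for illustration; they do not come from the paper.

```python
from collections import Counter


def good_turing_missing_mass(samples):
    """Estimate the probability that the next draw is a previously unseen species.

    Classical Good-Turing form: (number of species observed exactly once) / n.
    """
    n = len(samples)
    if n == 0:
        raise ValueError("need at least one sample")
    counts = Counter(samples)
    # Species that appear exactly once in the sample (singletons).
    singletons = sum(1 for c in counts.values() if c == 1)
    return singletons / n


# Hypothetical toy sample of species labels (illustrative only).
toy_sample = ["a", "b", "a", "c", "d", "a", "b", "e"]
estimate = good_turing_missing_mass(toy_sample)
```

In the toy sample, three of the five observed species are singletons, so the estimate is the ratio of those three singletons to the eight draws.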

Summary

Introduction

Feature allocation models generalize classical species sampling models by allowing every observation to belong to more than one species, called features. The Bernoulli product model, or binary independence model, is arguably the most popular feature allocation model. It models the $i$-th observation as a sequence $Y_i = (Y_{i,j})_{j\geq 1}$ of independent Bernoulli random variables with unknown success probabilities $(p_j)_{j\geq 1}$, with the assumption that $Y_r$ is independent of $Y_s$ for any $r \neq s$. The Beta prior distribution is a reasonable assumption for neutrally evolving variants but may not be appropriate for deleterious mutations. To overcome this drawback, a nonparametric approach to estimate $M_n$ has been proposed in the recent work of Zou et al. [28]. Our work delves into the Good-Turing estimator for feature allocation models, providing theoretical guarantees for its use.
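
The Bernoulli product model above can be simulated directly, which makes the target quantity concrete. The sketch below assumes that "the same analytic form" as Good-Turing means $\hat{M}_n = K_{n,1}/n$, where $K_{n,1}$ is the number of features observed in exactly one of the $n$ samples; this reading, the truncation to finitely many features, the choice of probabilities, and all function names are assumptions made for illustration rather than details taken from this page. Conditional on the sample, the expected number of new features in one more observation is the sum of $p_j$ over features not yet seen, which is computable here only because the simulation knows the $p_j$.

```python
import random


def simulate_bernoulli_product(probs, n, rng):
    """Draw n observations; each is the set of feature indices j with Y_{i,j} = 1."""
    return [{j for j, p in enumerate(probs) if rng.random() < p} for _ in range(n)]


def good_turing_feature_estimate(observations):
    """Assumed estimator form K_{n,1} / n: features seen in exactly one observation, over n."""
    n = len(observations)
    if n == 0:
        raise ValueError("need at least one observation")
    counts = {}
    for obs in observations:
        for j in obs:
            counts[j] = counts.get(j, 0) + 1
    k1 = sum(1 for c in counts.values() if c == 1)
    return k1 / n


def true_missing_mass(probs, observations):
    """Sum of p_j over features absent from every observation (known only in simulation)."""
    seen = set().union(*observations)
    return sum(p for j, p in enumerate(probs) if j not in seen)


# Hypothetical setup: 2000 features with quadratically decaying probabilities.
rng = random.Random(0)
probs = [1.0 / (j + 1) ** 2 for j in range(2000)]
observations = simulate_bernoulli_product(probs, n=200, rng=rng)

m_hat = good_turing_feature_estimate(observations)
m_true = true_missing_mass(probs, observations)
print(f"estimate: {m_hat:.4f}, simulated target: {m_true:.4f}")
```

Comparing `m_hat` with `m_true` over repeated seeds gives a rough feel for how the estimate tracks the unseen-feature mass in this toy setting; it is not a substitute for the guarantees stated in the paper.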

A Good-Turing estimator for $M_n$

Interpretations of $\hat{M}_n$

Optimality of $\hat{M}_n$

Connection to the Good-Turing estimator for species sampling models

A confidence interval for $\hat{M}_n$

A stopping rule for the discovery process

Numerical illustration

Concluding remarks

Nonparametric empirical Bayes
