Abstract

In this paper, we revisit the problem of estimating the distance to uniformity of discrete distributions. As a fundamental problem in distribution property estimation, the setting with known alphabet size has been addressed in [1] [3] and is fairly well understood. In particular, letting $k$ be the alphabet size and $\epsilon$ the error tolerance parameter, it has been shown that the corresponding $\epsilon$-minimax sample complexity, i.e., the minimum sample size sufficient to achieve an estimation error of $\epsilon$ even in the worst case, is $\Theta(k/(\epsilon^{2}\log k))$. Surprisingly, the natural setting where the distribution is over an alphabet of unknown size has not been studied. In this work, we propose and study the well-motivated yet unexplored problem of estimating the generalized distance to uniformity, i.e., the distance of an unknown distribution to the closest uniform distribution. We provide both upper and lower bounds for its $(S,\epsilon)$-minimax sample complexity. Specifically, let $p$ be the underlying distribution and $S(p)$ be the support size of the closest uniform distribution to $p$. For $\epsilon\in(4/\sqrt{\log S(p)}, 1]$, we present an estimator that takes $\mathcal{O}(S(p)/(\epsilon^{3}\log S(p)))$ independent samples from the underlying distribution and, with probability $2/3$, estimates its generalized distance to uniformity up to an additive error of $\epsilon$, without knowing the alphabet $\Omega$ or the support size $S(p)$. In addition, the estimator can be computed in time nearly linear in the sample size. In the typical high-precision regime where $\epsilon\in(0,0.15)$, we show that the existence of an $\epsilon$-adaptive estimator implies a lower bound of $\Omega_{\epsilon}(S/\log S)$ on the maximum $(S^{\prime},\epsilon)$-minimax sample complexity over $S^{\prime}\in[S/2,2S]$.
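
To make the quantity being estimated concrete, the following Python sketch computes the generalized distance to uniformity of a fully known distribution by brute force. It is an illustration only and is not the estimator of this paper, which works from samples. It assumes the distance is the $\ell_1$ distance (the abstract does not name the metric), and the function name `generalized_distance_to_uniformity` is hypothetical. For a fixed support size $s$, it places the candidate uniform distribution on the $s$ largest probabilities, which is where the $\ell_1$-closest one sits, and then minimizes over $s$.

```python
def generalized_distance_to_uniformity(p):
    """Return (distance, support_size) of the closest uniform distribution.

    Assumptions (not taken from the abstract): the distance is L1, and `p`
    is a fully known probability vector. Runs in O(n^2) time for clarity.
    """
    # Sort the positive probabilities in decreasing order.
    probs = sorted((x for x in p if x > 0), reverse=True)
    best_dist, best_size = float("inf"), 0
    for s in range(1, len(probs) + 1):
        u = 1.0 / s  # mass of each symbol under the candidate uniform
        inside = sum(abs(q - u) for q in probs[:s])  # symbols in its support
        outside = sum(probs[s:])                     # symbols it gives mass 0
        dist = inside + outside
        if dist < best_dist - 1e-12:  # tolerance avoids floating-point ties
            best_dist, best_size = dist, s
    return best_dist, best_size


if __name__ == "__main__":
    print(generalized_distance_to_uniformity([0.25, 0.25, 0.25, 0.25]))
    print(generalized_distance_to_uniformity([0.9, 0.1]))
```

In this sketch the second returned value plays the role of $S(p)$. Under total variation distance the reported value would be halved, while the minimizing support size would be unchanged.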
