A Random Categorization Model for Hierarchical Taxonomies

Guido D’Amico, Raul Rabadan, Matthew Kleban

doi:10.1038/s41598-017-17168-6

Abstract

A taxonomy is a standardized framework to classify and organize items into categories. Hierarchical taxonomies are ubiquitous, ranging from the classification of organisms to the file system on a computer. Characterizing the typical distribution of items within taxonomic categories is an important question with applications in many disciplines. Ecologists have long sought to account for the patterns observed in species-abundance distributions (the number of individuals per species found in some sample), and computer scientists study the distribution of files per directory. Is there a universal statistical distribution describing how many items are typically found in each category in large taxonomies? Here, we analyze a wide array of large, real-world datasets – including items lost and found on the New York City transit system, library books, and a bacterial microbiome – and discover such an underlying commonality. A simple, non-parametric branching model that randomly categorizes items and takes as input only the total number of items and the total number of categories is quite successful in reproducing the observed abundance distributions. This result may shed light on patterns in species-abundance distributions long observed in ecology. The model also predicts the number of taxonomic categories that remain unrepresented in a finite sample.

Highlights

A taxonomy is a standardized framework to classify and organize items into categories
Characterizing the distribution of the number of individuals of each species found in some area or sample – the so-called species abundance distribution (SAD) – is a long-standing problem in ecology [3]
Large hierarchical classification systems are ubiquitous, and our results illuminate a distinct and novel pattern in the distribution of items among categories in a broad array of such systems that cut across many fields

Summary

Introduction

A taxonomy is a standardized framework to classify and organize items into categories. We study the distribution of items among categories in a variety of large hierarchical classification systems, including disease incidence in 250 million patients, bacterial microbiomes, items for sale on www.amazon.com, books in the Harvard University library system, files per directory on a laptop, and items lost and found on the New York City transit system (Fig. 1(c) and Table 1). In all these datasets, a few categories are very popular while many categories contain only a few items. The shape of the distribution is correlated with how well sampled the dataset is: datasets with a sufficiently large number of items relative to the number of categories are close to Gaussian, while those with many categories containing only a few items are skewed (Fig. 1(b)).
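The effect of sampling described above can be explored with a small simulation. The text here does not spell out the branching rule, so the sketch below makes its own assumptions and should not be read as the authors' implementation: categories are the leaves of a complete binary tree (so their number is a power of two, set by a hypothetical `depth` parameter), and each node passes a uniformly random fraction of its probability to one child and the rest to the other. Items are then dropped onto the leaves according to those probabilities. All function names are illustrative.

```python
import random
from collections import Counter


def leaf_probabilities(depth, rng):
    """Return the probabilities of the 2**depth leaves of a random binary tree.

    Assumption (not from the text): each node passes a uniformly random
    fraction of its probability to the left child and the rest to the right.
    """
    probs = [1.0]
    for _ in range(depth):
        next_level = []
        for p in probs:
            split = rng.random()
            next_level.extend((p * split, p * (1.0 - split)))
        probs = next_level
    return probs


def categorize(n_items, depth, seed=0):
    """Drop n_items into 2**depth categories; return the item count per category."""
    rng = random.Random(seed)
    probs = leaf_probabilities(depth, rng)
    counts = [0] * len(probs)
    # Each item lands in a category with probability equal to that leaf's weight.
    for leaf in rng.choices(range(len(probs)), weights=probs, k=n_items):
        counts[leaf] += 1
    return counts


def octave_histogram(counts):
    """Histogram of floor(log2(count)) over the non-empty categories."""
    return Counter(c.bit_length() - 1 for c in counts if c > 0)


def unrepresented(counts):
    """Number of categories that received no items in this sample."""
    return sum(1 for c in counts if c == 0)


if __name__ == "__main__":
    # Same number of categories (2**8), very different numbers of items.
    for n_items in (200_000, 500):
        counts = categorize(n_items, depth=8)
        print(n_items, "items,", unrepresented(counts), "empty categories")
        for octave, n_cats in sorted(octave_histogram(counts).items()):
            print(f"  2^{octave}: {n_cats}")
```

Running the script with many items per category and then with few gives a rough feel for how the abundance histogram changes with sampling depth, and how many categories end up with no items at all. Whether the resulting shapes match the datasets in Fig. 1 depends on the branching rule assumed here.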

Methods

Results

Conclusion