Streaming histogram sketching for rapid microbiome analytics

Will PM Rowe, Alex Shaw, J Simon Kroll, Shabhonam Caim, Anna Paola Carrieri, Lindsay J Hall, Martyn D Winn, Kathleen Sim, Edward O Pyzer-Knapp, Cristina Alcon-Giner

doi:10.1186/s40168-019-0653-2

Abstract

Background: The growth in publicly available microbiome data in recent years has yielded an invaluable resource for genomic research, allowing for the design of new studies, augmentation of novel datasets and reanalysis of published works. This vast amount of microbiome data, as well as the widespread proliferation of microbiome research and the looming era of clinical metagenomics, means there is an urgent need to develop analytics that can process huge amounts of data in a short amount of time. To address this need, we propose a new method for the compact representation of microbiome sequencing data using similarity-preserving sketches of streaming k-mer spectra. These sketches allow for dissimilarity estimation, rapid microbiome catalogue searching and classification of microbiome samples in near real time.

Results: We apply streaming histogram sketching to microbiome samples as a form of dimensionality reduction, creating a compressed ‘histosketch’ that can efficiently represent microbiome k-mer spectra. Using public microbiome datasets, we show that histosketches can be clustered by sample type using pairwise Jaccard similarity estimation, consequently allowing for rapid microbiome similarity searches via a locality-sensitive hashing indexing scheme. Furthermore, we use a ‘real life’ example to show that histosketches can train machine learning classifiers to accurately label microbiome samples. Specifically, using a collection of 108 novel microbiome samples from a cohort of premature neonates, we trained and tested a random forest classifier that could accurately predict whether the neonate had received antibiotic treatment (97% accuracy, 96% precision) and could subsequently be used to classify microbiome data streams in less than 3 s.

Conclusions: Our method offers a new approach to rapidly process microbiome data streams, allowing samples to be quickly clustered, indexed and classified. We also provide our implementation, Histosketching Using Little K-mers (HULK) (https://github.com/will-rowe/hulk), which can histosketch a typical 2 GB microbiome in 50 s on a standard laptop using four cores, with the sketch occupying 3000 bytes of disk space.

Highlights

The growth in publicly available microbiome data in recent years has yielded an invaluable resource for genomic research, allowing for the design of new studies, augmentation of novel datasets and reanalysis of published works.
Machine learning (ML) frameworks will struggle to use these de novo outputs as feature vectors because of their scale. This is a potential barrier to the use of these methods in microbiome analytics, as ML can help solve many of the data problems encountered in genomics and holds great potential for the field [12].
Clustering microbiome datasets: We begin by assessing the speed and ability of Histosketching Using Little K-mers (HULK) to cluster metagenomes based on pairwise similarities.
Indexing microbiome collections: We test the locality-sensitive hashing (LSH) forest self-tuning indexing scheme as applied to HULK histosketches.
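
To make the indexing idea above concrete, the following is a minimal sketch of locality-sensitive hashing over fixed-length signatures. It is a simplified stand-in, not the LSH forest scheme or the histosketches named in the highlights: it uses plain MinHash signatures and a fixed banding layout, and every name in it (`minhash_signature`, `BandedLSHIndex`, the `bands` and `rows` parameters) is hypothetical and chosen only for illustration.

```python
import hashlib
from collections import defaultdict


def minhash_signature(items, num_hashes=32):
    """Plain MinHash: for each seed, keep the smallest hash over all items.

    Assumption: `items` is a non-empty collection of strings (e.g. k-mers).
    """
    signature = []
    for seed in range(num_hashes):
        salt = seed.to_bytes(8, "little")  # a different salt per hash function
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(item.encode(), digest_size=8, salt=salt).digest(),
                "little",
            )
            for item in items
        ))
    return tuple(signature)


class BandedLSHIndex:
    """Toy banded LSH index: signatures sharing any full band become candidates."""

    def __init__(self, bands=8, rows=4):
        self.bands = bands
        self.rows = rows
        self.tables = [defaultdict(set) for _ in range(bands)]

    def _band_keys(self, signature):
        if len(signature) != self.bands * self.rows:
            raise ValueError("signature length must equal bands * rows")
        for b in range(self.bands):
            yield b, signature[b * self.rows:(b + 1) * self.rows]

    def add(self, name, signature):
        for b, key in self._band_keys(signature):
            self.tables[b][key].add(name)

    def query(self, signature):
        """Return the names of stored signatures that collide in at least one band."""
        hits = set()
        for b, key in self._band_keys(signature):
            hits |= self.tables[b].get(key, set())
        return hits


# Usage: index one sample, then look it up with an identical signature.
index = BandedLSHIndex(bands=8, rows=4)
index.add("sample_1", minhash_signature({"AC", "CG", "AA"}))
candidates = index.query(minhash_signature({"AC", "CG", "AA"}))
```

A query only inspects the buckets its own bands fall into, so lookup cost does not grow with a full pairwise comparison against every indexed sample. The fixed `bands` and `rows` here must be chosen by hand, which is the kind of tuning a self-tuning scheme is meant to avoid.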

Summary

Introduction

The growth in publicly available microbiome data in recent years has yielded an invaluable resource for genomic research, allowing for the design of new studies, augmentation of novel datasets and reanalysis of published works. This vast amount of microbiome data, as well as the widespread proliferation of microbiome research and the looming era of clinical metagenomics, means there is an urgent need to develop analytics that can process huge amounts of data in a short amount of time. To address this need, we propose a new method for the compact representation of microbiome sequencing data using similarity-preserving sketches of streaming k-mer spectra.

Machine learning (ML) frameworks will struggle to use these de novo outputs as feature vectors because of their scale. This is a potential barrier to the use of these methods in microbiome analytics, as ML can help solve many of the data problems encountered in genomics and holds great potential for the field [12].
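
As an illustration of the underlying data, the following minimal sketch builds a k-mer spectrum (a histogram of k-mer counts) from a stream of reads and compares two spectra with a weighted Jaccard similarity. It makes several assumptions that are not taken from the text: k-mers are collapsed to their canonical form, k-mers containing ambiguous bases are skipped, and similarity is computed exactly on the full spectra rather than estimated from compact sketches. All function names are hypothetical.

```python
from collections import Counter

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def canonical(kmer):
    """Return the lexicographically smaller of a k-mer and its reverse complement."""
    reverse_complement = kmer.translate(_COMPLEMENT)[::-1]
    return min(kmer, reverse_complement)


def kmer_spectrum(reads, k=7):
    """Update a k-mer count histogram one read at a time (streaming input)."""
    spectrum = Counter()
    for read in reads:
        read = read.upper()
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            # Assumption: skip k-mers with ambiguous bases such as N.
            if set(kmer) <= set("ACGT"):
                spectrum[canonical(kmer)] += 1
    return spectrum


def weighted_jaccard(a, b):
    """Weighted Jaccard similarity: sum of bin minima over sum of bin maxima."""
    keys = set(a) | set(b)
    numerator = sum(min(a.get(key, 0), b.get(key, 0)) for key in keys)
    denominator = sum(max(a.get(key, 0), b.get(key, 0)) for key in keys)
    return numerator / denominator if denominator else 1.0


# Usage: two tiny "samples" compared on their 2-mer spectra.
sample_a = kmer_spectrum(["ACGT"], k=2)
sample_b = kmer_spectrum(["AAAA"], k=2)
similarity = weighted_jaccard(sample_a, sample_b)
```

Because `reads` can be any iterable, the histogram can be updated as sequences arrive instead of after the whole file is loaded. The full spectrum grows with the number of distinct k-mers, which is the cost a fixed-size, similarity-preserving sketch is meant to avoid.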

Methods

Results

Discussion

Conclusion