Abstract
We propose a new model for fast classification of DNA sequences output by next-generation sequencing machines. The model, which we call fastDNA, embeds DNA sequences in a vector space by learning continuous low-dimensional representations of the k-mers they contain. We show on metagenomics benchmarks that it outperforms state-of-the-art methods in terms of accuracy and scalability.
Highlights
The cost of DNA sequencing has fallen by a factor of 100,000 over the last 10 years
While so-called long-read technologies are under active development and may become dominant in the future, the current market of DNA sequencing is dominated by so-called next-generation sequencing (NGS) technologies, which break long strands of DNA into short fragments of typically 50 to 400 bases each and "read" the sequence of bases that compose each fragment
After presenting the model and its optimization in more detail, we experimentally study the speed/performance trade-off on metagenomics experiments by varying the embedding dimension, and demonstrate that the approach outperforms state-of-the-art compositional approaches
Summary
The cost of DNA sequencing has fallen by a factor of 100,000 over the last 10 years. At less than $1,000 to sequence a human-size genome, sequencing is so cheap that it has quickly become a routine technique for characterizing the genome of biological samples, with numerous applications in health, food or energy. We investigate the feasibility of directly representing DNA reads as continuous vectors, instead of the discrete k-mer count vectors used by standard compositional approaches, and of replacing some discrete operations by continuous calculus in this embedding space. To illustrate this idea, we focus on an important application in metagenomics, where one sequences the DNA present in an environmental sample to characterize the microbes it contains [21, 4]. We still extract the k-mer composition of each read, but replace the N-dimensional one-hot encoding of each k-mer by a d-dimensional encoding, optimized to solve the task. This approach is similar to, e.g., the fastText model for natural language sequences [7, 3] or word2vec [18], but with a different notion of words to embed and a direct optimization of the classification error to learn the representation. After presenting the model and its optimization in more detail, we experimentally study the speed/performance trade-off on metagenomics experiments by varying the embedding dimension, and demonstrate that the approach outperforms state-of-the-art compositional approaches.
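To make the model concrete, here is a minimal sketch of the idea in PyTorch. It is not the authors' implementation; the class and function names, and the small values of k and d, are illustrative assumptions chosen so the example runs quickly. Each k-mer over {A, C, G, T} indexes a table of 4^k learned d-dimensional vectors, the read is represented by the mean of its k-mer embeddings, and a linear classifier on top is trained end-to-end.

```python
import torch
import torch.nn as nn

BASES = {"A": 0, "C": 1, "G": 2, "T": 3}

def kmer_indices(read: str, k: int) -> torch.Tensor:
    """Map each k-mer of the read to an integer in [0, 4**k).

    Assumes the read contains only the bases A, C, G, T.
    """
    ids = []
    for i in range(len(read) - k + 1):
        idx = 0
        for base in read[i : i + k]:
            idx = idx * 4 + BASES[base]
        ids.append(idx)
    return torch.tensor(ids, dtype=torch.long)

class FastDNASketch(nn.Module):  # hypothetical name, not from the paper
    def __init__(self, k: int = 8, d: int = 50, n_classes: int = 100):
        super().__init__()
        self.k = k
        # Dense d-dimensional embeddings replacing the 4**k-dimensional
        # one-hot encoding of each k-mer; mode="mean" averages the
        # embeddings of all k-mers in a read into a single read vector.
        self.embed = nn.EmbeddingBag(4 ** k, d, mode="mean")
        # Linear classifier over the read embedding, e.g. one class
        # per reference genome in the metagenomics task.
        self.classify = nn.Linear(d, n_classes)

    def forward(self, read: str) -> torch.Tensor:
        ids = kmer_indices(read, self.k).unsqueeze(0)  # shape (1, n_kmers)
        return self.classify(self.embed(ids))          # shape (1, n_classes)

model = FastDNASketch()
logits = model("ACGTACGTACGTACGTACGTACGTACGT")  # one short toy read
```

Training such a model with a standard cross-entropy loss on reads labeled by their genome of origin optimizes the k-mer embeddings directly for the classification error, which is what distinguishes this approach from word2vec-style pretraining followed by a separate classifier.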