Abstract
We propose a new model for fast classification of DNA sequences output by next-generation sequencing machines. The model, which we call fastDNA, embeds DNA sequences in a vector space by learning continuous low-dimensional representations of the k-mers they contain. We show on metagenomics benchmarks that it outperforms state-of-the-art methods in terms of accuracy and scalability.
Highlights
The cost of DNA sequencing has fallen by a factor of 100,000 over the last 10 years
While so-called long-read technologies are under active development and may become dominant in the future, the current market of DNA sequencing is dominated by so-called next-generation sequencing (NGS) technologies, which break long strands of DNA into short fragments of typically 50 to 400 bases each and "read" the sequence of bases that compose each fragment
After presenting the model and its optimization in more detail, we experimentally study the speed/performance trade-off on metagenomics experiments by varying the embedding dimension, and demonstrate that the approach outperforms state-of-the-art compositional approaches
Summary
The cost of DNA sequencing has fallen by a factor of 100,000 over the last 10 years. At less than $1,000 to sequence a human-size genome, sequencing is so cheap that it has quickly become a routine technique for characterizing the genome of biological samples, with numerous applications in health, food or energy. We investigate the feasibility of directly representing DNA reads as continuous vectors, instead of the discrete k-mer count vectors used by standard compositional approaches, and of replacing some discrete operations by continuous calculus in this embedding space. To illustrate this idea, we focus on an important application in metagenomics, where one sequences the DNA present in an environmental sample to characterize the microbes it contains [21, 4]. We still extract the k-mer composition of each read, but replace the N-dimensional one-hot encoding of each k-mer by a d-dimensional encoding, optimized to solve the task. This approach is similar to, e.g., the fastText model for natural language sequences [7, 3] or word2vec [18], but with a different notion of words to embed and a direct optimization of the classification error to learn the representation. After presenting the model and its optimization in more detail, we experimentally study the speed/performance trade-off on metagenomics experiments by varying the embedding dimension, and demonstrate that the approach outperforms state-of-the-art compositional approaches.
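To make the model concrete, here is a minimal sketch of the idea in PyTorch. It is not the authors' implementation; the class and function names, and the small values of k and d, are illustrative assumptions chosen so the example runs quickly. Each k-mer over {A, C, G, T} indexes a table of 4^k learned d-dimensional vectors, the read is represented by the mean of its k-mer embeddings, and a linear classifier on top is trained end-to-end.

```python
import torch
import torch.nn as nn

BASES = {"A": 0, "C": 1, "G": 2, "T": 3}

def kmer_indices(read: str, k: int) -> torch.Tensor:
    """Map each k-mer of the read to an integer in [0, 4**k).

    Assumes the read contains only the bases A, C, G, T.
    """
    ids = []
    for i in range(len(read) - k + 1):
        idx = 0
        for base in read[i : i + k]:
            idx = idx * 4 + BASES[base]
        ids.append(idx)
    return torch.tensor(ids, dtype=torch.long)

class FastDNASketch(nn.Module):  # hypothetical name, not from the paper
    def __init__(self, k: int = 8, d: int = 50, n_classes: int = 100):
        super().__init__()
        self.k = k
        # Dense d-dimensional embeddings replacing the 4**k-dimensional
        # one-hot encoding of each k-mer; mode="mean" averages the
        # embeddings of all k-mers in a read into a single read vector.
        self.embed = nn.EmbeddingBag(4 ** k, d, mode="mean")
        # Linear classifier over the read embedding, e.g. one class
        # per reference genome in the metagenomics task.
        self.classify = nn.Linear(d, n_classes)

    def forward(self, read: str) -> torch.Tensor:
        ids = kmer_indices(read, self.k).unsqueeze(0)  # shape (1, n_kmers)
        return self.classify(self.embed(ids))          # shape (1, n_classes)

model = FastDNASketch()
logits = model("ACGTACGTACGTACGTACGTACGTACGT")  # one short toy read
```

Training such a model with a standard cross-entropy loss on reads labeled by their genome of origin optimizes the k-mer embeddings directly for the classification error, which is what distinguishes this approach from word2vec-style pretraining followed by a separate classifier.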