Abstract

A DNA sequence can be represented in several ways; one is to split it into its component k-mers. In this work, we exploit the close similarity between natural language and the “genomic sequence language”, both of which are character-based, to represent DNA sequences. In a natural language processing context, k-mers can be treated as words. We therefore processed each DNA sequence as a set of overlapping k-mers and embedded them using the Global Vectors (GloVe) representation. The embedding representation of k-mers helped to overcome the curse of dimensionality, one of the main issues of traditional methods that encode k-mer occurrences as one-hot vectors. Experiments on the first Critical Assessment of Metagenome Interpretation (CAMI) dataset demonstrated that our method is an efficient way to cluster metagenomic reads and predict their taxonomy. This method could be used as a first step in downstream metagenomic analysis.
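As a minimal illustration of the idea of k-mers as words, the sketch below splits a read into overlapping k-mers and looks each one up in an embedding table. It is not the authors' implementation: the function names, the stride of 1, the values k = 3 and dim = 8, and the random vectors standing in for trained GloVe embeddings are all assumptions made for the example.

```python
from itertools import product
import random


def kmers(seq, k):
    """Return the overlapping k-mers of seq in order (stride 1, an assumption)."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def toy_embeddings(k, dim, seed=0):
    """Hypothetical stand-in for trained GloVe vectors: one random vector per k-mer."""
    rng = random.Random(seed)
    return {
        "".join(p): [rng.uniform(-1.0, 1.0) for _ in range(dim)]
        for p in product("ACGT", repeat=k)
    }


def embed_read(seq, k, table):
    """Map a read to the list of embedding vectors of its k-mers.

    k-mers missing from the table (for example, ones containing N) are skipped.
    """
    return [table[w] for w in kmers(seq, k) if w in table]


read = "ACGTAC"
words = kmers(read, 3)              # ['ACG', 'CGT', 'GTA', 'TAC']
table = toy_embeddings(k=3, dim=8)  # 4**3 = 64 possible 3-mers
vectors = embed_read(read, 3, table)

# A one-hot encoding needs one dimension per possible k-mer (4**k),
# while the embedding dimension is fixed independently of k.
one_hot_size = 4 ** 3               # 64
embedding_size = len(vectors[0])    # 8
```

The one-hot size grows as 4^k (already 65,536 for k = 8), whereas the embedding size stays fixed, which is the dimensionality saving the abstract refers to.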
