Abstract

Analyzing metagenomic data is challenging not only because the sequences are acquired directly from samples in their natural habitats, but also because of the data's high volume and high dimensionality. Because no prior lab-based cultivation is carried out in metagenomics, inferring the presence of numerous microorganisms is all the more difficult, which accentuates the need for an informative visualization of the data. In a successful visualization, congruent reads should appear in clusters that reflect the diversity and taxonomy of the microorganisms in the sequenced sample. Metagenomic data represented as oligonucleotide frequency vectors are inherently high dimensional and therefore cannot be visualized directly. This raises the need for a dimensionality reduction technique that converts the high-dimensional sequence data into lower-dimensional data for visualization. In this process, preserving the genomic characteristics must be given the highest priority. Currently, Principal Component Analysis (PCA), a linear technique, and t-distributed Stochastic Neighbor Embedding (t-SNE), a non-linear technique, are widely used for dimensionality reduction in metagenomics. Despite their wide use, these techniques have shortcomings that make them less than ideally suited to the domain of metagenomics. Our research explores the use of autoencoders, a deep learning technique with the potential to overcome the limitations of the existing dimensionality reduction techniques and eventually lead to richer visualizations.

Highlights

  • The field of metagenomics has attracted considerable interest among bioinformatics and computer science researchers in recent years

  • We identified the following autoencoder configurations as working best: {32, 16, 2, 16, 32} for 3-mers and {136, 64, 2, 64, 136} for 4-mers. The results obtained using these autoencoders are presented in the paper

  • The superior results obtained with autoencoders support their potential for dimensionality reduction and visualization of metagenomic reads
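
To make the layer configurations quoted in the highlights concrete, the following is a minimal, untrained sketch of how such an autoencoder maps a frequency vector to a 2-D code and back. Only the layer sizes come from the source. The tanh hidden activations, the linear bottleneck and output layers, the uniform weight initialisation, and all function names are assumptions made for illustration, and training (minimising the reconstruction error) is omitted; in practice a deep learning framework would normally be used.

```python
import math
import random

# Layer sizes quoted in the highlights (input, hidden, 2-D bottleneck, hidden, output).
LAYERS_3MER = [32, 16, 2, 16, 32]
LAYERS_4MER = [136, 64, 2, 64, 136]


def init_weights(layer_sizes, seed=0):
    """Create one (weight matrix, bias vector) pair per pair of adjacent layers.

    The uniform initialisation is an assumption, not taken from the source.
    """
    rng = random.Random(seed)
    weights = []
    for n_in, n_out in zip(layer_sizes, layer_sizes[1:]):
        scale = 1.0 / math.sqrt(n_in)
        w = [[rng.uniform(-scale, scale) for _ in range(n_in)] for _ in range(n_out)]
        b = [0.0] * n_out
        weights.append((w, b))
    return weights


def dense(x, w, b, activate):
    """Fully connected layer: w @ x + b, optionally followed by tanh."""
    out = [sum(wi * xi for wi, xi in zip(row, x)) + bias for row, bias in zip(w, b)]
    return [math.tanh(v) for v in out] if activate else out


def encode(x, weights):
    """Run the first half of the network and return the bottleneck code."""
    half = len(weights) // 2
    for i, (w, b) in enumerate(weights[:half]):
        # Hidden layers use tanh; the bottleneck itself is left linear (assumption).
        x = dense(x, w, b, activate=(i < half - 1))
    return x


def decode(code, weights):
    """Run the second half of the network and return the reconstruction."""
    rest = weights[len(weights) // 2:]
    for i, (w, b) in enumerate(rest):
        # Hidden layers use tanh; the output layer is left linear (assumption).
        code = dense(code, w, b, activate=(i < len(rest) - 1))
    return code


def mse(a, b):
    """Mean squared reconstruction error, the quantity training would minimise."""
    return sum((p - q) ** 2 for p, q in zip(a, b)) / len(a)
```

Calling `encode` on a 136-dimensional 4-mer frequency vector with weights from `init_weights(LAYERS_4MER)` yields two numbers, which are the coordinates a scatter plot of reads would use once the network has been trained.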

Summary

Introduction

The field of metagenomics has attracted considerable interest among bioinformatics and computer science researchers in recent years. It has opened up new pathways in many areas, including the study of population-level genomic diversity of microbial organisms. Computing oligonucleotide frequencies is a widely used method that characterizes the nucleotide composition of microbial organisms with much better accuracy and effectiveness than %GC [2]. Contemporary studies have shown that the oligonucleotide frequencies that appear in genomic sequences are unique to a given microorganism. Research on this topic, which dates back to the 1960s, has shown that oligonucleotide frequencies carry species-specific signatures [3]. An array of all oligonucleotide frequencies for a given length provides a genomic signature for a microorganism.
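
As an illustration of such a frequency array, the sketch below computes a normalised k-mer frequency vector for a single read. It assumes that each k-mer is merged with its reverse complement, a common convention that yields 32 distinct 3-mers and 136 distinct 4-mers; the source does not state how its vectors were built, so this choice, the skipping of windows containing ambiguous bases, and all function names are assumptions.

```python
from itertools import product

# Translation table for complementing DNA bases.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def canonical(kmer):
    """Return the lexicographically smaller of a k-mer and its reverse complement."""
    reverse_complement = kmer.translate(COMPLEMENT)[::-1]
    return min(kmer, reverse_complement)


def canonical_kmers(k):
    """Sorted list of all canonical k-mers over the alphabet A, C, G, T."""
    return sorted({canonical("".join(p)) for p in product("ACGT", repeat=k)})


def frequency_vector(read, k):
    """Normalised canonical k-mer frequencies of one read.

    Windows containing characters other than A, C, G, T (for example N)
    are skipped. A read with no valid window yields an all-zero vector.
    """
    index = {kmer: i for i, kmer in enumerate(canonical_kmers(k))}
    counts = [0] * len(index)
    read = read.upper()
    for i in range(len(read) - k + 1):
        key = canonical(read[i:i + k])
        if key in index:
            counts[index[key]] += 1
    total = sum(counts)
    if total == 0:
        return [0.0] * len(counts)
    return [c / total for c in counts]
```

Under this convention, every read is mapped to a fixed-length vector regardless of its own length, which is what allows reads to be compared and projected into a lower-dimensional space.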

