Abstract

Simple Summary

Technological advancements have led to modern DNA sequencing methods capable of generating large amounts of data describing the microorganisms that live in samples taken from the environment. Metagenomics, the field that studies the different genomes within these samples, is becoming increasingly popular, as it has many real-world applications, such as the discovery of new antibiotics, personalized medicine, and forensics. From a computer science point of view, it is interesting to see how these large volumes of data can be processed efficiently to accurately identify (classify) the microorganisms from the input DNA data. This scoping review aims to give an insight into the existing state-of-the-art computational methods for processing metagenomic data through the prism of machine learning, data science, and big data. We provide an overview of the state-of-the-art metagenomic classification methods, as well as the challenges researchers face when tackling this complex problem. The end goal of this review is to help researchers stay up to date with current trends and identify opportunities for further research and improvements.

Applied machine learning in bioinformatics is growing as computer science slowly invades all research spheres. With the arrival of modern next-generation DNA sequencing algorithms, metagenomics is becoming an increasingly interesting research field, as it finds countless practical applications exploiting the vast amounts of generated data. This study aims to scope the scientific literature on metagenomic classification published between 2008 and 2019 and to provide an evolutionary timeline of data processing and machine learning in this field. It follows the scoping review methodology and PRISMA guidelines to identify and process the available literature. Natural Language Processing (NLP) is deployed to ensure an efficient and exhaustive search of the literature corpus of three large digital libraries: IEEE, PubMed, and Springer. The search is based on keywords and properties looked up using the digital libraries' search engines. The scoping review results reveal an increasing number of research papers related to metagenomic classification over the past decade. The research is mainly focused on metagenomic classifiers, identifying scope-specific metrics for model evaluation, data set sanitization, and dimensionality reduction. Of these subproblems, data preprocessing is the least researched and has considerable potential for improvement.

Highlights

  • Metagenomics is becoming an increasingly popular field in bioinformatics because advances in technology and machine learning allow us to build increasingly capable models for tackling the problems of DNA sequencing and genome classification

  • The number of relevant articles per year from each source, presented in Figure 3, shows that PubMed is consistently the most prevalent digital library source for metagenomic sequencing; Springer was narrowing this gap in 2019

  • Each of the property groups has a dedicated section for discussing the latest trends, tools, and inventions that relate to our primary research topic: metagenomic classification


Summary

Introduction

Metagenomics is becoming an increasingly popular field in bioinformatics because advances in technology and machine learning allow us to build increasingly capable models for tackling the problems of DNA sequencing and genome classification. Metagenomics deals with samples from the environment that likely contain many organisms, and its goal is to analyze the different genomes within such a sample. Thanks to advances in hardware, we can now process much larger quantities of data and train more complex machine learning models than was previously feasible. This positions metagenomics as one of the most prominent topics in Big Data, as it can be used extensively in medicine. Example applications include the identification of novel biocatalysts and the discovery of new antibiotics [2], personalized medicine [3,4,5], bioremediation of industrial, agricultural, and domestic wastes [6,7], which reduces environmental pollution, and forensics [8].
