Abstract

Metagenomics studies, as well as genomics studies of polyploid species such as wheat, deal with the analysis of high-variation data. Such data contain sequences from similar but distinct genetic chains, which presents an obstacle to analysis and research. In particular, instrumentation errors introduced during the digitization of the sequences can be hard to detect, because they may be indistinguishable from the real biological variation in the digital data. This can prevent the determination of the correct sequences and, at the same time, makes variant studies significantly more difficult. This paper details a collection of ML-based models used to distinguish a real variant from an erroneous one. The focus is on using these models directly, but experiments are also done in combination with other predictors that isolate a pool of error candidates.

Highlights

  • Metagenomics studies and studies of polyploid organisms are two areas of research in genomic analysis that are forced to deal with datasets containing high variation among the individual genetic sequences, which negatively affects the data analysis. Metagenomics studies communities of micro-organisms, like the ones forming the human microbiome [1] or the microbiome of urban environments [2], which are also the focus of studies of viral evolution [3].

  • Understanding of the human microbiome is essential for the future of medicine [4], is a significant part of nutrition studies [5], and affects human space flight [6] and the proper study of disease outbreaks [7]; metagenomics studies are also critical for agricultural development [8].

  • To verify the chosen design of the machine learning (ML) input examples, as well as the non-standard way of crafting training examples, models constructed using them were tested on metagenomics data and, in a later experiment, on hexaploid wheat data.

Introduction

Metagenomics studies and studies of polyploid organisms are two areas of research in genomic analysis that are forced to deal with datasets containing high variation among the individual genetic sequences, which negatively affects the data analysis. Metagenomics studies communities of micro-organisms, like the ones forming the human microbiome [1] or the microbiome of urban environments [2], which are also the focus of studies of viral evolution [3]. The presence of genetic sequences from multiple similar organisms in the same sample, a feature of great importance from a scientific standpoint, is a major obstacle to the data analysis. The inevitable machine errors in the data, which can masquerade as legitimate inter-species variation, cannot always be accounted for. These errors cause inaccuracies in the final study results [9,10] and are often resolved by discarding, in their entirety, reads suspected to contain errors [11].
