Analysis of heterogeneous genomic samples using image normalization and machine learning

Sunitha Basodi,Yi Pan,Pelin Icer Baykal,Alex Zelikovsky,Pavel Skums

doi:10.1186/s12864-020-6661-6


Open Access



Journal: BMC Genomics	Publication Date: Dec 1, 2020
Citations: 4	License type: open-access

Affiliation: Georgia State University, Sechenov University

Abstract

Background: Analysis of heterogeneous populations such as viral quasispecies is one of the most challenging bioinformatics problems. Although machine learning models are increasingly employed for analysis of sequence data from such populations, their straightforward application is impeded by multiple challenges: technological limitations and biases, the difficulty of selecting relevant features, and the need to compare genomic datasets of different sizes and structures.

Results: We propose a novel preprocessing approach that transforms irregular genomic data into normalized image data. This representation makes it possible to restate the problems of classification and comparison of heterogeneous populations as image classification problems, which can be solved using a variety of available machine learning tools. We then apply the proposed approach to two important problems in molecular epidemiology: inference of viral infection stage and detection of viral transmission clusters using next-generation sequencing data. The infection staging method has been applied to HCV HVR1 samples collected from 108 recently and 257 chronically infected individuals. The SVM-based image classification approach achieved more than 95% accuracy for both recently and chronically HCV-infected individuals. Clustering has been performed on the data collected from 33 epidemiologically curated outbreaks, yielding more than 97% accuracy.

Conclusions: The sequence image normalization method allows for a robust conversion of genomic data into numerical data and overcomes several issues associated with applying machine learning methods to viral populations. Image data also help in the visualization of genomic data. Experimental results demonstrate that the proposed method can be successfully applied to different problems in molecular epidemiology and surveillance of viral diseases. Simple binary classifiers and clustering techniques applied to the image data are equally or more accurate than other models.

Highlights

Analysis of heterogeneous populations such as viral quasispecies is one of the most challenging bioinformatics problems
The data [16, 28] used for classification of intra-host Hepatitis C virus (HCV) populations as recent or chronic consist of 365 next-generation sequencing (NGS) samples, including 108 datasets corresponding to recently infected hosts and 257 datasets belonging to chronically infected hosts
Here, we propose a novel method for generation of a fixed set of features representing heterogeneous viral populations, which is widely applicable for various classification and clustering tasks addressed by machine learning

Summary

Introduction

Analysis of heterogeneous populations such as viral quasispecies is one of the most challenging bioinformatics problems. Although machine learning models are increasingly employed for analysis of sequence data from such populations, their straightforward application is impeded by multiple challenges: technological limitations and biases, the difficulty of selecting relevant features, and the need to compare genomic datasets of different sizes and structures. The paramount feature of RNA viruses is their extremely high mutation rate caused by error-prone replication, which can be as high as 10^-4 mutations per site per replication cycle [1], resulting in generation of all possible single point mutations in each infected individual every day. As a consequence, RNA viruses exist in infected hosts as highly heterogeneous populations of genomic variants, usually referred to as viral quasispecies. The challenges associated with understanding complex quasispecies evolution have attracted many researchers in different domains, including virology, epidemiology, population genetics and systems biology.
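The claim that every single point mutation arises daily can be checked with back-of-the-envelope arithmetic. The sketch below is illustrative only: the function names are hypothetical, the daily genome production figure (10^12) is an assumed order of magnitude rather than a value taken from this paper, and mutations at a site are assumed to be equally likely to produce each of the three alternative nucleotides.

```python
import math


def expected_mutant_copies(mu: float, genomes_per_day: float) -> float:
    """Expected number of new genomes per day carrying one specific
    single point mutation (a given site changed to a given nucleotide).

    mu is the per-site, per-replication mutation rate. Dividing by 3
    assumes the three alternative nucleotides are equally likely.
    """
    return genomes_per_day * mu / 3.0


def prob_mutant_appears(mu: float, genomes_per_day: float) -> float:
    """Poisson approximation of the probability that the specific
    mutant is produced at least once in a day: 1 - exp(-expected)."""
    return -math.expm1(-expected_mutant_copies(mu, genomes_per_day))


# Assumed inputs: mu from the text, daily production is a placeholder.
MU = 1e-4
GENOMES_PER_DAY = 1e12  # assumption, order of magnitude only

copies = expected_mutant_copies(MU, GENOMES_PER_DAY)
p = prob_mutant_appears(MU, GENOMES_PER_DAY)
print(f"expected copies of a given mutant per day: {copies:.3g}")
print(f"probability it appears at least once: {p}")
```

Under these assumptions each specific mutant is expected tens of millions of times per day, so the probability of it appearing at least once is effectively 1. The conclusion only weakens when daily production falls to roughly 3/mu genomes or fewer, where the expected count per mutant drops to about one.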

Methods

Results

Discussion

Conclusion