A random forest classifier for detecting rare variants in NGS data from viral populations

Raunaq Malhotra, Manjari Jha, Mary Poss, Raj Acharya

doi:10.1016/j.csbj.2017.07.001

Abstract

We propose a random forest classifier for distinguishing rare variants from sequencing errors in Next Generation Sequencing (NGS) data from viral populations. The method uses counts of k-mers of varying lengths, taken from the reads of a viral population, to train a random forest classifier called MultiRes that classifies k-mers as erroneous or as rare variants. Our algorithm is rooted in concepts from signal processing and uses a frame-based representation of k-mers. Frames are sets of non-orthogonal basis functions that were traditionally used in signal processing for noise removal. We define discrete spatial signals for genomes and sequenced reads, and show that k-mers of a given size constitute a frame. We evaluate MultiRes on simulated and real viral population datasets, which consist of many low-frequency variants, and compare it to the error detection methods used in error correction tools known in the literature. MultiRes produces 4 to 500 times fewer false positive k-mer predictions than other methods, which is essential for accurate estimation of viral population diversity and for de novo assembly. Its recall of the true k-mers is high and comparable to that of other error correction methods. MultiRes also has greater than 95% recall for detecting single nucleotide polymorphisms (SNPs) and fewer false positive SNPs, while detecting a higher number of rare variants than other variant calling methods for viral populations. The software is freely available on GitHub at https://github.com/raunaq-m/MultiRes.
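The idea of using k-mer counts at several lengths as classifier input can be illustrated with a minimal sketch. It is not the MultiRes implementation: the function names `count_kmers` and `multires_features`, the choice of sizes, and the use of prefix counts as features are all assumptions made for illustration only.

```python
from collections import Counter


def count_kmers(reads, k):
    """Count every length-k substring across all reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def multires_features(reads, sizes):
    """Hypothetical feature map: for each k-mer of the largest size,
    list the counts of its prefixes at every requested size."""
    sizes = sorted(sizes)
    tables = {k: count_kmers(reads, k) for k in sizes}
    largest = sizes[-1]
    return {
        kmer: [tables[k][kmer[:k]] for k in sizes]
        for kmer in tables[largest]
    }


# Toy reads: two identical copies and one read with a single substitution.
reads = ["ACGTAC", "ACGTAC", "ACGTTC"]
features = multires_features(reads, sizes=[2, 3, 4])
for kmer, vector in sorted(features.items()):
    print(kmer, vector)
```

In this toy input, the 4-mers unique to the third read have a low count at the largest size while their shorter prefixes are still well supported. Feature vectors of this kind, paired with labels, could then be passed to any random forest implementation for training.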

Highlights

The sequence diversity present in a population of closely related genomes is important for their survival under environmental pressures.
As these tools are traditionally designed for error correction, the error-corrected reads or k-mers from these methods were used for comparison with the rare variant k-mers and common k-mers predicted by MultiRes.
One objective of Next Generation Sequencing (NGS) studies of viruses is to identify the single nucleotide polymorphisms (SNPs) in a population [2,7,9], a task that is sensitive to erroneous reads. We therefore evaluate the inference of SNPs from the k-mers predicted by MultiRes and compare it to known SNP profiling methods for viral populations.

Summary

Introduction

The sequence diversity present in a population of closely related genomes is important for their survival under environmental pressures. A viral population within a host is an example of such a population of closely related genomes, in which some viral strains survive even when large segments of their genome are deleted. Removing sequencing errors from NGS data involves first distinguishing the errors from true biological sequences and then correcting the errors to the true sequence. For NGS data obtained from a viral population, the reads are mapped to a reference genome to distinguish true variants from sequencing errors based on a probabilistic model [6,7,9,10,11], and the sequencing errors are corrected to the sequence of the reference genome. Because a viral population contains a large diversity of true sequences, accurate mapping of reads to any one reference may not be possible.


Journal: Computational and Structural Biotechnology Journal
Publication Date: Jan 1, 2017
Citations: 6
License type: cc-by
