Fast estimation of genetic relatedness between members of heterogeneous populations of closely related genomic variants

Viachaslau Tsyvina, Seth Sims, Alex Zelikovsky, Yury Khudyakov, David S. Campo, Pavel Skums

doi:10.1186/s12859-018-2333-9


Open Access

https://doi.org/10.1186/s12859-018-2333-9


Abstract

Background: Many biological analysis tasks require extraction of families of genetically similar sequences from large datasets produced by Next-generation Sequencing (NGS). Such tasks include detection of viral transmissions by analysis of all genetically close pairs of sequences from viral datasets sampled from infected individuals, and the study of the evolution of viruses or immune repertoires by analysis of networks of intra-host viral variants or antibody clonotypes formed by genetically close sequences. The most obvious naïve algorithms for extracting such sequence families are impractical in light of the massive size of modern NGS datasets.

Results: In this paper, we present a fast and scalable k-mer-based framework that performs such sequence similarity queries efficiently and specifically targets data produced by deep sequencing of heterogeneous populations such as viruses. It shows better filtering quality and time performance compared with other tools. The tool is freely available for download at https://github.com/vyacheslav-tsivina/signature-sj

Conclusion: The proposed tool allows for efficient detection of genetic relatedness between genomic samples produced by deep sequencing of heterogeneous populations. It should be especially useful for analyzing the relatedness of genomes of viruses with unevenly distributed variable genomic regions, such as HIV and HCV. In the future, we envision that, besides applications in molecular epidemiology, the tool can also be adapted to immunosequencing and metagenomics data.

Highlights

Many biological analysis tasks require extraction of families of genetically similar sequences from large datasets produced by Next-generation Sequencing (NGS).
We describe a tool which uses a k-mer-based signature filtering scheme optimized for viral data to solve the following problems:
Each sample was processed by error correction and haplotyping tools; as a result, the input datasets consist of unique Hepatitis C Virus (HCV) haplotypes.

Summary

Introduction

Many biological analysis tasks require extraction of families of genetically similar sequences from large datasets produced by Next-generation Sequencing (NGS). The similarity join problem consists in locating the set P of all pairs of sequences, with one sequence from the set T1 and the other from the set T2, that lie within an edit distance or Hamming distance defined by the specified threshold t. In molecular epidemiology, this computational problem needs to be solved to detect viral transmissions from sequences of intra-host viral variants. The distance calculation can be restricted to a small subset of diagonals neighboring the main diagonal of the dynamic programming matrix, leading to an O(tL) time algorithm [12]. In this case, a naïve algorithm for the similarity join problem, which requires pairwise comparison of all sequences (N sequences of length L), has an asymptotic running time of O(tLN^2), which is still impractical for more than several thousand sequences.
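
The banded distance computation and the naïve similarity join described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names banded_edit_distance and naive_similarity_join are hypothetical, and the band layout is one common way to restrict the dynamic programming matrix to the 2t + 1 diagonals around the main diagonal.

```python
from typing import List, Optional, Tuple


def banded_edit_distance(a: str, b: str, t: int) -> Optional[int]:
    """Return the edit distance between a and b if it is <= t, else None.

    Only the 2t + 1 diagonals around the main diagonal are stored and
    updated, so each row costs O(t) and the whole computation O(t * L).
    """
    n, m = len(a), len(b)
    if abs(n - m) > t:
        return None  # the length difference alone exceeds the threshold
    big = t + 1  # stands for "more than t"; exact larger values are not needed
    width = 2 * t + 1
    # Band index k in row i corresponds to column j = i + k - t.
    prev = [big] * width
    for k in range(width):
        j = k - t  # row i = 0
        if 0 <= j <= m:
            prev[k] = min(j, big)
    for i in range(1, n + 1):
        cur = [big] * width
        for k in range(width):
            j = i + k - t
            if j < 0 or j > m:
                continue  # cell lies outside the matrix
            if j == 0:
                cur[k] = min(i, big)
                continue
            # Substitution or match: cell (i-1, j-1) has the same band index k.
            best = prev[k] + int(a[i - 1] != b[j - 1])
            # Deletion from a: cell (i-1, j) has band index k + 1.
            if k + 1 < width:
                best = min(best, prev[k + 1] + 1)
            # Insertion into a: cell (i, j-1) has band index k - 1.
            if k - 1 >= 0:
                best = min(best, cur[k - 1] + 1)
            cur[k] = min(best, big)
        prev = cur
    d = prev[m - n + t]  # band index of cell (n, m)
    return d if d <= t else None


def naive_similarity_join(
    T1: List[str], T2: List[str], t: int
) -> List[Tuple[str, str]]:
    """Compare every sequence in T1 with every sequence in T2.

    With N sequences per set this performs on the order of N^2 banded
    comparisons, which is the O(t * L * N^2) cost discussed in the text.
    """
    pairs = []
    for x in T1:
        for y in T2:
            if banded_edit_distance(x, y, t) is not None:
                pairs.append((x, y))
    return pairs
```

The quadratic loop in naive_similarity_join is what a filtering scheme tries to avoid: most pairs are far apart, so rejecting them cheaply before running the banded comparison is where the savings come from.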



Journal: BMC Bioinformatics	Publication Date: Oct 1, 2018
Citations: 4	License type: open-access
