aBayesQR: A Bayesian Method for Reconstruction of Viral Populations Characterized by Low Diversity

Abstract

RNA viruses replicate with high mutation rates, creating closely related viral populations. The heterogeneous virus populations, referred to as viral quasispecies, rapidly adapt to environmental changes, thus adversely affecting the efficiency of antiviral drugs and vaccines. Therefore, studying the underlying genetic heterogeneity of viral populations plays a significant role in the development of effective therapeutic treatments. Recent high-throughput sequencing technologies have provided an invaluable opportunity for uncovering the structure of quasispecies populations. However, accurate reconstruction of viral quasispecies remains difficult due to limited read lengths and the presence of sequencing errors. The problem is particularly challenging when the strains in a population are highly similar, that is, the sequences are characterized by low mutual genetic distances, and further exacerbated if some of those strains are relatively rare; this is the setting where state-of-the-art methods struggle. In this article, we present a novel viral quasispecies reconstruction algorithm, aBayesQR, that uses a maximum-likelihood framework to infer individual sequences in a mixture from high-throughput sequencing data. The search for the most likely quasispecies is conducted on long contigs that our method constructs from the set of short reads via agglomerative hierarchical clustering; operating on contigs rather than short reads enables identification of close strains in a population and provides computational tractability of the Bayesian method. Results on both simulated and real HIV-1 data demonstrate that the proposed algorithm generally outperforms state-of-the-art methods; aBayesQR particularly stands out when reconstructing a set of closely related viral strains (i.e., quasispecies characterized by low diversity).
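To make the contig-building idea concrete, here is a minimal sketch of agglomerative hierarchical clustering over aligned reads. It is not the aBayesQR implementation: the read representation (start position plus sequence), the merge criterion (lowest mismatch rate on the overlap, then longest overlap), and the names `build_contigs`, `min_overlap` and `max_mismatch_rate` are all assumptions made for illustration, and sequencing errors and the likelihood-based search are left out.

```python
from collections import Counter


def to_cluster(start, seq):
    """Turn an aligned read into {reference position: Counter of bases}."""
    return {start + i: Counter(base) for i, base in enumerate(seq)}


def top_base(counts):
    """Most frequent base observed at one position of a cluster."""
    return counts.most_common(1)[0][0]


def overlap_mismatches(a, b):
    """Return (overlap length, number of consensus mismatches) for two clusters."""
    shared = a.keys() & b.keys()
    mismatches = sum(1 for p in shared if top_base(a[p]) != top_base(b[p]))
    return len(shared), mismatches


def merge(a, b):
    """Combine the per-position base counts of two clusters."""
    merged = {p: Counter(c) for p, c in a.items()}
    for p, c in b.items():
        merged.setdefault(p, Counter()).update(c)
    return merged


def consensus(cluster):
    """Return (start position, consensus string) of a cluster."""
    positions = sorted(cluster)
    return positions[0], "".join(top_base(cluster[p]) for p in positions)


def build_contigs(reads, min_overlap=3, max_mismatch_rate=0.0):
    """Greedily merge the closest pair of clusters until no pair qualifies.

    Both thresholds are hypothetical illustration parameters.
    """
    clusters = [to_cluster(start, seq) for start, seq in reads]
    while True:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                n, m = overlap_mismatches(clusters[i], clusters[j])
                if n < max(min_overlap, 1) or m / n > max_mismatch_rate:
                    continue
                key = (m / n, -n)  # prefer low mismatch rate, then long overlap
                if best is None or key < best[0]:
                    best = (key, i, j)
        if best is None:
            break
        _, i, j = best
        clusters[i] = merge(clusters[i], clusters[j])
        del clusters[j]
    return sorted(consensus(c) for c in clusters)
```

With six error-free reads drawn from two strains that differ at a single site, the sketch keeps the strains apart because every cross-strain pair disagrees on its overlap. Real data would need a nonzero mismatch tolerance, which is exactly where close strains become hard to separate.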


Citations (showing 10 of 28 papers)
  • Research Article
  • 10.1371/journal.pcbi.1013360.r008
NextVir: Enabling classification of tumor-causing viruses with genomic foundation models
  • Aug 21, 2025
  • PLOS Computational Biology
  • John Robertson + 3 more

Motivation: Oncoviruses, pathogens known to cause or increase the risk of cancer, include both common viruses such as human papillomaviruses and rarer pathogens such as human T-lymphotropic viruses. Computational methods for detecting viral DNA from data acquired by modern DNA sequencing technologies have enabled studies of the association between oncoviruses and cancers. Those studies are rendered particularly challenging when multiple species of oncovirus are present in a tumor sample. In such scenarios, merely detecting the presence of a sequencing read of viral origin is insufficiently informative; instead, a more precise characterization of the viral content in the sample is required.
Results: We address this need with NextVir, to our knowledge the first multi-class viral classification framework that adapts genomic foundation models to detecting and classifying sequencing reads of oncoviral origin. Specifically, NextVir explores several foundation models (DNABERT-S, Nucleotide Transformer, and HyenaDNA) and efficiently fine-tunes them to enable accurate identification of the sequencing reads’ origin. The results demonstrate superior performance of the proposed framework over existing deep learning methods and suggest downstream potential for foundational models in genomics.

  • Research Article
  • Cited by 7
  • 10.1186/s12859-022-05100-3
VirPool: model-based estimation of SARS-CoV-2 variant proportions in wastewater samples
  • Dec 19, 2022
  • BMC Bioinformatics
  • Askar Gafurov + 8 more

Background: The genomes of SARS-CoV-2 are classified into variants, some of which are monitored as variants of concern (e.g. the Delta variant B.1.617.2 or Omicron variant B.1.1.529). Proportions of these variants circulating in a human population are typically estimated by large-scale sequencing of individual patient samples. Sequencing a mixture of SARS-CoV-2 RNA molecules from wastewater provides a cost-effective alternative, but requires methods for estimating variant proportions in a mixed sample.
Results: We propose a new method based on a probabilistic model of sequencing reads, capturing sequence diversity present within individual variants, as well as sequencing errors. The algorithm is implemented in an open source Python program called VirPool. We evaluate the accuracy of VirPool on several simulated and real sequencing data sets from both Illumina and nanopore sequencing platforms, including wastewater samples from Austria and France monitoring the onset of the Alpha variant.
Conclusions: VirPool is a versatile tool for wastewater and other mixed-sample analysis that can handle both short- and long-read sequencing data. Our approach does not require pre-selection of characteristic mutations for variant profiles, it is able to use the entire length of reads instead of just the most informative positions, and can also capture haplotype dependencies within a single read.

  • Research Article
  • Cited by 3
  • 10.1093/gigascience/giae065
V-pipe 3.0: a sustainable pipeline for within-sample viral genetic diversity estimation.
  • Jan 2, 2024
  • GigaScience
  • Lara Fuhrmann + 18 more

The large amount and diversity of viral genomic datasets generated by next-generation sequencing technologies pose a set of challenges for computational data analysis workflows, including rigorous quality control, scaling to large sample sizes, and tailored steps for specific applications. Here, we present V-pipe 3.0, a computational pipeline designed for analyzing next-generation sequencing data of short viral genomes. It is developed to enable reproducible, scalable, adaptable, and transparent inference of genetic diversity of viral samples. By presenting 2 large-scale data analysis projects, we demonstrate the effectiveness of V-pipe 3.0 in supporting sustainable viral genomic data science.

  • Research Article
  • Cited by 10
  • 10.1093/gbe/evz069
Mutagenesis Scanning Uncovers Evolutionary Constraints on Tobacco Etch Potyvirus Membrane-Associated 6K2 Protein.
  • Mar 27, 2019
  • Genome Biology and Evolution
  • Rubén González + 4 more

The high mutation rate of RNA viruses is a double-edged sword. On the one hand, most mutations jeopardize protein function; on the other hand, mutations are needed to fuel adaptation. The relevant question then is the ratio between beneficial and deleterious mutations. To evaluate this ratio, we created a mutant library of the 6K2 gene of tobacco etch potyvirus that contains every possible single-nucleotide substitution. The 6K2 protein anchors the virus replication complex to the network of endoplasmic reticulum membranes. The library was inoculated into the natural host Nicotiana tabacum, allowing competition among all these mutants and selection of those that are potentially viable. We identified 11 nonsynonymous mutations that remain in the viral population at measurable frequencies and evaluated their fitness. Some had fitness values higher than the wild-type and some were deleterious. The effect of these mutations on the structure, transmembrane properties, and function of 6K2 was evaluated in silico. In parallel, the effect of these mutations on infectivity, virus accumulation, symptom development, and subcellular localization was evaluated in the natural host. The α-helix H1 in the N-terminal part of 6K2 turned out to be under purifying selection, while most observed mutations affect the link between transmembrane α-helices H2 and H3, fusing them into a longer helix and increasing its rigidity. In general, these changes are associated with higher within-host fitness and development of milder or no symptoms. This finding suggests that in nature selection upon 6K2 may result from a tradeoff between within-host accumulation and severity of symptoms.

  • Research Article
  • 10.1007/978-1-0716-4702-8_6
Algorithms for Short-Read Viral Haplotype Reconstruction: Challenges, Solutions, and Perspectives.
  • Feb 24, 2012
  • Methods in molecular biology (Clifton, N.J.)
  • Wing-Yan Joyce Sung + 1 more

RNA viruses, such as HIV, HCV, and SARS-CoV-2, show high levels of intrahost genetic diversity. Many different haplotypes can be present in a single infection, which can be studied using next-generation sequencing. However, full-length haplotype reconstruction from short reads is computationally challenging due to the presence of low-frequency mutants, as well as sequencing errors. Moreover, reads may not be long enough to span regions between neighboring mutations. Finally, the sequencing depths needed to discover such low-frequency mutants result in large datasets, which require highly efficient algorithms. In this review, we provide an overview of current strategies to address these challenges and identify potential directions for increasing the accuracy and efficiency of viral haplotype reconstruction. Such developments will be key to advancing our understanding of viral evolution, improving treatment strategies, and informing public health interventions.

  • Preprint Article
  • Cited by 3
  • 10.1101/2020.09.29.318642
A Convolutional Auto-Encoder for Haplotype Assembly and Viral Quasispecies Reconstruction
  • Oct 1, 2020
  • Ziqi Ke + 1 more

Haplotype assembly and viral quasispecies reconstruction are challenging tasks concerned with analysis of genomic mixtures using sequencing data. High-throughput sequencing technologies generate enormous amounts of short fragments (reads) which essentially oversample components of a mixture; the representation redundancy enables reconstruction of the components (haplotypes, viral strains). The reconstruction problem, known to be NP-hard, boils down to grouping together reads originating from the same component in a mixture. Existing methods struggle to solve this problem with the required level of accuracy and low runtimes; the problem is becoming increasingly challenging as the number and length of the components increase. This paper proposes a read clustering method based on a convolutional auto-encoder designed to first project sequenced fragments to a low-dimensional space and then estimate the probability of the read origin using learned embedded features. The components are reconstructed by finding consensus sequences that agglomerate reads from the same origin. Mini-batch stochastic gradient descent and dimension reduction of reads allow the proposed method to efficiently deal with massive numbers of long reads. Experiments on simulated, semi-experimental and experimental data demonstrate the ability of the proposed method to accurately reconstruct haplotypes and viral quasispecies, often demonstrating superior performance compared to state-of-the-art methods.

  • Research Article
  • Cited by 2
  • 10.3389/fcimb.2021.715143
Viral Haplotypes in COVID-19 Patients Associated With Prolonged Viral Shedding
  • Nov 3, 2021
  • Frontiers in Cellular and Infection Microbiology
  • Yingping Wu + 10 more

Background: Recently, more patients who recovered from the novel coronavirus disease 2019 (COVID-19) have later tested positive again for severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) by reverse transcription-polymerase chain reaction (RT-PCR) testing. The explanation for clinical cases of long-term viral shedding is still controversial, and it remains unclear whether persistent viral shedding means re-infection or recurrence.
Methods: Specimens were collected from three COVID-19-confirmed patients, and whole-genome sequencing was performed on these clinical specimens during their first hospital admission with a high viral load of SARS-CoV-2. Laboratory tests were examined and analyzed throughout the whole course of the disease. Phylogenetic analysis was carried out for SARS-CoV-2 haplotypes.
Results: We found haplotypes of SARS-CoV-2 co-infection in two COVID-19 patients (YW01 and YW03) with a long period of hospitalization. However, only one haplotype was observed in the other patient with chronic lymphocytic leukemia (YW02), which was verified as one kind of viral haplotype. Patients YW01 and YW02 were admitted to the hospital after being infected with COVID-19 as members of a family cluster, but they had different haplotype characteristics in the early stage of infection; YW01 and YW03 were from different infection sources; however, similar haplotypes were found together.
Conclusion: These findings show that haplotype diversity of SARS-CoV-2 may result in viral adaptation for persistent shedding in multiple recurrences of COVID-19 patients, who met the discharge requirement. However, the correlation between haplotype diversity of SARS-CoV-2 virus and immune status is not absolute. It showed important implications for the clinical management strategies for COVID-19 patients with long-term hospitalization or cases of recurrence.

  • Research Article
  • Cited by 8
  • 10.1093/bioinformatics/btab076
WgLink: reconstructing whole-genome viral haplotypes using L0+L1-regularization.
  • Feb 3, 2021
  • Bioinformatics
  • Chen Cao + 2 more

Many tools can reconstruct viral sequences based on next-generation sequencing reads. Although existing tools effectively recover local regions, their accuracy suffers when reconstructing the whole viral genomes (strains). Moreover, they consume significant memory when the sequencing coverage is high or when the genome size is large. We present WgLink to meet this challenge. WgLink takes local reconstructions produced by other tools as input and patches the resulting segments together into coherent whole-genome strains. We accomplish this using an L0+L1-regularized regression, synthesizing variant allele frequency data with physical linkage between multiple variants spanning multiple regions simultaneously. WgLink achieves higher accuracy than existing tools both on simulated and on real datasets while using significantly less memory (RAM) and fewer CPU hours. Source code and binaries are freely available at https://github.com/theLongLab/wglink. Supplementary data are available at Bioinformatics online.

  • Preprint Article
  • 10.1101/2021.08.26.457874
From Alpha to Zeta: Identifying variants and subtypes of SARS-CoV-2 via clustering
  • Aug 27, 2021
  • Andrew Melnyk + 7 more

The availability of millions of SARS-CoV-2 sequences in public databases such as GISAID and EMBL-EBI (UK) allows a detailed study of the evolution, genomic diversity and dynamics of a virus like never before. Here we identify novel variants and sub-types of SARS-CoV-2 by clustering sequences, adapting methods originally designed for haplotyping intra-host viral populations. We assess our results using clustering entropy, the first time it has been used in this context. Our clustering approach reaches lower entropies compared to other methods, and we are able to boost this even further through gap filling and Monte Carlo based entropy minimization. Moreover, our method clearly identifies the well-known Alpha variant in the UK and GISAID datasets, but is also able to detect the much less represented (< 1% of the sequences) Beta (South Africa), Epsilon (California), Gamma and Zeta (Brazil) variants in the GISAID dataset. Finally, we show that each variant identified has high selective fitness, based on the growth rate of its cluster over time. This demonstrates that our clustering approach is a viable alternative for detecting even rare subtypes in very large datasets.

  • Research Article
  • 10.1089/cmb.2025.0075
XVir: A Transformer-Based Architecture for Identifying Viral Reads from Cancer Samples.
  • May 20, 2025
  • Journal of computational biology : a journal of computational molecular cell biology
  • Shorya Consul + 2 more

It is estimated that approximately 15% of cancers worldwide can be linked to viral infections. The viruses that can cause or increase the risk of cancer include human papillomavirus, hepatitis B and C viruses, Epstein-Barr virus, and human immunodeficiency virus, to name a few. The computational analysis of the massive amounts of tumor DNA data, whose collection is enabled by the advancements in sequencing technologies, has allowed studies of the potential association between cancers and viral pathogens. However, the high diversity of oncoviral families makes reliable detection of viral DNA difficult, and the training of machine learning models that enable such analysis computationally challenging. We introduce XVir, a data pipeline that deploys a transformer-based deep learning architecture to reliably identify viral DNA present in human tumors. XVir is trained on a mix of sequencing reads coming from viral and human genomes, resulting in a model capable of robust detection of potentially mutated viral DNA across a range of experimental settings. Results on semi-experimental data demonstrate that XVir is able to achieve high classification accuracy, generally outperforming state-of-the-art competing methods. In particular, it retains high accuracy even when faced with diverse viral populations while being significantly faster to train than other large deep learning-based classifiers.

Similar Papers
  • Book Chapter
  • Cited by 9
  • 10.1007/978-3-319-56970-3_22
aBayesQR: A Bayesian Method for Reconstruction of Viral Populations Characterized by Low Diversity
  • Jan 1, 2017
  • Soyeon Ahn + 1 more


  • Research Article
  • Cited by 37
  • 10.1038/embor.2009.61
The 30th anniversary of quasispecies
  • Apr 3, 2009
  • EMBO reports
  • Esteban Domingo + 1 more

The meeting on ‘Quasispecies: past, present and future’ took place between 17 and 18 November 2008, in Barcelona, Spain, and was organized by J. Gomez, C. Lopez‐Galindez, M.A. Martinez & A. Mas. A meeting was held in Barcelona, Spain, in November 2008 to celebrate the 30th anniversary of the publication of the article that described the extensive genetic heterogeneity of bacteriophage Qβ (Domingo et al., 1978), which is considered to mark the beginning of experimental studies on viral quasispecies. This meeting was held at the impressive fifteenth century building of the ancient Hospital de la Santa Creu in the old town of Barcelona, which is now the headquarters of the Institut d'Estudis Catalans, and was attended by C. Weissmann (Jupiter, FL, USA), M. Billeter (Zurich, Switzerland) and E. Domingo (Madrid, Spain), who were three early protagonists of the phage Qβ work at the University of Zurich in the 1970s. Several speakers presented their results on the theoretical aspects of the population dynamics of cells and viruses, the clinical implications of quasispecies, and extensions of the quasispecies concept to cellular genes and prions. The meeting was introduced by A. Mas (Albacete, Spain), who reflected on the increasing impact that quasispecies have had in the scientific literature over the past three decades, and quoted some of the key references on viral quasispecies (Martell et al., 1992; Meyerhans et al., 1989; Najera et al., 1995; Vignuzzi et al., 2006; for a historical review of the impact of quasispecies in virology, see Holland, 2006). The scientific presentations were opened by Weissmann and Domingo, who were the last and first authors of the 1978 paper, respectively. Their talks conveyed the scientific atmosphere of the 1970s, when molecular biology was carried out with few recombinant‐DNA techniques, to a young audience. Nucleic‐acid sequencing was in …

  • Conference Article
  • Cited by 1
  • 10.1109/bibm.2014.6999128
Quasispecies reconstruction based on vertex coloring algorithms
  • Nov 1, 2014
  • Diyue Bu + 1 more

The viral quasispecies represent a set of related variants in a virus population (e.g. from an infected patient) that contain similar mutations due to the rapid and mutation-prone replications in viruses. The characterization of viral quasispecies in a highly divergent virus population is of great interest in biomedical research, in particular, to identify virulent and drug-resistant mutations in viral genomes for diagnosis of infectious diseases and targeted drug design. In recent years, next-generation sequencing (NGS) techniques have been widely used for deep sequencing of virus populations, in an attempt to characterize low-abundance viral quasispecies containing specific mutations associated with virulence or drug-resistance. However, because of the short length of NGS reads, it remains a challenge to reconstruct viral quasispecies from NGS sequencing data. In this paper, we formulate the viral quasispecies reconstruction as the vertex coloring problem on a read conflict graph, and then apply heuristic algorithms to solve it. We compared our new algorithms with one existing software tool on three simulated datasets for HIV quasispecies reconstruction. The results showed our methods can improve the accuracy of the inference of the identities and quantities of viral quasispecies in a virus population.
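The vertex-coloring formulation can be illustrated with a short sketch. The abstract does not say which heuristics the authors use, so the greedy largest-degree-first coloring below is a stand-in, and the read format (start position plus sequence) and the function names `conflicts` and `color_reads` are assumptions for illustration only. Two reads conflict when they overlap and disagree at some shared position; reads that receive the same color are then candidates for the same strain.

```python
def conflicts(r1, r2):
    """True if two aligned reads overlap and differ at a shared position."""
    (s1, q1), (s2, q2) = r1, r2
    lo, hi = max(s1, s2), min(s1 + len(q1), s2 + len(q2))
    return any(q1[p - s1] != q2[p - s2] for p in range(lo, hi))


def color_reads(reads):
    """Greedy coloring of the read conflict graph; returns one color per read."""
    n = len(reads)
    adj = [set() for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            if conflicts(reads[i], reads[j]):
                adj[i].add(j)
                adj[j].add(i)
    # Visit vertices with the most conflicts first (a common greedy heuristic).
    order = sorted(range(n), key=lambda v: -len(adj[v]))
    color = {}
    for v in order:
        used = {color[u] for u in adj[v] if u in color}
        c = 0
        while c in used:
            c += 1
        color[v] = c
    return [color[i] for i in range(n)]
```

Greedy coloring gives no guarantee of using the minimum number of colors, so the number of colors it returns is only an upper bound on the smallest number of strains consistent with the reads.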

  • Research Article
  • Cited by 90
  • 10.1016/j.virusres.2016.09.016
Recent advances in inferring viral diversity from high-throughput sequencing data
  • Sep 28, 2016
  • Virus Research
  • Susana Posada-Cespedes + 2 more


  • Conference Article
  • 10.1109/bibm.2017.8218007
On the integration of assembly and non-assembly approaches for comparing biological sequences
  • Nov 1, 2017
  • Vi Dam + 1 more

As Next Generation Sequencing (NGS) technologies continue to expand rapidly, the need to assemble and manipulate NGS data, available in the form of short genomic reads, remains the primary source of biological data in many Bioinformatics applications. As a result, many assemblers have been developed to assemble NGS short reads into long genomic sequences or contigs ready for advanced analysis such as Genome-Wide Association Studies (GWAS). However, the lack of high levels of robustness and reproducibility continues to limit the impact of Bioinformatics research and many biomedical researchers remain skeptical of results obtained from bioinformatics applications. In this study, we conduct a comparative study of various widely used assemblers and compare their performances using several NGS datasets associated with various organisms. We highlight the advantages and disadvantages of each assembler and explore the factors that impact the performance of each approach. In addition, we survey the assembly-free compression approach recently developed to process NGS short reads to analyze their performance in comparing genomic sequences represented by sets of short reads. We use phylogeny trees obtained from simulated and real datasets to evaluate the accuracy of each assembly-free approach. We test the hypothesis that non-assembly approaches could potentially overcome the limitations and inaccuracies of assembly approaches in comparing sequences, especially for large read sizes. Moreover, we propose a hybrid approach by integrating both assembly and non-assembly approaches for classifying genomic sequences. The proposed approach incorporates results obtained from partially assembling short reads as input for assembly-free methods to complete the NGS manipulation process. Preliminary superior results show that the hybrid approach has potential in comparing genomic sequences.

  • Research Article
  • Cited by 81
  • 10.1186/1471-2105-12-5
Combinatorial analysis and algorithms for quasispecies reconstruction using next-generation sequencing
  • Jan 5, 2011
  • BMC Bioinformatics
  • Mattia Cf Prosperi + 8 more

Background: Next-generation sequencing (NGS) offers a unique opportunity for high-throughput genomics and has potential to replace Sanger sequencing in many fields, including de-novo sequencing, re-sequencing, meta-genomics, and characterisation of infectious pathogens, such as viral quasispecies. Although methodologies and software for whole genome assembly and genome variation analysis have been developed and refined for NGS data, reconstructing a viral quasispecies using NGS data remains a challenge. This application would be useful for analysing intra-host evolutionary pathways in relation to immune responses and antiretroviral therapy exposures. Here we introduce a set of formulae for the combinatorial analysis of a quasispecies, given a NGS re-sequencing experiment and an algorithm for quasispecies reconstruction. We require that sequenced fragments are aligned against a reference genome, and that the reference genome is partitioned into a set of sliding windows (amplicons). The reconstruction algorithm is based on combinations of multinomial distributions and is designed to minimise the reconstruction of false variants, called in-silico recombinants.
Results: The reconstruction algorithm was applied to error-free simulated data and reconstructed a high percentage of true variants, even at a low genetic diversity, where the chance to obtain in-silico recombinants is high. Results on empirical NGS data from patients infected with hepatitis B virus, confirmed its ability to characterise different viral variants from distinct patients.
Conclusions: The combinatorial analysis provided a description of the difficulty to reconstruct a quasispecies, given a determined amplicon partition and a measure of population diversity. The reconstruction algorithm showed good performance both considering simulated data and real data, even in presence of sequencing errors.

  • Front Matter
  • Cited by 5
  • 10.1111/mec.16884
Long-read sequencing in ecology and evolution: Understanding how complex genetic and epigenetic variants shape biodiversity.
  • Mar 1, 2023
  • Molecular Ecology
  • Dan G Bock + 3 more


  • Research Article
  • Cited by 29
  • 10.1093/bioinformatics/bty291
Viral quasispecies reconstruction via tensor factorization with successive read removal
  • Jun 27, 2018
  • Bioinformatics
  • Soyeon Ahn + 2 more

Motivation: As RNA viruses mutate and adapt to environmental changes, often developing resistance to anti-viral vaccines and drugs, they form an ensemble of viral strains, a viral quasispecies. While high-throughput sequencing (HTS) has enabled in-depth studies of viral quasispecies, sequencing errors and limited read lengths render the problem of reconstructing the strains and estimating their spectrum challenging. Inference of viral quasispecies is difficult due to generally non-uniform frequencies of the strains, and is further exacerbated when the genetic distances between the strains are small.
Results: This paper presents TenSQR, an algorithm that utilizes tensor factorization framework to analyze HTS data and reconstruct viral quasispecies characterized by highly uneven frequencies of its components. Fundamentally, TenSQR performs clustering with successive data removal to infer strains in a quasispecies in order from the most to the least abundant one; every time a strain is inferred, sequencing reads generated from that strain are removed from the dataset. The proposed successive strain reconstruction and data removal enables discovery of rare strains in a population and facilitates detection of deletions in such strains. Results on simulated datasets demonstrate that TenSQR can reconstruct full-length strains having widely different abundances, generally outperforming state-of-the-art methods at diversities 1–10% and detecting long deletions even in rare strains. A study on a real HIV-1 dataset demonstrates that TenSQR outperforms competing methods in experimental settings as well. Finally, we apply TenSQR to analyze a Zika virus sample and reconstruct the full-length strains it contains.
Availability and implementation: TenSQR is available at https://github.com/SoYeonA/TenSQR.
Supplementary information: Supplementary data are available at Bioinformatics online.

  • Research Article
  • Citations: 8
  • 10.1093/bioinformatics/btt678
ISRNA: an integrative online toolkit for short reads from high-throughput sequencing data
  • Dec 3, 2013
  • Bioinformatics
  • Guan-Zheng Luo + 3 more

Integrative Short Reads NAvigator (ISRNA) is an online toolkit for analyzing high-throughput small RNA sequencing data. Besides the high-speed genome mapping function, ISRNA provides statistics for genomic location, length distribution and nucleotide composition bias analysis of sequence reads. Number of reads mapped to known microRNAs and other classes of short non-coding RNAs, coverage of short reads on genes, expression abundance of sequence reads as well as some other analysis functions are also supported. The versatile search functions enable users to select sequence reads according to their sub-sequences, expression abundance, genomic location, relationship to genes, etc. A specialized genome browser is integrated to visualize the genomic distribution of short reads. ISRNA also supports management and comparison among multiple datasets. ISRNA is implemented in Java/C++/Perl/MySQL and can be freely accessed at http://omicslab.genetics.ac.cn/ISRNA/.

  • Research Article
  • Citations: 204
  • 10.1073/pnas.052712599
Early changes in hepatitis C viral quasispecies during interferon therapy predict the therapeutic outcome.
  • Mar 5, 2002
  • Proceedings of the National Academy of Sciences
  • Patrizia Farci + 12 more

Despite recent treatment advances, the majority of patients with chronic hepatitis C fail to respond to antiviral therapy. Although the genetic basis for this resistance is unknown, accumulated evidence suggests that changes in the heterogeneous viral population (quasispecies) may be an important determinant of viral persistence and response to therapy. Sequences within hepatitis C virus (HCV) envelope 1 and envelope 2 genes, inclusive of the hypervariable region 1, were analyzed in parallel with the level of viral replication in serial serum samples obtained from 23 patients who exhibited different patterns of response to therapy and from untreated controls. Our study provides evidence that although the viral diversity before treatment does not predict the response to treatment, the early emergence and dominance of a single viral variant distinguishes patients who will have a sustained therapeutic response from those who subsequently will experience a breakthrough or relapse. A dramatic reduction in genetic diversity leading to an increasingly homogeneous viral population was a consistent feature associated with viral clearance in sustained responders and was independent of HCV genotype. The persistence of variants present before treatment in patients who fail to respond or who experience a breakthrough during therapy strongly suggests the preexistence of viral strains with inherent resistance to IFN. Thus, the study of the evolution of the HCV quasispecies provides prognostic information as early as the first 2 weeks after starting therapy and opens perspectives for elucidating the mechanisms of treatment failure in chronic hepatitis C.

  • Conference Article
  • 10.1109/allerton.2017.8262878
Viral quasispecies reconstruction via tensor factorization
  • Oct 1, 2017
  • Soyeon Ahn + 2 more

Viral quasispecies are heterogeneous mixtures of viral strains generated as RNA viruses mutate and adapt to environmental changes. High-throughput DNA sequencing enables reconstruction of viral quasispecies and estimation of their abundances, thus providing information that assists in the development of effective antiviral drugs and vaccines. In this paper, sequencing data is represented by means of a binary tensor and the discovery of viral strains is formulated as a tensor factorization problem. Performance of the proposed scheme is discussed. Results demonstrate effectiveness of the proposed algorithm.

  • Research Article
  • Citations: 1060
  • 10.1038/nature04388
Quasispecies diversity determines pathogenesis through cooperative interactions in a viral population.
  • Dec 4, 2005
  • Nature
  • Marco Vignuzzi + 4 more

An RNA virus population does not consist of a single genotype; rather, it is an ensemble of related sequences, termed quasispecies. Quasispecies arise from rapid genomic evolution powered by the high mutation rate of RNA viral replication. Although a high mutation rate is dangerous for a virus because it results in nonviable individuals, it has been hypothesized that high mutation rates create a 'cloud' of potentially beneficial mutations at the population level, which afford the viral quasispecies a greater probability to evolve and adapt to new environments and challenges during infection. Mathematical models predict that viral quasispecies are not simply a collection of diverse mutants but a group of interactive variants, which together contribute to the characteristics of the population. According to this view, viral populations, rather than individual variants, are the target of evolutionary selection. Here we test this hypothesis by examining the consequences of limiting genomic diversity on viral populations. We find that poliovirus carrying a high-fidelity polymerase replicates at wild-type levels but generates less genomic diversity and is unable to adapt to adverse growth conditions. In infected animals, the reduced viral diversity leads to loss of neurotropism and an attenuated pathogenic phenotype. Notably, using chemical mutagenesis to expand quasispecies diversity of the high-fidelity virus before infection restores neurotropism and pathogenesis. Analysis of viruses isolated from brain provides direct evidence for complementation between members in the quasispecies, indicating that selection indeed occurs at the population level rather than on individual variants. Our study provides direct evidence for a fundamental prediction of the quasispecies theory and establishes a link between mutation rate, population dynamics and pathogenesis.

  • Research Article
  • Citations: 13
  • 10.1093/bioinformatics/btaa782
Inference of viral quasispecies with a paired de Bruijn graph
  • Sep 14, 2020
  • Bioinformatics
  • Borja Freire + 3 more

RNA viruses exhibit a high mutation rate and thus they exist in infected cells as a population of closely related strains called viral quasispecies. The viral quasispecies assembly problem asks to characterize the quasispecies present in a sample from high-throughput sequencing data. We study the de novo version of the problem, where reference sequences of the quasispecies are not available. Current methods for assembling viral quasispecies are either based on overlap graphs or on de Bruijn graphs. Overlap graph-based methods tend to be accurate but slow, whereas de Bruijn graph-based methods are fast but less accurate. We present viaDBG, which is a fast and accurate de Bruijn graph-based tool for de novo assembly of viral quasispecies. We first iteratively correct sequencing errors in the reads, which allows us to use large k-mers in the de Bruijn graph. To incorporate the paired-end information in the graph, we also adapt the paired de Bruijn graph for viral quasispecies assembly. These features enable the use of long-range information in contig construction without compromising the speed of de Bruijn graph-based approaches. Our experimental results show that viaDBG is both accurate and fast, whereas previous methods are either fast or accurate but not both. In particular, viaDBG has comparable or better accuracy than SAVAGE, while being at least nine times faster. Furthermore, the speed of viaDBG is comparable to PEHaplo, but viaDBG is also able to retrieve low-abundance quasispecies, which are often missed by PEHaplo. viaDBG is implemented in C++ and it is publicly available at https://bitbucket.org/bfreirec1/viadbg. All datasets used in this article are publicly available at https://bitbucket.org/bfreirec1/data-viadbg/. Supplementary data are available at Bioinformatics online.

  • Research Article
  • Citations: 89
  • 10.1016/j.jmb.2010.02.005
Unfinished Stories on Viral Quasispecies and Darwinian Views of Evolution
  • Feb 10, 2010
  • Journal of Molecular Biology
  • Antonio Más + 4 more

  • Book Chapter
  • Citations: 8
  • 10.1007/978-94-007-4899-6_2
Quasispecies Dynamics of RNA Viruses
  • Jan 1, 2012
  • Viruses: Essential Agents of Life
  • Miguel Angel Martínez + 5 more

RNA viruses, such as human immunodeficiency virus, hepatitis C virus, influenza virus, and poliovirus, replicate with very high mutation rates and exhibit very high genetic diversity. As a consequence of this extremely high genetic diversity, RNA virus populations replicate as complex mutant spectra known as viral quasispecies. The quasispecies dynamics of RNA viruses are closely related to viral pathogenesis and disease, and antiviral treatment strategies. Over the past several decades, the quasispecies concept has been expanded to provide an adequate framework to explain complex behavior of RNA virus populations. Recently, the quasispecies concept has been used to study other complex biological systems, such as tumor cells, bacteria, and prions. Here, we focus on some questions regarding viral and theoretical quasispecies concepts, as well as more practical aspects connected to pathogenesis and resistance to antiviral treatments. A better knowledge of virus diversification and evolution may be critical in preventing and treating the spread of pathogenic viruses.

More from: Journal of Computational Biology
  • Research Article
  • 10.1089/cmb.2024.15655.rfs2023
Rosalind Franklin Society Proudly Announces the 2023 Award Recipient for Journal of Computational Biology
  • Sep 1, 2024
  • Journal of Computational Biology
  • Teresa M Przytycka

  • Research Article
  • 10.1089/cmb.2023.0198
Singular Value Decomposition-Based Penalized Multinomial Regression for Classifying Imbalanced Medulloblastoma Subgroups Using Methylation Data.
  • May 1, 2024
  • Journal of Computational Biology
  • Isra Mohammed + 2 more

  • Research Article
  • 10.1089/cmb.2023.0174
A Bayesian Change Point Model for Dynamic Alternative Transcription Start Site Usage During Cellular Differentiation.
  • May 1, 2024
  • Journal of Computational Biology
  • Juan Xia + 5 more

  • Research Article
  • Citations: 2
  • 10.1089/cmb.2023.0400
Orthology and Paralogy Relationships at Transcript Level.
  • Apr 1, 2024
  • Journal of Computational Biology
  • Wend Yam D.D Ouedraogo + 1 more

  • Open Access
  • Research Article
  • Citations: 3
  • 10.1089/cmb.2023.0149
MiCId GUI: The Graphical User Interface for MiCId, a Fast Microorganism Classification and Identification Workflow with Accurate Statistics and High Recall
  • Feb 1, 2024
  • Journal of Computational Biology
  • Aleksey Ogurtsov + 7 more

  • Research Article
  • Citations: 2
  • 10.1089/cmb.2023.0208
A Framework for Improving the Generalizability of Drug-Target Affinity Prediction Models.
  • Nov 1, 2023
  • Journal of Computational Biology
  • Rıza Özçelik + 5 more

  • Research Article
  • 10.1089/cmb.2023.29101.ht
Preface: RECOMB 2023 Special Issue
  • Nov 1, 2023
  • Journal of Computational Biology
  • Haixu Tang

  • Research Article
  • 10.1089/cmb.2023.0194
A Computational Software for Training Robust Drug-Target Affinity Prediction Models: pydebiaseddta.
  • Nov 1, 2023
  • Journal of Computational Biology
  • Melih Barsbey + 5 more

  • Research Article
  • Citations: 3
  • 10.1089/cmb.2023.0242
Shortest Hyperpaths in Directed Hypergraphs for Reaction Pathway Inference.
  • Oct 31, 2023
  • Journal of Computational Biology
  • Spencer Krieger + 1 more

  • Research Article
  • 10.1089/cmb.2023.0185
QR-STAR: A Polynomial-Time Statistically Consistent Method for Rooting Species Trees Under the Coalescent.
  • Oct 30, 2023
  • Journal of Computational Biology
  • Yasamin Tabatabaee + 2 more
