Long-Read Metagenomics Improves the Recovery of Viral Diversity from Complex Natural Marine Samples.

Asier Zaragoza-Solas,Francisco Rodriguez-Valera,Jose M Haro-Moreno,Mario López-Pérez,Julie A Huber

doi:10.1128/msystems.00192-22

Asier Zaragoza-Solas, Francisco Rodriguez-Valera + Show 3 more

Open Access

https://doi.org/10.1128/msystems.00192-22

Copy DOI

Journal: mSystems	Publication Date: Jun 13, 2022
Citations: 13	License type: CC BY 4.0

Affiliation: Miguel Hernandez University

Abstract

ABSTRACTThe recovery of DNA from viromes is a major obstacle in the use of long-read sequencing to study their genomes. For this reason, the use of cellular metagenomes (>0.2-μm size range) emerges as an interesting complementary tool, since they contain large amounts of naturally amplified viral genomes from prelytic replication. We have applied second-generation (Illumina NextSeq; short reads) and third-generation (PacBio Sequel II; long reads) sequencing to compare the diversity and features of the viral community in a marine sample obtained from offshore waters of the western Mediterranean. We found that a major wedge of the expected marine viral diversity was directly recovered by the raw PacBio circular consensus sequencing (CCS) reads. More than 30,000 sequences were detected only in this data set, with no homologues in the long- and short-read assembly, and ca. 26,000 had no homologues in the large data set of the Global Ocean Virome 2 (GOV2), highlighting the information gap created by the assembly bias. At the level of complete viral genomes, the performance was similar in both approaches. However, the hybrid long- and short-read assembly provided the longest average length of the sequences and improved the host assignment. Although no novel major clades of viruses were found, there was an increase in the intraclade genomic diversity recovered by long reads that produced an enriched assessment of the real diversity and allowed the discovery of novel genes with biotechnological potential (e.g., endolysin genes).IMPORTANCE We explored the vast genetic diversity of environmental viruses by using a combination of cellular metagenome (as opposed to virome) sequencing using high-fidelity long-read sequences (in this case, PacBio CCS). This approach resulted in the recovery of a representative sample of the viral population, and it performed better (more phage contigs, larger average contig size) than Illumina sequencing applied to the same sample. By this approach, the many biases of assembly are avoided, as the CCS reads recovers (typically around 5 kb) complete genes and even operons, resulting in a better discovery of the viral gene diversity based on viral marker proteins. Thus, biotechnologically promising genes, such as endolysin genes, can be very efficiently searched with this approach. In addition, hybrid assembly produces more complete and longer contigs, which is particularly important for studying little-known viral groups such as the nucleocytoplasmic large DNA viruses (NCLDV).

Full Text