Progressive Cactus is a multiple-genome aligner for the thousand-genome era

Joel Armstrong, Jeremy Johnson, Robert S Harris, Jessica Alföldi, Guojie Zhang, Ian T Fiddes, Alden Deran, Glenn Hickey, David Haussler, Shaohong Feng, Diane P Genereux, Josefin Stiller, Adam M Novak, Elinor K Karlsson, Kerstin Lindblad‐Toh, Voichita D Marinescu, Erich D Jarvis, Duo Xie, Qi Fang, Mark Diekhans, Benedict Paten

doi:10.1038/s41586-020-2871-y

Abstract

New genome assemblies have been arriving at a rapidly increasing pace, thanks to decreases in sequencing costs and improvements in third-generation sequencing technologies[1–3]. For example, the number of vertebrate genome assemblies currently in the NCBI (National Center for Biotechnology Information) database[4] increased by more than 50% to 1,485 assemblies in the year from July 2018 to July 2019. In addition to this influx of assemblies from different species, new human de novo assemblies[5] are being produced, which enable the analysis of not only small polymorphisms, but also complex, large-scale structural differences between human individuals and haplotypes. This coming era and its unprecedented amount of data offer the opportunity to uncover many insights into genome evolution but also present challenges in how to adapt current analysis methods to meet the increased scale. Cactus[6], a reference-free multiple genome alignment program, has been shown to be highly accurate, but the existing implementation scales poorly with increasing numbers of genomes, and struggles in regions of highly duplicated sequences. Here we describe progressive extensions to Cactus to create Progressive Cactus, which enables the reference-free alignment of tens to thousands of large vertebrate genomes while maintaining high alignment quality. We describe results from an alignment of more than 600 amniote genomes, which is to our knowledge the largest multiple vertebrate genome alignment created so far.

Highlights

The version of Cactus available in 2012 performed very well in the Alignathon[14], an evaluation of genome aligners.
Progressive aligners use a ‘guide tree’ to recursively break a multiple alignment problem into many smaller sub-alignments, each of which is solved independently; the resulting sub-alignments are then aligned together according to the tree structure to create the final alignment.
Progressive Cactus produces alignments with higher accuracy for both simulated primate (F1 score of 0.989) and mammal (F1 score of 0.795) clades than any aligner that participated in the Alignathon (Supplementary Tables 1, 2), including the original version of Cactus.
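
The guide-tree decomposition described in the highlights can be illustrated with a minimal sketch. This is a generic outline of progressive decomposition, not the Progressive Cactus implementation: the Node class, the subproblems function, and the node labels (Anc0, Anc1) are hypothetical names chosen for illustration. It only shows how a post-order walk of a guide tree yields one small sub-alignment per internal node, with children scheduled before their parents.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """A guide-tree node; leaves are genomes, internal nodes are sub-alignments."""
    name: str
    children: list = field(default_factory=list)


def subproblems(root):
    """List (internal node, child names) pairs in post-order.

    Each pair is one small alignment problem. Because children are
    visited before their parent, every sub-alignment is scheduled
    before the larger problem that depends on it.
    """
    order = []

    def visit(node):
        for child in node.children:
            visit(child)
        if node.children:  # leaves need no alignment of their own
            order.append((node.name, [c.name for c in node.children]))

    visit(root)
    return order


# Hypothetical three-genome guide tree: ((human, chimp), mouse)
tree = Node("Anc0", [Node("Anc1", [Node("human"), Node("chimp")]), Node("mouse")])
plan = subproblems(tree)
for parent, members in plan:
    print(parent, "<-", members)
```

In this toy tree the human and chimp genomes form the first sub-alignment, and its result (labelled Anc1 here) is then aligned with mouse at the root. Sub-alignments in disjoint subtrees do not depend on one another, which is what allows them to be solved independently.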

Summary

Evaluation on simulated data

The Alignathon simulated datasets[14] have been aligned with many competing genome aligners and have a known truth set, providing a way to compare Progressive Cactus against other genome aligners. Comparison of Cactus and LASTZ coding sequence mappings to the union of the translated alignments, both in terms of individual gene counts and coding and mRNA bases, showed that Cactus has a marginally higher fraction of shared elements with the translated alignments than LASTZ (Supplementary Table 9). Supporting this result, a comparison of the median per-transcript and per-gene base-level Jaccard similarity of these mappings to chicken showed that, while Progressive Cactus and LASTZ were most similar to each other, Progressive Cactus was more similar to translated BLAT and Blast than LASTZ was. Unlike Progressive Cactus, MULTIZ is reference-biased, and the difference is starker when looking at the number of bases aligned to a genome not used as the MULTIZ reference (an average of 79% of the zebra finch covered versus 49.2%, for an average increase of 367 Mb) (Fig. 4).
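
The two comparison measures mentioned above, F1 against a known truth set and base-level Jaccard similarity between two mappings, can be sketched as simple set operations. This is an illustrative sketch under stated assumptions, not the evaluation code used in the paper: it assumes each alignment or mapping is represented as a set of hashable items (here, hypothetical pairs of aligned positions), and the function names and toy data are invented for the example.

```python
def jaccard(a, b):
    """Jaccard similarity: size of the intersection over size of the union."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def f1_score(predicted, truth):
    """Harmonic mean of precision and recall against a truth set."""
    predicted, truth = set(predicted), set(truth)
    true_positives = len(predicted & truth)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(truth)
    return 2 * precision * recall / (precision + recall)


# Toy data (assumption): each item is a (position in genome A, position in genome B) pair.
truth = {(1, 1), (2, 2), (3, 3), (4, 4)}
predicted = {(2, 2), (3, 3), (4, 4), (5, 6)}

print("F1:", f1_score(predicted, truth))
print("Jaccard:", jaccard(predicted, truth))
```

Here three of the four predicted pairs are correct and three of the four true pairs are recovered, so precision, recall, and F1 are all 0.75, while the Jaccard similarity is 3 shared pairs out of 5 distinct pairs, or 0.6.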

Discussion

Methods

Code availability

Journal: Nature
Publication Date: Nov 11, 2020
Citations: 309
License type: open-access
