Abstract
High-throughput sequencing has become commonplace in evolutionary studies. Large, rapidly collected genomic datasets are used to capture biodiversity and to monitor disease transmission patterns at national and global scales, among many other applications. Updating homologous sequence datasets with new samples is cumbersome, requiring long program runtimes and extensive data processing. We describe Extensiphy, a bioinformatics tool that efficiently updates multiple sequence alignments with whole-genome short-read data. Extensiphy performs reference-based sequence assembly and alignment in a single process while preserving the length of the original alignment. As input, Extensiphy takes any multiple sequence alignment in FASTA format together with whole-genome short-read sequences in FASTQ format. To validate Extensiphy, we compared its results to those produced by two other methods that construct whole-genome-scale multiple sequence alignments. We compared the methods on program runtime, base-call accuracy, dataset retention in the presence of missing data and phylogenetic accuracy. We found that Extensiphy rapidly produces high-quality updated sequence alignments while preventing alignment shrinkage due to missing data. Phylogenies estimated from alignments produced by Extensiphy are similar in accuracy to those estimated from alignments built with other commonly used methods. Extensiphy is suitable for updating large sequence alignments and is well suited to studies of biodiversity and ecology and to epidemiological monitoring efforts.