AMAS: a fast tool for alignment manipulation and computing of summary statistics.

Marek L Borowiec

doi:10.7717/peerj.1660

Abstract

The amount of data used in phylogenetics has grown explosively in recent years, and many phylogenies are now inferred with hundreds or even thousands of loci and many taxa. These modern phylogenomic studies often entail separate analyses of each of the loci in addition to multiple analyses of subsets of genes or concatenated sequences. Computationally efficient tools for handling and computing properties of thousands of single-locus or large concatenated alignments are needed. Here I present AMAS (Alignment Manipulation And Summary), a tool that can be used either as a stand-alone command-line utility or as a Python package. AMAS works on amino acid and nucleotide alignments and combines capabilities of sequence manipulation with a function that calculates basic statistics. The manipulation functions include conversions among popular formats, concatenation, extracting sites and splitting according to a pre-defined partitioning scheme, creation of replicate data sets, and removal of taxa. The statistics calculated include the number of taxa, alignment length, total count of matrix cells, overall number of undetermined characters, percent of missing data, AT and GC contents (for DNA alignments), count and proportion of variable sites, count and proportion of parsimony informative sites, and counts of all characters relevant for a nucleotide or amino acid alphabet. AMAS is particularly suitable for very large alignments with hundreds of taxa and thousands of loci. It is computationally efficient, utilizes parallel processing, and performs better at concatenation than other popular tools. AMAS is a Python 3 program that relies solely on Python’s core modules and needs no additional dependencies. AMAS source code and manual can be downloaded from http://github.com/marekborowiec/AMAS/ under the GNU General Public License.

Highlights

The amount of data used in modern phylogenetics has increased dramatically since the advent of next-generation sequencing (McCormack et al., 2013).
Concatenation runs with FASconCAT-G were done in two modes: with the -i option, which prevents the program from simultaneously calculating alignment statistics and so gives faster computing times, and with simultaneous writing of the statistics.
Concatenation is a function that can be performed by two other popular programs that are used for alignment manipulations in phylogenomic data sets: FASconCAT-G, a Perl program (Kück & Longo, 2014), and Phyutility, written in Java (Smith & Dunn, 2008).

Summary

INTRODUCTION

How to cite this article: Borowiec (2016), AMAS: a fast tool for alignment manipulation and computing of summary statistics.

The amount of data used in modern phylogenetics has increased dramatically since the advent of next-generation sequencing (McCormack et al., 2013). Alignment summary statistics are needed for identifying and filtering out “gappy” or fast-evolving data from downstream analyses such as the ones carried out in the studies cited above. Because the size of alignments used in phylogenetics is growing rapidly, there is a need for a fast and easy-to-use tool that can supplement existing phylogenomic pipelines. A number of freely available tools for manipulating alignments and computing their basic statistics exist, but some of the most popular ones are based on graphical user interfaces (e.g., Mesquite: Maddison & Maddison, 2015) and are not appropriate for command-line or scripted pipeline analyses. Phyutility and FASconCAT-G both allow for concatenation, and the latter is capable of computing various alignment statistics. AMAS is easy to install and use, requires only a standard distribution of Python 3 or newer, and is provided with a detailed instruction manual.
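
To make the two operations discussed here concrete, the following is a minimal Python sketch of concatenation and of a few of the summary statistics named in the abstract. It is an illustration only and is not taken from the AMAS source code. The function names `concatenate` and `summarize`, the dictionary representation of an alignment (taxon name mapped to aligned sequence), and the example loci are all hypothetical. The sketch assumes that `-` and `?` are the only undetermined characters, that taxa absent from a locus are padded with `?`, and that a parsimony informative site is one with at least two character states that each occur in at least two taxa.

```python
from collections import Counter

# Assumption: only gaps and '?' count as undetermined characters.
MISSING = set("-?")


def concatenate(alignments):
    """Join single-locus alignments; pad taxa missing from a locus with '?'."""
    taxa = sorted({taxon for aln in alignments for taxon in aln})
    parts = {taxon: [] for taxon in taxa}
    for aln in alignments:
        locus_length = len(next(iter(aln.values())))
        for taxon in taxa:
            parts[taxon].append(aln.get(taxon, "?" * locus_length))
    return {taxon: "".join(chunks) for taxon, chunks in parts.items()}


def summarize(alignment):
    """Return basic summary statistics for a non-empty alignment."""
    seqs = [seq.upper() for seq in alignment.values()]
    length = len(seqs[0])
    if any(len(seq) != length for seq in seqs):
        raise ValueError("all sequences must have the same length")

    cells = len(seqs) * length
    missing = sum(seq.count(char) for seq in seqs for char in MISSING)

    variable = informative = 0
    for column in zip(*seqs):
        # Count determined character states in this alignment column.
        counts = Counter(char for char in column if char not in MISSING)
        if len(counts) > 1:
            variable += 1
        # Informative: at least two states, each seen in two or more taxa.
        if sum(1 for n in counts.values() if n >= 2) >= 2:
            informative += 1

    return {
        "taxa": len(seqs),
        "length": length,
        "cells": cells,
        "missing": missing,
        "missing_percent": 100 * missing / cells,
        "variable_sites": variable,
        "parsimony_informative_sites": informative,
    }


# Hypothetical toy data: taxon D is absent from the second locus.
locus1 = {"A": "ACGT", "B": "ACGA", "C": "AC-A", "D": "ATGT"}
locus2 = {"A": "GG", "B": "GC", "C": "GC"}

combined = concatenate([locus1, locus2])
stats = summarize(combined)
print(combined["D"])  # ATGT??
print(stats)
```

In this toy example the concatenated matrix has 4 taxa and 6 sites, 3 of its 24 cells are undetermined, 3 sites are variable, and 1 site is parsimony informative. Thresholds on values such as `missing_percent` are the kind of criterion that could be used to filter out "gappy" loci before downstream analyses.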

METHODS

DISCUSSION