Software to perform automated comparisons of pair-wise percent identities for microbial species

Sean Conlan,Lee Ann McCue

doi:10.2144/000112170

Abstract

BioTechniquesVol. 40, No. 5 BenchmarksOpen AccessSoftware to perform automated comparisons of pair-wise percent identities for microbial speciesSean Conlan & Lee Ann McCueSean Conlan*Address correspondence to Sean Conlan, Columbia University, 722 W. 168th Street, 1901H, New York, NY 10032, USA. e-mail: E-mail Address: identity@wadsworth.orgWadsworth, Center for Medical Science, Albany, NY, USASearch for more papers by this author & Lee Ann McCueWadsworth, Center for Medical Science, Albany, NY, USASearch for more papers by this authorPublished Online:21 May 2018https://doi.org/10.2144/000112170AboutSectionsView ArticleSupplemental MaterialPDF/EPUB ToolsAdd to favoritesDownload CitationsTrack Citations ShareShare onFacebookTwitterLinkedInReddit View article The field of comparative genomics, which makes inferences about the properties of an organism through comparison of its genome to the genomes of related organisms, has seen rapid growth as a result of high-throughput sequencing initiatives. At the time of this study, there are almost 300 completely sequenced microbial genomes and over 500 partially sequenced microbial genomes. This wealth of genomic data means that, for a given species, there will likely be many related species with sequenced genomes available for comparison. This is useful, for example, when identifying transcription factor binding sites using phylogenetic footprinting. It has been shown that including closely related species is valuable in phylogenetic footprinting studies, because species within a close phylogenetic group are likely to share common regulatory mechanisms (1,2). Regulatory motif detection is complicated, however, by the sequence correlation present between related species. Here we define sequence correlation as similarity between orthologous sequences that is due to recent speciation rather than functional constraints. Therefore, it is useful to understand the level of correlation in the sequence data prior to initiating such a study. We present two programs (collect.identity.pl and analyze.identity.pl) that automate the tasks of performing pair-wise sequence alignments between sets of homologous sequences and generating summary statistics, as well as the data required for additional statistical analyses. This provides a way to compare a focused group of related genomes across a large number of homologous loci.The input data for collect.identity.pl are sets of homologous sequences. In our case, the sequences are intergenic regions, but any homologous data, including gene sequences, can be used. No tools for compiling or formatting sequences are provided, but methods to identify orthologous genes and extract corresponding gene or intergenic sequences are described elsewhere (3,4). In the example below, orthologous intergenic regions, greater than 50 nucleo-tides in length and no longer than 500 nucleotides in length, were extracted using a Basic Local Alignment Search Tool (BLAST)-based method described previously (5). Orthologous sequences are organized into FASTA-formatted files, one for each intergenic or gene sequence, with each file containing a set of orthologous sequences from multiple species. In order to facilitate the tracking of sequences and simplify input into the alignment programs, collect.identity.pl requires that the FASTA headers for sequences from a given species contain a user-defined identifier. Since the FASTA header associates a given sequence with an organism, sequences can occur in any order in the FASTA file, and individual files need not contain a sequence from each species.The program collect.identity.pl calls an external pair-wise aligner and aggregates the resulting percent identity information. Currently, three external pair-wise alignment programs are supported: Gap (Wisconsin Package; Accelrys Software, San Diego, CA, USA), Needle (6), and Water (6). In addition, the widely used multiple sequence alignment algorithm Clustal W (7) is also supported. The percent identities are collected from the alignment outputs and compiled into a file (identity.txt) that can be visualized using a statistical package (e.g., R; www.r-project.org) or a spreadsheet. In addition, the analyze.identity.pl program reads the identity.txt file and calculates the mean, median, standard deviation, interquartile range (IQR), and quartile skewness coefficient of the percent identity distributions for each species pair.Five γ-proteobacterial species, selected to span a range of phylogenetic distances, were compared, with the results shown in Figure 1. It is possible to examine both the central tendency of the data (e.g., median and mean) and the shape of the distribution from a boxplot or histogram. Percent identities are determined on a per locus basis and are reported without regard to the sequence length. In the case of the data used here, no strong correlation was found between sequence length and percent identity. It should be noted, however, that for random sequences of various fixed lengths, the shape (e.g., standard deviation and skew) of the percent identity distribution was found to be strongly dependent on sequence length (see the supplementary material). Therefore, we provide the program shuffle. fasta.pl, which will produce shuffled versions of the input data for analysis of sequence length and nucleotide composition dependencies (Figure 1, bottom). Furthermore, individual distributions can be analyzed to determine whether they are described by a single distribution or by multiple overlapping distributions that may reflect differing rates of evolution for subsets of sequences. Automatic detection of multimodal distributions is beyond the scope of this program, but distributions with unusual summary statistics should be investigated using histograms or quantile-quantile plots (see the supplementary material).Figure 1. Boxplot of percent identities.The top panel shows the percent identities for intergenic regions between five bacterial species. ECOL, Escherichia coli K12 (NC_000913); SFLE, Shigella fiexneri 2a str. 301 (NC_0O4337); SENT, Salmonella enterica Typhi CT18 (NC_003198); YKIM, Yersinia pestis KIM (NC_0O4088); SONE, Shewanella oneidensis MR-1 (NC_004347). Outliers described in the text are indicated by a star. The bottom panel shows the percent identities between species after randomly shuffling each sequence. Boxplots were made in R from the identity.txt file. Each box delineates the distance between the first and third quartiles and is bisected by the median value. Whiskers extend to the furthest data point, less that 3/2 the interquartile range from the end of the box. Outliers are shown as open Gircles.These software tools provide a way to compare sequence conservation between multiple genomes and extract information that can be used when designing and analyzing the results from a comparative genomics study. The statistics that describe the central tendency of a given percent identity distribution (i.e., mean and median) provide a measure of the overall relatedness of the two species. Such measures have provided valuable information in our phylogenetic footprinting studies, revealing sequence correlation that influenced motif prediction (1,2). In addition, it is useful to consider the shape of each distribution, because it is often necessary to account for sequence correlation that is not reflected by the overall level of conservation between two species. For instance, in our comparative study of several a-proteobacterial species, motif predictions found in anomalously conserved intergenic sequences between Rhodopseudomonas palustris and Bradyrhizobium japonicum were held to more stringent significance criteria (1).While we have found these tools to be valuable when planning and executing a phylogenetic footprinting study, we have also found that individual intergenic regions or genes of interest can be identified by analysis of the outlying data points. For instance, the two most extreme outliers in the Escherichia coli-Yersinia pestis comparison (Figure 1) correspond to the regions upstream of the rpsT and rpsG genes, which encode ribosomal proteins S20 and S7, respectively. It is known from mutational analyses that rpsT is regulated by a conserved RNA leader sequence encoded by its upstream intergenic region (8). Likewise, the S7 protein has been shown to regulate its own expression and the expression of neighboring genes by binding to the region upstream of rpsG (9). These functional constraints likely explain the abnormally high sequence conservation for these intergenic regions.A second example of anomalous conservation is seen in the comparison of E. coli and Shewanella oneidensis intergenic regions (Figure 1). The region upstream of the ylbG gene in E. coli is 76.8% identical to a region upstream of the predicted orthologous gene in S. oneidensis, significantly higher than the mean identity between this species pair (40.1%). The ylbG gene product is annotated as a putative regulator in E. coli. The orthologous gene in S. oneidensis, however, is annotated as a transposase within an insertion sequence (TSSol3). Alignment of the orthologous gene sequences and upstream intergenic regions showed that the entire region is conserved indicating that the ylbG gene is more likely the remnant of a transposable element. These two examples of interesting outliers suggest that the tools may have some value to researchers interested in horizontal gene transfer or identification of unannotated RNA leader sequences and small RNA genes.The software is written in Perl and is freely available, along with supplementary data, from bayesweb. wad sworth. org/pro kreg. html.AcknowledgmentsWe thank the Computational Molecular Biology and Statistics Core Facility at the Wadsworth Center, Andrew Reilly, and Lee Newberg for helpful discussions. This research was supported by Department of Energy grants DE-FG02-01ER63204 and DE-FG02-04ER63942 to L.A.M.Competing Interests StatementThe authors declare no competing interests.Supplementary dataTo view the supplementary data that accompany this paper please visit the journal website at: www.future-science.com/doi/suppl/10.2144/000112170

Full Text