Phylogenetic analysis of nucleotide and amino acid sequences requires the alignment of homologous sequences. The alignment procedure often requires the insertion of gaps, putatively corresponding to insertion or deletion events, which can be coded as phylogenetic characters. As a general class of phylogenetic characters, gaps have variously been suggested to be reliable (e.g., Lloyd and Calder, 1991; Van Dijk et al., 1999) or unreliable (e.g., Golenberg et al., 1993; Ford et al., 1995). This difference in opinion, coupled with the lack of a well-supported method for the coding of gaps, has led to a diversity of approaches by which gaps have been treated in, or excluded from, tree searches (Gonzalez, 1996).

In an earlier paper we presented two methods, termed simple and complex indel coding, in which all gaps (excluding leading and trailing gaps, which are generally artifacts) can be coded from aligned sequence-based matrices (Simmons and Ochoterena, 2000). Simple indel coding, which is used in this study, is implemented by coding all gaps that have different 5' or 3' termini as separate presence/absence characters. Whenever a gap is being coded and the region it spans is completely included within the span of another gap, those sequences having the longer gap (i.e., one that extends to or beyond both the 5' and 3' termini of the gap being coded) are scored as inapplicable for the gap character being coded.

Some have suggested on theoretical and empirical grounds that longer gaps are better phylogenetic characters than shorter gaps. Lloyd and Calder (1991) argued that multiresidue gaps are reliable phylogenetic characters because indels are unlikely to be repeated in the exact same position with the same length and sequence (for insertions); indels of different lengths at the same position are recognized as separate events. Similarly, van Ham et al. (1994) suggested that, based on the relative levels of homoplasy in the intergenic spacer between trnL and trnF, gaps longer than two positions are reliable phylogenetic characters.

In this paper we assess the relative levels of homoplasy of gap and base characters from a selection of 38 published sequence-based matrices. We determine the potential phylogenetic information included in gap characters and the extent to which inclusion of gap characters alters the gene tree topology and branch support values. We also test the assertion that longer gaps are better phylogenetic characters than shorter gaps.
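
The simple indel coding rule described above can be illustrated with a minimal Python sketch. It is not the implementation used in the study. The function names `gap_spans` and `simple_indel_coding`, the use of `-` as both the gap symbol and the inapplicable state, and the toy alignment are all assumptions made for illustration. In particular, scoring a sequence as `?` where one of its leading or trailing gaps overlaps a coded gap is an assumption of this sketch; the text above says only that such gaps are not themselves coded.

```python
import re


def gap_spans(seq):
    """Split the gaps of one aligned sequence into internal and terminal spans.

    Spans are (start, end) column indices with `end` exclusive.
    """
    internal, terminal = [], []
    for m in re.finditer(r"-+", seq):
        span = (m.start(), m.end())
        if m.start() == 0 or m.end() == len(seq):
            terminal.append(span)  # leading or trailing gap: never coded
        else:
            internal.append(span)
    return internal, terminal


def simple_indel_coding(alignment):
    """Code internal gaps as presence/absence characters.

    `alignment` maps sequence names to equal-length aligned strings.
    Returns the sorted list of gap characters and a name -> state-string map.
    """
    spans = {name: gap_spans(seq) for name, seq in alignment.items()}
    # Every gap with a distinct 5' or 3' terminus becomes its own character.
    characters = sorted({g for internal, _ in spans.values() for g in internal})
    matrix = {}
    for name, (internal, terminal) in spans.items():
        row = []
        for start, end in characters:
            if (start, end) in internal:
                row.append("1")  # this exact gap is present
            elif any(s <= start and end <= e for s, e in internal):
                row.append("-")  # a longer gap spans it: inapplicable
            elif any(s < end and start < e for s, e in terminal):
                row.append("?")  # assumption: terminal-gap overlap is missing data
            else:
                row.append("0")  # gap absent
        matrix[name] = "".join(row)
    return characters, matrix


# Hypothetical toy alignment, for illustration only.
alignment = {
    "A": "ACGTACGTAC",
    "B": "AC--ACGTAC",
    "C": "AC----GTAC",
    "D": "ACG-ACGTAC",
}
characters, matrix = simple_indel_coding(alignment)
for name, states in matrix.items():
    print(name, states)
```

In this toy alignment the three coded gaps span columns 2 to 4, 2 to 6, and 3 to 4 (zero-based, end exclusive). Sequence B is scored `10-`: it has the first gap, lacks the second, and is inapplicable for the third because its own gap extends to or beyond both termini of that shorter gap.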