Errors in multiple sequence alignments (MSAs) are known to bias many comparative evolutionary methods. In analyses of natural selection, particularly those based on codon evolutionary models, such errors produce excessive rates of false positives. A characteristic signature of error-driven findings is an unrealistically high estimate of dN/dS (e.g., >100) that affects only a small fraction (e.g., ~0.1%) of the alignment. Despite the widespread use of codon models to assess alignment quality, their potential for error correction remains unexplored. We present BUSTED-E: a novel method designed to detect positive selection while concurrently identifying alignment errors. This method is a straightforward adaptation of BUSTED, the flexible branch-site random effects model used to fit distributions of dN/dS, with one important modification: it adds an "error-sink" component representing an abiological evolutionary regime (dN/dS > 100), and it provides the option of masking errors in the MSA for downstream analyses. On data simulated without errors, BUSTED-E shows a small loss of power, which can be mitigated by model-averaged inference. Using four published empirical datasets, we show that BUSTED-E reduces unrealistic rates of positive selection detection, often by an order of magnitude, and improves the biological interpretability of results. The errors that BUSTED-E detects are also largely distinct from those flagged by other popular alignment-cleaning tools (HMMCleaner and BMGE). Overall, BUSTED-E is a robust and scalable solution for improving the accuracy of evolutionary analyses in the presence of residual alignment errors, contributing to a more nuanced understanding of natural selection and adaptive evolution.
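To make the "error-sink" idea concrete, the following minimal Python sketch shows one plausible way a mixture of dN/dS classes with an extra error class could be used to score a codon with Bayes' rule. It is not the BUSTED-E implementation: the names `RateClass` and `error_posterior`, the class weights, the per-class likelihood values, and the masking threshold are all hypothetical, chosen only for illustration. In a real analysis the per-class likelihoods would come from a phylogenetic codon model, which this sketch does not compute.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RateClass:
    """One component of a dN/dS mixture (hypothetical helper type)."""
    omega: float   # dN/dS value of this class
    weight: float  # mixture proportion of this class


def error_posterior(classes, likelihoods, error_index):
    """Posterior probability of the error class via Bayes' rule.

    `likelihoods[k]` is the likelihood of the observed codon data under
    class k; here these values are simply supplied by the caller.
    """
    if len(classes) != len(likelihoods):
        raise ValueError("need exactly one likelihood per rate class")
    joint = [c.weight * lk for c, lk in zip(classes, likelihoods)]
    total = sum(joint)
    if total <= 0.0:
        raise ValueError("total likelihood must be positive")
    return joint[error_index] / total


# Illustrative mixture: three ordinary classes plus an error sink whose
# dN/dS exceeds 100 and whose weight is about 0.1% of the alignment.
classes = [
    RateClass(omega=0.1, weight=0.700),
    RateClass(omega=1.0, weight=0.249),
    RateClass(omega=5.0, weight=0.050),
    RateClass(omega=500.0, weight=0.001),  # error-sink class
]
ERROR_INDEX = 3
MASK_THRESHOLD = 0.5  # hypothetical cutoff, not taken from the source

# Made-up per-class likelihoods for two codons.
suspicious = [1e-12, 1e-11, 1e-10, 1e-6]  # data fit only the error class
ordinary = [1e-6, 1e-7, 1e-8, 1e-9]       # data fit the conserved class

p_suspicious = error_posterior(classes, suspicious, ERROR_INDEX)
p_ordinary = error_posterior(classes, ordinary, ERROR_INDEX)

mask_suspicious = p_suspicious > MASK_THRESHOLD
mask_ordinary = p_ordinary > MASK_THRESHOLD
```

Under these made-up numbers, the first codon would be masked and the second retained. The point of the sketch is that a class with a very small prior weight can still dominate the posterior when the data are far better explained by an abiological dN/dS value than by any ordinary class.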