HAlign is a high-performance multiple sequence alignment tool based on the star alignment strategy, which is the preferred choice for rapidly aligning large numbers of sequences. HAlign3, implemented in Java, is the latest version capable of aligning an ultra-large number of similar DNA/RNA sequences. However, HAlign3 still struggles with long sequences and extremely large numbers of sequences. To address this issue, we have implemented HAlign4 in C++. In this version, we replaced the original suffix tree with the Burrows-Wheeler Transform (BWT) and introduced the wavefront alignment algorithm to further optimize both time and memory efficiency. Experiments show that HAlign4 significantly outperforms HAlign3 in runtime and memory usage in both single-threaded and multi-threaded configurations, while maintaining high alignment accuracy comparable to MAFFT. HAlign4 can complete the alignment of 10 million COVID-19 sequences in about 12 minutes using 96 threads and 300 GB of memory, demonstrating its efficiency and practicality for large-scale alignment on standard workstations. Source code is available at https://github.com/malabz/HAlign-4, and the dataset is available at https://zenodo.org/records/13934503. Supplementary data are available at Bioinformatics online.
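
To illustrate why a BWT can stand in for a suffix tree when searching for substrings shared with a center sequence, the following is a minimal sketch of a naive BWT construction and backward search. It is not HAlign4's implementation (which is written in C++); the function names `bwt` and `count_occurrences`, the `$` sentinel, and the quadratic-time construction are all assumptions chosen for readability.

```python
from collections import Counter


def bwt(text: str, sentinel: str = "$") -> str:
    """Naive BWT: sort all rotations of text+sentinel, take the last column.

    Illustrative only; real tools build the transform from a suffix array.
    """
    if sentinel in text:
        raise ValueError("sentinel must not occur in the text")
    s = text + sentinel
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(rotation[-1] for rotation in rotations)


def count_occurrences(bwt_str: str, pattern: str) -> int:
    """Count occurrences of pattern in the original text via backward search."""
    counts = Counter(bwt_str)
    # first_row[c]: number of characters in the text strictly smaller than c,
    # i.e. the first row of the sorted rotations that starts with c.
    first_row = {}
    total = 0
    for c in sorted(counts):
        first_row[c] = total
        total += counts[c]

    def occ(c: str, i: int) -> int:
        # Number of occurrences of c in bwt_str[:i] (linear scan for clarity).
        return bwt_str.count(c, 0, i)

    lo, hi = 0, len(bwt_str)  # half-open range of matching rows
    for c in reversed(pattern):
        if c not in first_row:
            return 0
        lo = first_row[c] + occ(c, lo)
        hi = first_row[c] + occ(c, hi)
        if lo >= hi:
            return 0
    return hi - lo


if __name__ == "__main__":
    transformed = bwt("banana")
    print(transformed)
    print(count_occurrences(transformed, "ana"))
```

The transformed string takes space linear in the sequence length, and each search step only needs character counts over it, which is the general reason BWT-based indexes tend to use less memory than suffix trees. A production index would precompute the counts instead of rescanning the string at every step.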