Abstract

Multiple sequence alignment (MSA) is crucial for high-throughput next-generation sequencing applications, which require large-scale alignments of thousands of sequences. However, the alignment quality of current MSA tools decreases sharply when the number of sequences grows to several thousand. This accuracy degradation can be mitigated using global consistency information, as in the T-Coffee MSA tool, which implements a consistency library. However, consistency-based methods do not scale well because the computational resources required to calculate and store the consistency information grow quadratically with the number of sequences. In this paper, we propose an alternative method for building the consistency library. To allow unlimited scalability, some consistency information must be discarded to avoid exceeding the available memory. Our first approach deals with the memory limitation by identifying and keeping the most important entries, those that provide better consistency. This method achieves scalability, although at a cost in accuracy. The second proposal aims to reduce this accuracy degradation and presents three different methods to attain a better alignment.

Highlights

  • Multiple sequence alignment (MSA) is a key tool in several bioinformatics applications, such as protein/RNA structure prediction, phylogenetic analysis, and pattern recognition

  • This is because the recovery data present in the dynamic substitution matrix (DSM) and related consistency (RC) substitution matrices take into account information obtained in the library generation step that is directly related to the sequences being aligned

  • To improve the accuracy of the resulting alignments, we propose different methods to replace part of the consistency information lost due to the huge reduction in the library size


Summary

Introduction

Multiple sequence alignment (MSA) is a key tool in several bioinformatics applications, such as protein/RNA structure prediction, phylogenetic analysis, and pattern recognition. All current MSA tools have exhibited scalability issues as the number of sequences increases. Among these problems are the inability to align so many sequences (for lack of sufficient computational resources), prohibitive execution times, and a significant degradation in accuracy. Saté-II is a divide-and-conquer iterative meta-method that can be applied to any existing external MSA method. In spite of such improvements in the scalability and performance of these methods, some recent studies have shown that all of the main MSA packages obtain good accuracy only when they are applied to small-to-medium datasets (10–1,000 sequences). In this paper the authors use the consistency-based MSA tool T-Coffee [7] in large-scale alignments and present an innovative solution to reduce the amount of memory needed to store the consistency information.
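To make the memory argument concrete, the following is a minimal sketch, not the authors' implementation. It assumes a hypothetical library layout in which each entry is a residue-pair constraint with a numeric weight, and it assumes that "most important" can be approximated by the highest weight; the summary does not state the actual selection criterion. The names `pair_count` and `prune_library` are illustrative only.

```python
import heapq
from typing import List, Tuple

# Hypothetical entry layout (an assumption for illustration):
# ((seq_a, pos_a, seq_b, pos_b), weight)
Constraint = Tuple[Tuple[int, int, int, int], float]


def pair_count(n_sequences: int) -> int:
    """Number of sequence pairs, which is why the library grows quadratically."""
    return n_sequences * (n_sequences - 1) // 2


def prune_library(entries: List[Constraint], max_entries: int) -> List[Constraint]:
    """Keep at most max_entries constraints, preferring the highest weights.

    Weight is used here as a stand-in for 'importance' (an assumption).
    """
    if max_entries <= 0:
        return []
    return heapq.nlargest(max_entries, entries, key=lambda entry: entry[1])


if __name__ == "__main__":
    library = [
        ((0, 3, 1, 4), 0.90),
        ((0, 5, 2, 5), 0.20),
        ((1, 7, 2, 8), 0.75),
        ((0, 9, 2, 9), 0.10),
    ]
    kept = prune_library(library, max_entries=2)
    print(pair_count(10), pair_count(1000))
    print(kept)
```

Going from 10 to 1,000 sequences multiplies the number of sequence pairs by more than 10,000, which is why a fixed memory budget forces some entries to be dropped. A bounded selection such as `heapq.nlargest` keeps the budget explicit, at the cost of losing the lower-weight consistency information that the paper's second proposal tries to recover.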

T-Coffee MSA tool
T-Coffee scalability
T-Coffee optimized library
Scalable consistency library
Recovering accuracy
Experimental study
Library size and its impact on accuracy
Scalability study
Findings
Conclusions