Recovering accuracy methods for scalable consistency library

Jordi Lladós, Fernando Cores, Cedric Notredame, Josep Lluís Lérida, Fernando Guirado

doi:10.1007/s11227-014-1362-z

Open Access

https://doi.org/10.1007/s11227-014-1362-z

Journal: The Journal of Supercomputing	Publication Date: Dec 31, 2014
Citations: 13	License type: CC BY 4.0

Affiliation: University of Lleida

Abstract

Multiple sequence alignment (MSA) is crucial for high-throughput next-generation sequencing applications, which require large-scale alignments of thousands of sequences. However, the alignment quality of current MSA tools decreases sharply when the number of sequences grows to several thousand. This accuracy degradation can be mitigated using global consistency information, as in the T-Coffee MSA tool, which implements a consistency library. However, consistency-based methods do not scale well, because the computational resources required to calculate and store the consistency information grow quadratically. In this paper, we propose an alternative method for building the consistency library. To allow unlimited scalability, some consistency information must be discarded so that the library does not exceed the available memory. Our first approach deals with the memory limitation by identifying the most important entries, those which provide better consistency. This method achieves scalability, although with a negative impact on accuracy. The second proposal aims to reduce this degradation of accuracy, and presents three different methods to attain a better alignment.
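
The two ideas in the abstract, quadratic growth and a memory-capped library that keeps only the most important entries, can be illustrated with a minimal Python sketch. It is not the T-Coffee implementation or the selection method of the paper: the class `BoundedLibrary`, the function `pair_count`, the entry layout, and the use of a single numeric weight as the measure of importance are all assumptions made for illustration.

```python
import heapq


def pair_count(n_sequences: int) -> int:
    """Number of sequence pairs among n sequences: n(n-1)/2 (quadratic growth)."""
    return n_sequences * (n_sequences - 1) // 2


class BoundedLibrary:
    """Hypothetical library that keeps at most `max_entries` weighted entries.

    When the cap is reached, the entry with the lowest weight is discarded,
    so memory use stays bounded regardless of how many entries are offered.
    """

    def __init__(self, max_entries: int):
        if max_entries <= 0:
            raise ValueError("max_entries must be positive")
        self.max_entries = max_entries
        self._heap = []  # min-heap ordered by weight: lowest weight at index 0

    def add(self, weight: float, entry: tuple) -> None:
        """Offer an entry; it is kept only if it ranks among the top weights."""
        item = (weight, entry)
        if len(self._heap) < self.max_entries:
            heapq.heappush(self._heap, item)
        elif item > self._heap[0]:
            # Replace the current lowest-weight entry with the new one.
            heapq.heapreplace(self._heap, item)

    def entries(self) -> list:
        """Return the retained (weight, entry) pairs, highest weight first."""
        return sorted(self._heap, reverse=True)


if __name__ == "__main__":
    # Pair counts for small, medium and large datasets.
    for n in (10, 1000, 10000):
        print(n, pair_count(n))

    # Assumed entry layout: (sequence A, position in A, sequence B, position in B).
    lib = BoundedLibrary(max_entries=2)
    lib.add(0.9, ("seq1", 4, "seq2", 5))
    lib.add(0.2, ("seq1", 7, "seq3", 7))
    lib.add(0.5, ("seq2", 1, "seq3", 2))
    print(lib.entries())
```

A min-heap is used here because it gives access to the weakest retained entry in constant time, so each offered entry costs at most logarithmic time in the cap size. The discarded entries are the consistency information whose loss the recovery methods are meant to compensate for.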

Highlights

Multiple sequence alignment (MSA) is a key tool in several bioinformatics applications, such as protein/RNA structure prediction, phylogenetic analysis and pattern recognition.
The recovery data, present in the Dynamic substitution matrix (DSM) and Related consistency (RC) substitution matrices, take into account information obtained in the library generation step that is directly related to the sequences being aligned.
To improve the accuracy of the resulting alignments, we propose different methods to replace part of the consistency information lost due to the huge reduction in the library size.

Summary

Introduction

Multiple sequence alignment (MSA) is a key tool in several bioinformatics applications, such as protein/RNA structure prediction, phylogenetic analysis and pattern recognition. All current MSA tools exhibit scalability issues when the number of sequences increases. Among these problems are the inability to align so many sequences (for lack of sufficient computational resources), prohibitive execution times, and a significant degradation in accuracy. Saté-II is a divide-and-conquer iterative meta-method that can be applied to any existing external MSA method. In spite of such improvements in the scalability and performance of these methods, some recent studies have shown that all of the main MSA packages obtain good accuracy only when they are applied to small or medium datasets (10–1,000 sequences). In this paper the authors use the consistency-based MSA tool T-Coffee [7] for large-scale alignments, and present an innovative solution to reduce the amount of memory needed to store the consistency information.

T-Coffee MSA tool

T-Coffee scalability

T-Coffee optimized library

Scalable consistency library

Recovering accuracy

Experimental study

Library size and its impact on accuracy

Scalability study

Findings

Conclusions


