Abstract
Background: A standard procedure in many areas of bioinformatics is to use a multiple sequence alignment (MSA) as the basis for various types of homology-based inference. Applications include 3D structure modelling, protein functional annotation, prediction of molecular interactions, etc. These applications, however sophisticated, are generally highly sensitive to the alignment used, and neglecting non-homologous or uncertain regions in the alignment can lead to significant bias in the subsequent inferences.

Results: Here, we present a new method, LEON-BIS, which uses a robust Bayesian framework to estimate the homologous relations between sequences in a protein multiple alignment. Sequences are clustered into sub-families and relations are predicted at different levels, including ‘core blocks’, ‘regions’ and full-length proteins. The accuracy and reliability of the predictions are demonstrated in large-scale comparisons using well annotated alignment databases, where the homologous sequence segments are detected with very high sensitivity and specificity.

Conclusions: LEON-BIS uses robust Bayesian statistics to distinguish the portions of multiple sequence alignments that are conserved either across the whole family or within subfamilies. LEON-BIS should thus be useful for automatic, high-throughput genome annotations, 2D/3D structure predictions, protein-protein interaction predictions, etc.
Highlights
A standard procedure in many areas of bioinformatics is to use a multiple sequence alignment (MSA) as the basis for various types of homology-based inference
The alignments have a high proportion of sequences with ‘discrepancies’ that may correspond to naturally occurring variants or may be the result of artifacts, including proteins translated from partially sequenced genomes or ESTs, or badly predicted protein sequences
Sequence-level homology analysis: To evaluate the accuracy of LEON-BIS for the detection of related and unrelated sequences, we constructed a large-scale test set, based on the latest multiple alignments in the BAliBASE benchmark suite [23]
Summary
A standard procedure in many areas of bioinformatics is to use a multiple sequence alignment (MSA) as the basis for various types of homology-based inference. Applications include 3D structure modelling, protein functional annotation, prediction of molecular interactions, etc. These applications, however sophisticated, are generally highly sensitive to the alignment used, and neglecting non-homologous or uncertain regions in the alignment can lead to significant bias in the subsequent inferences. Multiple alignments of protein sequences are a fundamental tool in many areas of molecular biology, including phylogenetic studies, prediction of 2D/3D structure, and propagation of structural/functional information from annotated to non-annotated sequences. All of these applications rely on the identification of the conserved regions in the alignments, which suggest potential homologous relations between the sequences. Numerous column scores have been defined that attempt to distinguish the positions that are conserved in all the sequences from the unreliable positions, e.g. [4, 5].
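To make the idea of a column score concrete, the sketch below computes a simple entropy-based conservation value for each column of a toy alignment. This is an illustration only: it is not the Bayesian procedure used by LEON-BIS, nor necessarily one of the scores cited in [4, 5]. The function names (`column_conservation`, `score_alignment`), the gap penalty, and the toy sequences are all assumptions made for this example.

```python
import math
from collections import Counter


def column_conservation(column, gap="-"):
    """Return a conservation score in [0, 1] for one alignment column.

    Assumption: the score is 1 minus the Shannon entropy of the residue
    frequencies (normalised by log2(20) for the 20 standard amino acids),
    scaled by the fraction of non-gap characters in the column.
    """
    residues = [c for c in column if c != gap]
    if not residues:
        return 0.0  # a column made only of gaps carries no signal
    total = len(residues)
    counts = Counter(residues)
    entropy = -sum((n / total) * math.log2(n / total) for n in counts.values())
    conservation = max(0.0, 1.0 - entropy / math.log2(20))
    # Hypothetical gap penalty: down-weight columns with many gaps.
    return conservation * (total / len(column))


def score_alignment(sequences):
    """Score every column of an alignment given as equal-length strings."""
    if len({len(s) for s in sequences}) != 1:
        raise ValueError("all aligned sequences must have the same length")
    # zip(*sequences) iterates over the alignment column by column.
    return [column_conservation(col) for col in zip(*sequences)]


# Toy alignment (made up for illustration, not taken from BAliBASE).
toy_msa = ["MKVLA", "MKILA", "MRV-A", "MKVLG"]
scores = score_alignment(toy_msa)
print([round(s, 3) for s in scores])
```

A score of this kind treats every sequence equally and looks at one column at a time, so it can only flag positions conserved across the whole alignment. It has no notion of sub-families or of segments conserved within only part of the family, which are the levels the method described here is designed to address.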