Abstract
Background
Protein sequence alignment analyses have become a crucial step for many bioinformatics studies during the past decades. Multiple sequence alignment (MSA) and pair-wise sequence alignment (PSA) are the two major approaches to sequence alignment. Earlier benchmark studies revealed drawbacks of MSA methods on nucleotide sequence alignments. To test whether similar drawbacks also influence protein sequence alignment analyses, we propose a new benchmark framework for protein clustering based on cluster validity. This framework directly reflects the biological ground truth of the application scenarios that adopt sequence alignments. It evaluates alignment quality according to how well the biological goal is achieved, rather than by comparison at the sequence level only, which averts the biases introduced by alignment scores or manual alignment templates. Unlike earlier studies, we calculate the cluster validity score from sequence distances instead of clustering results. This strategy avoids the influence of different clustering methods and thus makes the results more dependable.
Results
PSA methods performed better than MSA methods on most of the BAliBASE benchmark datasets. Analyses of the 80 re-sampled benchmark datasets, constructed by randomly choosing 90% of each dataset 10 times, showed similar results.
Conclusions
These results validate that the drawbacks of MSA methods revealed at the nucleotide level also exist in protein sequence alignment analyses and affect the accuracy of results.
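To make the distance-based scoring idea concrete, the following is a minimal sketch, not the authors' implementation. It computes the standard mean silhouette width directly from a pairwise distance matrix and reference group labels, so no clustering step is involved, and then repeats the calculation on random 90% subsets. The function names, the toy distance matrix, the fixed seed, and the convention that a sequence alone in its group scores 0 are all assumptions made for illustration; the abstract does not specify these details.

```python
import random
from collections import defaultdict


def silhouette_width(dist, labels):
    """Mean silhouette width from a pairwise distance matrix and reference labels."""
    groups = defaultdict(list)
    for idx, lab in enumerate(labels):
        groups[lab].append(idx)
    if len(groups) < 2:
        raise ValueError("need at least two reference groups")

    scores = []
    for i, lab in enumerate(labels):
        own = [j for j in groups[lab] if j != i]
        if not own:
            scores.append(0.0)  # assumed convention for a singleton group
            continue
        # a: mean distance to the other members of the same group
        a = sum(dist[i][j] for j in own) / len(own)
        # b: smallest mean distance to the members of any other group
        b = min(
            sum(dist[i][j] for j in members) / len(members)
            for other, members in groups.items()
            if other != lab
        )
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return sum(scores) / len(scores)


def resampled_scores(dist, labels, fraction=0.9, repeats=10, seed=0):
    """Score random subsets of the sequences (hypothetical helper)."""
    rng = random.Random(seed)
    n = len(labels)
    k = max(2, round(fraction * n))
    results = []
    for _ in range(repeats):
        keep = sorted(rng.sample(range(n), k))
        sub_dist = [[dist[i][j] for j in keep] for i in keep]
        sub_labels = [labels[i] for i in keep]
        results.append(silhouette_width(sub_dist, sub_labels))
    return results


# Toy data: four "sequences" at positions 0, 1, 10, 11 on a line,
# standing in for alignment-derived distances between two families.
positions = [0, 1, 10, 11]
toy_dist = [[abs(p - q) for q in positions] for p in positions]
toy_labels = ["famA", "famA", "famB", "famB"]

sw = silhouette_width(toy_dist, toy_labels)
subset_sw = resampled_scores(toy_dist, toy_labels)
```

In this sketch, a higher score means the distances produced by an alignment method separate the reference groups more cleanly. Note that a random subset can end up with fewer than two groups, in which case `silhouette_width` raises an error; how such cases should be handled is left open here.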
Highlights
Protein sequence alignment analyses have become a crucial step for many bioinformatics studies during the past decades
Results showed that (1) Esprit achieved the highest scores on all datasets based on the Silhouette Width (SW) calculation; and (2) both Esprit and MUSCLE achieved high scores based on the RS calculation, with Esprit performing slightly better than MUSCLE overall
Results showed that, based on SW scores, Esprit performed better than the other multiple sequence alignment (MSA) methods used in this study on both RV11 and RV12, with SW scores of 0.008933 and 0.107577, respectively (see Fig. 2(a) for details)
Summary
Protein sequence alignment analyses have become a crucial step for many bioinformatics studies during the past decades. To test whether the drawbacks of MSA methods previously revealed on nucleotide sequence alignments also influence protein sequence alignment analyses, we propose a new benchmark framework for protein clustering based on cluster validity. This framework directly reflects the biological ground truth of the application scenarios that adopt sequence alignments. It evaluates alignment quality according to how well the biological goal is achieved, rather than by comparison at the sequence level only, which averts the biases introduced by alignment scores or manual alignment templates.