Abstract

Sequence comparison is an essential part of modern molecular biology research. In this study, we estimated the parameters of a Markov chain for each unaligned protein sequence from the frequencies of occurrence of all possible amino acid pairs. These estimated Markov chain parameters were then used to calculate the similarity between two protein sequences with a fuzzy integral algorithm. For validation, our results were compared with those of both alignment-based (ClustalW) and alignment-free methods on six benchmark datasets. The results indicate that the developed algorithm achieves better clustering performance for protein sequence comparison.
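
As an illustration of the parameter estimation step, the following is a minimal sketch, not the authors' implementation. It assumes a first-order Markov chain over the 20 standard amino acids, counts adjacent residue pairs, and normalizes each row of counts into transition probabilities. The function name `transition_matrix` and the handling of non-standard residues (skipped) are assumptions made for this example. The fuzzy integral similarity step is not shown.

```python
from collections import Counter

# The 20 standard amino acids (one-letter codes); assumed alphabet for this sketch.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def transition_matrix(seq):
    """Estimate first-order Markov transition probabilities from a protein sequence.

    Returns a dict mapping (a, b) to the estimated probability that residue b
    follows residue a, based on counts of adjacent pairs (an assumption here).
    """
    seq = seq.upper()
    # Count adjacent residue pairs, skipping pairs with non-standard symbols.
    pair_counts = Counter(
        (a, b)
        for a, b in zip(seq, seq[1:])
        if a in AMINO_ACIDS and b in AMINO_ACIDS
    )
    matrix = {}
    for a in AMINO_ACIDS:
        row_total = sum(pair_counts[(a, b)] for b in AMINO_ACIDS)
        for b in AMINO_ACIDS:
            # Rows with no observations are left as all zeros.
            matrix[(a, b)] = pair_counts[(a, b)] / row_total if row_total else 0.0
    return matrix


if __name__ == "__main__":
    probs = transition_matrix("ACACAD")
    print(probs[("A", "C")], probs[("A", "D")], probs[("C", "A")])
```

For the toy sequence `ACACAD`, residue A is followed by C twice and by D once, so the estimated probability of C following A is 2/3. Each sequence yields 400 parameters, one per amino acid pair, regardless of its length.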

Highlights

  • With the advent of advanced sequencing techniques, researchers are generating a large number of protein sequences

  • To validate our algorithm, we applied it to NADH Dehydrogenase-5 protein sequences, NADH Dehydrogenase-6 protein sequences, xylanase protein sequences in the F10 and G11 datasets, transferrin protein sequences, coronavirus spike protein sequences and beta-globin protein sequences

  • The six benchmark datasets used in this study are as follows: (i) NADH Dehydrogenase 5 (ND 5) protein sequences; (ii) NADH Dehydrogenase 6 (ND 6) protein sequences; (iii) xylanase protein sequences in the F10 and G11 datasets; (iv) transferrin protein sequences; (v) coronavirus spike protein sequences; (vi) beta-globin protein sequences

Summary

Introduction

With the advent of advanced sequencing techniques, researchers are generating a large number of protein sequences. Methods based on graphical representation, distance frequency matrices, numerical characterization, K-string dictionaries, etc., have been introduced to overcome the complications of sequence alignment. The K-string dictionary[17] approach permits users to represent a protein sequence with a much lower dimensional frequency or probability vector, which significantly reduces the space required for implementation. After the lower dimensional frequency vectors are obtained, Singular Value Decomposition (SVD) is used to get a better protein vector representation, which helps the user obtain a precise phylogenetic tree. These methods nevertheless lag behind in terms of accuracy. The main purpose of this study is to compare the performance of alignment-based and alignment-free protein clustering methods and to identify their strengths and weaknesses from the practical perspective of users.
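
To make the dimensionality argument for K-string frequency vectors concrete, here is a rough sketch under stated assumptions. It is not the method of reference [17]. It builds a dictionary from only those k-strings that actually occur in the input sequences and represents each sequence as a relative-frequency vector over that dictionary. The function name `kstring_vectors` and the default `k=3` are hypothetical choices for illustration, and the SVD step is omitted.

```python
from collections import Counter


def kstring_vectors(sequences, k=3):
    """Build relative-frequency vectors over the k-strings observed in `sequences`.

    Returns (dictionary, vectors), where `dictionary` is the sorted list of
    observed k-strings and `vectors[i]` is the frequency vector of sequences[i].
    """
    # Count overlapping k-strings in each sequence.
    counts = [
        Counter(s[i:i + k] for i in range(len(s) - k + 1))
        for s in sequences
    ]
    # Only k-strings that occur at least once become vector dimensions.
    dictionary = sorted(set().union(*counts))
    vectors = []
    for c in counts:
        total = sum(c.values())
        vectors.append([c[w] / total if total else 0.0 for w in dictionary])
    return dictionary, vectors


if __name__ == "__main__":
    words, vecs = kstring_vectors(["AAAC", "AACC"], k=2)
    print(words)
    print(vecs)
```

With 20 amino acids there are 20^3 = 8000 possible 3-strings, so restricting the dictionary to strings that actually occur can give a much shorter vector when the sequences are short or few.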
