Protein sequence similarity analysis (PSSA) is an important task in bioinformatics: it can reveal information about unknown sequences, such as protein structure and homology relationships. A protein sequence is the series of amino acids, each with rich physical and chemical properties, that forms the primary structure of a protein. However, similarity analysis and phylogenetic analysis across species with complex amino acid sequences remain challenging. In this paper, nine properties of amino acids are considered, and each sequence is converted into numerical values by principal component analysis (PCA). A new feature vector representing the sequence is then constructed using the Haar wavelet transform and the Higuchi fractal dimension (HFD). The Spearman distance is used to compute the distance matrix, from which the phylogenetic tree is constructed. Two representative protein datasets, 9 ND5 (NADH dehydrogenase 5) sequences and 8 ND6 (NADH dehydrogenase 6) sequences, were selected for similarity analysis and phylogenetic analysis, and the results were compared with those of MEGA software and other existing methods. The results show that our method performs better than the compared methods and is consistent with known facts.
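The pipeline summarized above (PCA encoding, Haar wavelet transform, HFD features, Spearman distance, tree construction) can be illustrated with a minimal Python sketch. It is not the authors' implementation, and several details are assumptions because the abstract does not specify them: the property table is a random placeholder rather than real physicochemical data, each amino acid is encoded by its first principal component score, the feature vector collects the HFD of the signal and of the Haar approximation and detail coefficients over three levels, the Spearman distance is taken as one minus the rank correlation, and the tree is built by average linkage (UPGMA). All function names are hypothetical.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage
from scipy.spatial.distance import squareform
from scipy.stats import spearmanr

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# ASSUMPTION: placeholder 20 x 9 table of random numbers, NOT real
# physicochemical data. Substitute the nine measured properties here.
PROPERTY_TABLE = np.random.default_rng(0).normal(size=(20, 9))


def first_pc_scores(table):
    """Project each row (one amino acid) onto the first principal component."""
    centered = table - table.mean(axis=0)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[0]


SCORE = dict(zip(AMINO_ACIDS, first_pc_scores(PROPERTY_TABLE)))


def encode(seq):
    """Map a protein string to a numeric signal; unknown residues are skipped."""
    return np.array([SCORE[a] for a in seq if a in SCORE], dtype=float)


def haar_step(x):
    """One level of the orthonormal Haar transform: (approximation, detail)."""
    if len(x) % 2:  # pad odd-length signals by repeating the last sample
        x = np.append(x, x[-1])
    pairs = x.reshape(-1, 2)
    approx = (pairs[:, 0] + pairs[:, 1]) / np.sqrt(2)
    detail = (pairs[:, 0] - pairs[:, 1]) / np.sqrt(2)
    return approx, detail


def higuchi_fd(x, kmax=8):
    """Higuchi fractal dimension: slope of log L(k) against log(1/k)."""
    n = len(x)
    ks = np.arange(1, kmax + 1)
    mean_lengths = []
    for k in ks:
        lengths = []
        for m in range(k):
            idx = np.arange(m, n, k)  # subsampled series x[m], x[m+k], ...
            if len(idx) < 2:
                continue
            total = np.abs(np.diff(x[idx])).sum()
            norm = (n - 1) / ((len(idx) - 1) * k)  # Higuchi normalization
            lengths.append(total * norm / k)
        mean_lengths.append(np.mean(lengths))
    slope, _ = np.polyfit(np.log(1.0 / ks), np.log(mean_lengths), 1)
    return slope


def feature_vector(seq, levels=3):
    """ASSUMED layout: HFD of the signal, then of each Haar approx/detail band."""
    signal = encode(seq)
    feats = [higuchi_fd(signal)]
    for _ in range(levels):
        signal, detail = haar_step(signal)
        feats += [higuchi_fd(signal), higuchi_fd(detail)]
    return np.array(feats)


def spearman_distance_matrix(seqs):
    """Pairwise distance 1 - rho, where rho is the Spearman rank correlation."""
    feats = [feature_vector(s) for s in seqs]
    n = len(feats)
    dist = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            rho, _ = spearmanr(feats[i], feats[j])
            dist[i, j] = dist[j, i] = 1.0 - rho
    return dist


def build_tree(dist):
    """ASSUMPTION: average linkage (UPGMA) on the condensed distance matrix."""
    return linkage(squareform(dist), method="average")
```

In use, `spearman_distance_matrix` would receive the list of ND5 or ND6 sequences as strings, and the linkage matrix returned by `build_tree` could be drawn with `scipy.cluster.hierarchy.dendrogram`. Sequences need to be long enough (roughly a hundred residues or more with these defaults) for the HFD estimate at the coarsest wavelet level to be meaningful, and a constant signal would make the logarithm in `higuchi_fd` undefined.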