Abstract

Putative protein sequences decoded from messenger ribonucleic acid (mRNA) sequences are composed of twenty amino acids with different physicochemical properties, such as hydrophobicity and hydrophilicity (uncharged, positively charged, or negatively charged amino acids). In this paper, the power spectral estimate (PSE) technique for random processes is applied to the protein sequence matching framework. First, the twenty amino acids are classified according to their hydrophobicity and hydrophilicity. Then each amino acid in the protein sequence is mapped to a corresponding complex value. By considering various hidden Markov chain orders in the complex-valued sequences, the PSE method can explore the implicit statistical relations among protein sequences. The mean squared error between the power spectra of two sequences is then computed and used to measure their similarity. The experimental results verify that the proposed PSE method provides similarity measurements consistent with the well-known ClustalW and BLASTp schemes. Moreover, the proposed PSE method can show better similarity relevance than the ClustalW and BLASTp schemes.
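The pipeline described in the abstract (classify, map to complex values, estimate power spectra, compare by mean squared error) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the four-class grouping of amino acids, the unit-circle complex values, the autoregressive (Yule-Walker) estimator standing in for the parametric spectral estimate, the model order of 4, and all function names are assumptions made for this example.

```python
import numpy as np

# Assumed grouping (a common textbook classification; the paper's may differ).
AA_CLASSES = {
    "AVLIMFWPG": 1 + 0j,   # hydrophobic
    "STCYNQ": 0 + 1j,      # hydrophilic, uncharged
    "KRH": -1 + 0j,        # hydrophilic, positively charged
    "DE": 0 - 1j,          # hydrophilic, negatively charged
}
AA_TO_COMPLEX = {aa: v for group, v in AA_CLASSES.items() for aa in group}


def encode(seq):
    """Map a protein string to a complex-valued sequence (assumed mapping).

    Raises KeyError for letters outside the twenty standard amino acids.
    """
    return np.array([AA_TO_COMPLEX[a] for a in seq.upper()], dtype=complex)


def ar_power_spectrum(x, order=4, n_freq=128):
    """Parametric power spectrum from an AR(order) model fitted by Yule-Walker."""
    if len(x) <= order:
        raise ValueError("sequence must be longer than the model order")
    x = x - x.mean()
    n = len(x)
    # Biased autocorrelation estimate r[m] = (1/n) * sum x[k+m] * conj(x[k]).
    r = np.array([np.vdot(x[:n - m], x[m:]) / n for m in range(order + 1)])
    # Hermitian Toeplitz matrix with entries r[i - j], where r[-m] = conj(r[m]).
    R = np.empty((order, order), dtype=complex)
    for i in range(order):
        for j in range(order):
            R[i, j] = r[i - j] if i >= j else np.conj(r[j - i])
    a = np.linalg.solve(R, -r[1:])                       # AR coefficients
    sigma2 = (r[0] + np.dot(a, np.conj(r[1:]))).real     # driving-noise variance
    freqs = np.arange(n_freq) / n_freq                   # normalized frequencies
    k = np.arange(1, order + 1)
    denom = 1 + np.exp(-2j * np.pi * np.outer(freqs, k)) @ a
    return sigma2 / np.abs(denom) ** 2


def spectral_mse(seq_a, seq_b, order=4):
    """Mean squared error between the AR power spectra of two proteins."""
    pa = ar_power_spectrum(encode(seq_a), order)
    pb = ar_power_spectrum(encode(seq_b), order)
    return float(np.mean((pa - pb) ** 2))
```

Under these assumptions, a smaller `spectral_mse` value indicates more similar spectra, and comparing a sequence with itself gives zero. Varying `order` is one way to examine the effect of the model order that the abstract mentions.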

Highlights

  • We propose a new tool for protein sequence comparison that uses parametric spectral estimation from stochastic processes to analyze protein sequences

  • The experimental results show that the proposed method achieves comparison results consistent with the well-known ClustalW and BLASTp

Summary

Introduction

In the past two decades, deoxyribonucleic acid (DNA) and protein sequences from various organisms have been obtained in massive quantities with the help of high-throughput sequencing technologies [1]. Biologists unravel the functionality and capability of numerous protein sequence domains by understanding their 3-D structures, obtained by X-ray diffraction or NMR technology. These procedures require laborious preparation of protein crystals and are extremely time-consuming. Two types of methods are commonly used to analyze protein sequences and predict their functions: (1) statistical methods [2], which apply well-known mathematical models from stochastic processes to analyze the sequences; and (2) geometrical methods [3], which use graphs to represent the sequences and analyze them. Both types of methods first transform the symbolic amino acids into numerical values. High similarity between two sequences may imply either of two things: (1) the two sequences could be homologous; (2) the protein structures and/or their biological functions are similar.
