Abstract
Identifying the family of a protein helps predict its function. It is known from the literature that longer sequences of base pairs or amino acids are required to study patterns in biological sequences. Because most protein sequences are relatively short, we randomly concatenate (link) protein sequences from the same family or superfamily to form longer protein sequences. The 6-letter model, the 12-letter model, the 20-letter model, the revised Schneider and Wrede hydrophobicity scale, solvent accessibility and stochastic standard state accessibility are used to convert the linked protein sequences into numerical sequences. Multifractal and wavelet analyses are then performed on these numerical sequences. The parameters obtained from these analyses are used to construct parameter spaces in which each linked protein is represented by a point. The four classes of proteins, namely the α, β, α + β and α/β classes, are then distinguished in these parameter spaces, and the Fisher linear discriminant algorithm is used to assess the discriminant accuracy. Numerical results indicate that the discriminant accuracies are satisfactory in separating these classes. We also find that linked proteins from the same family or superfamily tend to group together and can be separated from other linked proteins. The methods can therefore help identify the family of an unknown protein.
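To illustrate the final step, the following is a minimal sketch of a two-class Fisher linear discriminant applied to points in a two-dimensional parameter space. It is not the authors' implementation: the abstract does not say how the four classes are paired or which parameters span each space, so the two-class setting, the function names, and the sample points (`class_a`, `class_b`) are all assumptions made for illustration. The points are invented placeholders, not values from the study.

```python
# Two-class Fisher linear discriminant in a 2-D parameter space.
# All data below are made-up placeholders, not results from the paper.

def dot(u, v):
    return u[0] * v[0] + u[1] * v[1]

def mean(points):
    n = len(points)
    return [sum(p[i] for p in points) / n for i in range(2)]

def scatter(points, m):
    # 2x2 scatter matrix: sum over points of (x - m)(x - m)^T
    s = [[0.0, 0.0], [0.0, 0.0]]
    for p in points:
        d = [p[0] - m[0], p[1] - m[1]]
        for i in range(2):
            for j in range(2):
                s[i][j] += d[i] * d[j]
    return s

def fisher_direction(class_a, class_b):
    """Return (w, threshold) with w = Sw^{-1} (mean_a - mean_b)."""
    ma, mb = mean(class_a), mean(class_b)
    sa, sb = scatter(class_a, ma), scatter(class_b, mb)
    # Within-class scatter matrix Sw = Sa + Sb
    sw = [[sa[i][j] + sb[i][j] for j in range(2)] for i in range(2)]
    det = sw[0][0] * sw[1][1] - sw[0][1] * sw[1][0]
    if det == 0:
        raise ValueError("within-class scatter matrix is singular")
    inv = [[sw[1][1] / det, -sw[0][1] / det],
           [-sw[1][0] / det, sw[0][0] / det]]
    diff = [ma[0] - mb[0], ma[1] - mb[1]]
    w = [dot(inv[0], diff), dot(inv[1], diff)]
    # Assumed decision rule: midpoint of the two projected class means
    threshold = 0.5 * (dot(w, ma) + dot(w, mb))
    return w, threshold

def accuracy(class_a, class_b, w, threshold):
    # Fraction of points that fall on the correct side of the threshold
    correct = sum(dot(w, p) > threshold for p in class_a)
    correct += sum(dot(w, p) <= threshold for p in class_b)
    return correct / (len(class_a) + len(class_b))

# Hypothetical points: each tuple stands for one linked protein
class_a = [(1.0, 2.0), (1.5, 1.8), (2.0, 2.5), (1.2, 2.2)]
class_b = [(4.0, 0.5), (4.5, 1.0), (5.0, 0.8), (4.2, 0.3)]

w, threshold = fisher_direction(class_a, class_b)
print("discriminant accuracy:", accuracy(class_a, class_b, w, threshold))
```

The accuracy here is computed on the same points used to fit the direction, which is the simplest possible reading of "discriminant accuracy"; whether the study used resubstitution or a held-out evaluation is not stated in the abstract.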