Abstract
BackgroundSequence comparisons make use of a one-letter representation for amino acids, the necessary quantitative information being supplied by the substitution matrices. This paper deals with the problem of finding a representation that provides a comprehensive description of amino acid intrinsic properties consistent with the substitution matrices.ResultsWe present a Euclidian vector representation of the amino acids, obtained by the singular value decomposition of the substitution matrices. The substitution matrix entries correspond to the dot product of amino acid vectors. We apply this vector encoding to the study of the relative importance of various amino acid physicochemical properties upon the substitution matrices. We also characterize and compare the PAM and BLOSUM series substitution matrices.ConclusionsThis vector encoding introduces a Euclidian metric in the amino acid space, consistent with substitution matrices. Such a numerical description of the amino acid is useful when intrinsic properties of amino acids are necessary, for instance, building sequence profiles or finding consensus sequences, using machine learning algorithms such as Support Vector Machine and Neural Networks algorithms.
Highlights
Sequence comparisons make use of a one-letter representation for amino acids, the necessary quantitative information being supplied by the substitution matrices
The entries of the dot products matrix S of the “raw” non-centered vectors (Eq 3) systematically overestimate the corresponding entries of the original matrix. This difficulty increases with the matrix index, i.e., with decreasing evolutionary distance, as indicated by the quality index (Eq 6) that decreases from 93.6% for BLOSUM30, through 75.7% for
Using singular value decomposition (SVD), we were able to obtain from the substitution matrices a Euclidian vector for each amino acid
Summary
Sequence comparisons make use of a one-letter representation for amino acids, the necessary quantitative information being supplied by the substitution matrices. Methods for analyzing protein sequences rest on the underlying amino acid representation. For many purposes, such as sequence comparisons, amino acids are represented by a one-letter code and their similarity is “summed up” in substitution (scoring) matrices. Elements of these matrices represents the score of substituting an amino acid by another one in homologous proteins. Protein substitution matrices play a central role in sequence comparisons They permit to align and compare quantitatively any two protein sequences, but they do not provide a description of the individual amino acids themselves
Talk to us
Join us for a 30 min session where you can share your feedback and ask us any queries you have
Disclaimer: All third-party content on this website/platform is and will remain the property of their respective owners and is provided on "as is" basis without any warranties, express or implied. Use of third-party content does not indicate any affiliation, sponsorship with or endorsement by them. Any references to third-party content is to identify the corresponding services and shall be considered fair use under The CopyrightLaw.