Abstract
3D protein structures can be analyzed using a distance matrix calculated as the pairwise distances between all Cα atoms in the protein model. Although researchers have used distance matrices effectively to classify proteins and find homologous proteins, much less work has been done on quantitative analysis of distance matrix features. Therefore, in this study the distance matrix was analyzed as a grayscale image using the KAZE feature extraction algorithm with a Bag of Visual Words model, and each protein was represented as a histogram of visual codewords. The analysis showed that a very small number of codewords (~1%) have a high relative frequency (> 0.25) and that the majority of codewords have a relative frequency around 0.05. We have also shown that there is a relationship between the frequency of codewords and the position of the features in the distance matrix: more frequent codewords are located closer to the main diagonal, whereas less frequent codewords are located in the corners of the distance matrix, far from the main diagonal. Moreover, the analysis showed a correlation between the number of unique codewords and the 3D repeats in the protein structure. Solenoid and tandem repeat proteins have a significantly lower number of unique codewords than globular proteins. Finally, the codeword histograms and a Support Vector Machine (SVM) classifier were used to classify solenoid and globular proteins. The SVM classifier fed with codeword histograms correctly classified 352 out of 354 proteins.
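To make the histogram representation concrete, the following is a minimal sketch of how feature descriptors might be turned into a codeword histogram. It assumes a codebook is already available (for example from clustering KAZE descriptors) and assigns each descriptor to its nearest codeword by Euclidean distance. The function names, the toy 2D descriptors, and the definition of the unique-codeword ratio (distinct codewords divided by the number of features) are illustrative assumptions, not the authors' implementation.

```python
import math
from collections import Counter


def assign_codewords(descriptors, codebook):
    """Map each descriptor to the index of its nearest codeword (Euclidean distance)."""
    return [
        min(range(len(codebook)), key=lambda k: math.dist(d, codebook[k]))
        for d in descriptors
    ]


def codeword_histogram(assignments, n_codewords):
    """Relative frequency of every codeword among the assigned descriptors."""
    counts = Counter(assignments)
    total = len(assignments)
    return [counts.get(k, 0) / total for k in range(n_codewords)]


def unique_codeword_ratio(assignments):
    """Assumed definition: distinct codewords used divided by number of features."""
    return len(set(assignments)) / len(assignments)


# Toy data: a hypothetical 4-word codebook and 4 descriptors in 2D.
# Real KAZE descriptors are much higher-dimensional.
codebook = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
descriptors = [(0.1, 0.0), (0.0, 0.2), (0.9, 0.1), (0.1, 0.1)]

assignments = assign_codewords(descriptors, codebook)
histogram = codeword_histogram(assignments, len(codebook))
```

A histogram like this, one per protein, is the kind of fixed-length vector that a classifier such as an SVM can take as input.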
Highlights
The analysis of protein structures using the distance matrix of Cα atoms has a long history in structural biology
The protein distance matrix contains the distances between residues, which can be represented as a grayscale image, where the distances between pairs of Cα-atoms are represented by intensity
A weaker correlation is observed between domain size and the ratio of unique words in the repeat protein structures (R = -0.63) than in the domains that are not part of the RepeatsDB database (R = -0.80); see S2 Fig. Overall, these results suggest that the ratio of unique words in solenoid and tandem repeat proteins is shifted towards lower values
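The grayscale representation mentioned in the highlights can be sketched as follows. This is an illustration under stated assumptions: distances are scaled linearly to 0..255 with larger distances mapped to brighter pixels, and the three Cα coordinates are made-up values. The source does not specify the scaling convention the authors used.

```python
import math


def distance_matrix(ca_coords):
    """Pairwise Euclidean distances between all C-alpha coordinates."""
    return [[math.dist(a, b) for b in ca_coords] for a in ca_coords]


def to_grayscale(matrix):
    """Linearly scale distances to integer intensities in 0..255.

    Assumed convention: the largest distance maps to 255 (brightest).
    """
    d_max = max(max(row) for row in matrix)
    if d_max == 0:
        return [[0 for _ in row] for row in matrix]
    return [[round(255 * d / d_max) for d in row] for row in matrix]


# Hypothetical coordinates (in angstroms) for three consecutive residues.
ca_coords = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (3.8, 3.8, 0.0)]
image = to_grayscale(distance_matrix(ca_coords))
```

The resulting image is symmetric with a zero main diagonal, so pixels near the diagonal correspond to residues that are close in sequence.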
Summary
The analysis of protein structures using the distance matrix of Cα atoms has a long history in structural biology. The protein distance matrix has been used for structural alignment, protein classification, and finding homologous proteins [1, 2]. Tremendous progress has been made in predicting 3D protein structures from distance matrices using artificial intelligence [3, 4, 5, 6]. Various studies have shown that representing protein structure in 2D space has two main advantages: it captures local, short-, medium-, and long-range contacts between Cα atoms simultaneously, and it is rotation and translation invariant [7]. The protein distance matrix contains the distances between residues, which can be represented as a grayscale image, where the distances between pairs of Cα atoms are represented by intensity. Feature extraction can be applied to obtain points of interest
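The rotation and translation invariance noted above can be checked numerically. The sketch below applies a rigid transformation (a rotation about the z-axis followed by a shift) to a set of made-up Cα coordinates and compares the distance matrices before and after. The coordinates, angle, and shift are arbitrary illustrative values.

```python
import math


def distance_matrix(coords):
    """Pairwise Euclidean distances between all points."""
    return [[math.dist(a, b) for b in coords] for a in coords]


def rotate_z_and_translate(coords, angle, shift):
    """Rotate points about the z-axis by `angle` (radians), then translate by `shift`."""
    c, s = math.cos(angle), math.sin(angle)
    return [
        (c * x - s * y + shift[0], s * x + c * y + shift[1], z + shift[2])
        for x, y, z in coords
    ]


# Hypothetical C-alpha coordinates in angstroms.
coords = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (5.1, 3.5, 0.4), (8.7, 4.2, 1.9)]
moved = rotate_z_and_translate(coords, angle=math.radians(40), shift=(10.0, -5.0, 2.5))

original = distance_matrix(coords)
transformed = distance_matrix(moved)

# Largest element-wise difference between the two matrices;
# it should be zero up to floating-point rounding.
max_diff = max(
    abs(a - b)
    for row_o, row_t in zip(original, transformed)
    for a, b in zip(row_o, row_t)
)
```

Because only inter-atomic distances enter the matrix, the same structure yields the same image regardless of how the model is oriented in space.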