Abstract

This study identifies potential biomarkers for osteoarthritis (OA) using publicly available proteome data. A systematic search of PubMed identified articles describing proteins commonly found in cartilage extracellular matrix (ECM). Manual review of these articles yielded 12 cartilage ECM proteins: COL2A1, ACAN, COL1A1, COL1A2, COL3A1, FN1, COMP, COL9A1, COL10A1, COL11A1, COL11A2, and COL14A1. The STRING database (Search Tool for the Retrieval of Interacting Genes/Proteins) identified ~1,000 “interacting proteins,” defined as proteins that interact with at least one of these cartilage ECM proteins. Each interacting protein was scored by the number of cartilage ECM proteins with which it interacts (range 1-11, mean 2.48, standard deviation 2.44). The set of interacting proteins was then compared with the Steinberg (PMID: 28827734) OA proteomics dataset, which identifies differentially abundant proteins by quantifying changes in protein levels in osteoarthritic joints compared to intact joints. Nearly 700 proteins were both interacting and differentially abundant. This set was analyzed with an unsupervised machine learning algorithm (k-means cluster analysis), which groups proteins into a specified number of clusters based on similarity of the input variables. The most important criterion for choosing the number of clusters was domain knowledge of biological mechanisms; two statistical criteria, the silhouette method and the gap statistic, were also used (Figure 1). This selection process identified 8 as the most appropriate value for k.
Accordingly, the k-means cluster analysis grouped the proteins into 8 clusters based on (1) interaction score (the number of interactions determined by STRING) and (2) differential abundance (quantification of relative protein levels in osteoarthritic versus intact joints). Figure 2 shows that the cluster composed of proteins that interacted with 4 of the cartilage ECM proteins had the highest differential abundance score. The cluster of 11 proteins that scored highest on both dimensions was selected for further analysis. This cluster comprises COL1A1, COL1A2, MMP2, COL5A1, SPARC, LAMB1, BGLAP, POSTN, COL8A1, COL23A1, and P4HA3. To ascertain whether these 11 proteins are detectable in serum, the Human Plasma PeptideAtlas public serum proteome dataset was used. POSTN, COL1A1, COL1A2, and LAMB1 were present at high concentrations and may be promising, detectable biomarkers. SPARC, COL5A1, MMP2, COL8A1, P4HA3, COL23A1, and BGLAP are also present in serum, at lower concentrations, and may be reasonable biomarkers. The results for MMP2, POSTN, COL1A1, and COL1A2 are consistent with those of Zhang (PMID: 33064574), who used machine learning to identify candidate biomarker genes from transcriptomic data. This study identifies multiple known and potentially novel serum-based biomarkers for OA.
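
The scoring and clustering steps described above can be illustrated with a minimal sketch. It is not the authors' code: the function names (interaction_scores, kmeans, silhouette), the toy data points, and the candidate range for k are all hypothetical, and only the silhouette criterion is shown (the gap statistic and the domain-knowledge review are omitted). The abstract does not say whether the two input variables were rescaled before clustering, so the toy values are simply chosen on comparable scales.

```python
import math
import random


def interaction_scores(edges, ecm_proteins):
    """For each protein, count the distinct ECM proteins it interacts with."""
    partners = {}
    for a, b in edges:
        # An edge is undirected, so check both orientations.
        for ecm, other in ((a, b), (b, a)):
            if ecm in ecm_proteins and other != ecm:
                partners.setdefault(other, set()).add(ecm)
    return {protein: len(found) for protein, found in partners.items()}


def kmeans(points, k, seed=0, iters=100):
    """Plain Lloyd's algorithm; returns (labels, centers)."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # assumption: random initial centers
    for _ in range(iters):
        # Assignment step: nearest center by Euclidean distance.
        labels = [min(range(k), key=lambda c: math.dist(p, centers[c]))
                  for p in points]
        # Update step: move each center to the mean of its members.
        new_centers = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                new_centers.append(
                    tuple(sum(axis) / len(members) for axis in zip(*members)))
            else:
                new_centers.append(centers[c])  # keep an empty cluster's center
        if new_centers == centers:
            break
        centers = new_centers
    return labels, centers


def silhouette(points, labels):
    """Mean silhouette width; points in singleton clusters score 0."""
    widths = []
    for i, p in enumerate(points):
        dists = {}
        for j, q in enumerate(points):
            if i != j:
                dists.setdefault(labels[j], []).append(math.dist(p, q))
        own = dists.pop(labels[i], None)
        if not own or not dists:
            widths.append(0.0)
            continue
        a = sum(own) / len(own)                            # within-cluster
        b = min(sum(d) / len(d) for d in dists.values())   # nearest other cluster
        widths.append((b - a) / max(a, b) if max(a, b) > 0 else 0.0)
    return sum(widths) / len(widths)


# Hypothetical toy data: (interaction score, differential abundance score).
points = [(1, 0.2), (1, 0.3), (2, 0.25),
          (4, 2.0), (4, 2.2), (5, 2.1),
          (9, 0.9), (10, 1.0), (10, 1.1)]

# Pick the candidate k with the largest mean silhouette width.
best_k = max(range(2, 5), key=lambda k: silhouette(points, kmeans(points, k)[0]))
print("k with the highest silhouette on the toy data:", best_k)
```

In the study itself the statistical criteria were secondary to biological domain knowledge, so a value chosen this way would be a starting point for review rather than a final answer.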
