Abstract

Background: Computational identification of blood-secretory proteins, especially proteins with differentially expressed genes in diseased tissues, can provide highly useful information in linking transcriptomic data to proteomic studies for targeted disease biomarker discovery in serum.

Results: A new algorithm for prediction of blood-secretory proteins is presented using an information-retrieval technique called manifold ranking. On a dataset containing 305 known blood-secretory human proteins and a large number of other proteins that are either not blood-secretory or unknown, the new method performs better than the previously published method, measured in terms of the area under the recall-precision curve (AUC). A key advantage of the presented method is that it does not explicitly require a negative training set, which can often be noisy or difficult to derive for most biological problems, making our method more applicable than classification-based data mining methods in general biological studies.

Conclusion: We believe that our program will prove to be very useful to biomedical researchers who are interested in finding serum markers, especially when they have candidate proteins derived through transcriptomic or proteomic analyses of diseased tissues. A computer program has been developed for prediction of blood-secretory proteins based on manifold ranking, which is accessible at our website http://csbl.bmb.uga.edu/publications/materials/qiliu/blood_secretory_protein.html.

Highlights

  • Computational identification of blood-secretory proteins, especially proteins with differentially expressed genes in diseased tissues, can provide highly useful information in linking transcriptomic data to proteomic studies for targeted disease biomarker discovery in serum

  • A computational framework for ranking blood-secretory proteins: We present a computational framework for blood-secretory protein prediction, consisting of the following steps, as shown in Figure 1: (a) a pre-processing step filters out the proteins most irrelevant to the positive samples, based on the criteria described in subsection F; (b) a weighted graph is constructed from the remaining proteins as the main data structure for solving the ranking problem; (c) this graph is sparsified with an efficient algorithm, elaborated in subsection D, in preparation for manifold ranking; (d) a semi-supervised ranking algorithm is applied to the sparsified graph to rank the proteins; and (e) the N highest-ranked proteins are output, where N is a user-specified parameter

  • We used the same test set in the following evaluation procedure to compare performance: (1) We randomly selected 10, 20 and 30 blood-secretory proteins from the test dataset as the queries and ranked all 3,681 proteins in the test set using the positive-samples-only manifold ranking algorithm and the support vector machine (SVM)-based algorithm
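
The graph-construction, sparsification and ranking steps listed in the highlights can be illustrated with a minimal sketch. It is not the authors' program: it assumes each protein is a numeric feature vector, uses a Gaussian kernel for edge weights, sparsifies by keeping each node's k nearest neighbours (a stand-in for the algorithm of subsection D, which is not reproduced here), and ranks with the standard closed-form manifold ranking solution f = (I − αS)⁻¹y, where y marks the query (known positive) proteins. The names `manifold_rank` and `top_n` and the parameters `k`, `alpha` and `sigma` are hypothetical.

```python
import numpy as np


def manifold_rank(X, query_idx, k=5, alpha=0.9, sigma=1.0):
    """Score every row of X by its relevance to the query rows.

    X         : (n, d) array of feature vectors (assumed representation).
    query_idx : indices of known positive samples used as queries.
    k, alpha, sigma : illustrative parameters, not values from the paper.
    """
    X = np.asarray(X, dtype=float)
    n = X.shape[0]

    # Weighted graph: Gaussian affinity on squared Euclidean distance.
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
    W = np.exp(-d2 / (2.0 * sigma ** 2))
    np.fill_diagonal(W, 0.0)

    # Sparsify: keep each node's k nearest neighbours, then symmetrise.
    masked = d2 + np.diag(np.full(n, np.inf))  # exclude self-matches
    nn = np.argsort(masked, axis=1)[:, :k]
    keep = np.zeros((n, n), dtype=bool)
    keep[np.repeat(np.arange(n), k), nn.ravel()] = True
    keep |= keep.T
    W = np.where(keep, W, 0.0)

    # Symmetric normalisation S = D^{-1/2} W D^{-1/2}.
    deg = W.sum(axis=1)
    inv_sqrt = 1.0 / np.sqrt(np.maximum(deg, 1e-12))
    S = W * inv_sqrt[:, None] * inv_sqrt[None, :]

    # Only positive (query) samples are labelled; no negative set is needed.
    y = np.zeros(n)
    y[list(query_idx)] = 1.0

    # Closed-form ranking scores: f = (I - alpha * S)^{-1} y.
    return np.linalg.solve(np.eye(n) - alpha * S, y)


def top_n(scores, n_top):
    """Return indices of the n_top highest-scoring samples."""
    return np.argsort(-scores)[:n_top]
```

Calling `top_n(manifold_rank(X, query_idx), N)` returns the indices of the N highest-ranked samples. Because scores spread from the queries along graph edges, samples in a graph component that contains no query receive a score of essentially zero.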


Summary

Introduction

Computational identification of blood-secretory proteins, especially proteins with differentially expressed genes in diseased tissues, can provide highly useful information in linking transcriptomic data to proteomic studies for targeted disease biomarker discovery in serum. Identification of disease markers in serum is a very important problem, but it is challenging because of the compositional complexity and the large dynamic range of proteins in human sera, which make direct comparative analyses of serum proteomic data between diseased and control samples exceedingly difficult [1,2]. [...]tion sites, disordered regions, secondary structural content and hydrophobicity were identified, which can potentially distinguish blood-secretory from non-blood-secretory proteins. Using these features, a classifier based on a support vector machine (SVM) was trained to distinguish the blood-secretory proteins from non-blood-secretory proteins. In our previous work [6], we took a rather conservative approach in selecting the negative dataset by leaving out a significant fraction of proteins that could potentially be non-blood-secretory proteins; as a result, the data may not adequately represent the whole space of the non-blood-secretory proteins.

