Abstract

Gene selection is an important task in cancer survival analysis. Most existing supervised learning methods can use only labeled biological data, while censored data (weakly labeled data), which are far more numerous than the labeled data, are ignored in model building. To exploit the information in the censored data, a semi-supervised learning framework (the Cox-AFT model) that combines the Cox proportional hazards (Cox) model and the accelerated failure time (AFT) model has been used in cancer research, and it performs better than either the Cox or the AFT model alone. This method, however, is easily affected by noise. To alleviate this problem, in this paper we combine the Cox-AFT model with the self-paced learning (SPL) method to employ the information in the censored data more effectively in a self-learning way. SPL is a reliable and stable learning mechanism, recently proposed to simulate the human learning process; here it helps the AFT model automatically identify high-confidence samples and include them in training, minimizing interference from highly noisy samples. Using SPL produces two direct advantages: (1) the utilization of censored data is further promoted; (2) the noise delivered to the model is greatly decreased. The experimental results demonstrate the effectiveness of the proposed model compared with the traditional Cox-AFT model.

Highlights

  • Disease-related gene selection has great potential in outcome prediction for cancer research.

  • The lack of sufficient information in the labeled dataset tends to make predictions inaccurate. To address this, the accelerated failure time (AFT) model is employed to estimate the true survival time for the censored data, so that more disease information from the censored data can be delivered to the Cox model, helping it produce better predictions.

  • Step 4: The censoring time point was selected at random, and the censored time y′i was computed as y′i = rand(1) * yi, where yi is the true survival time, y′i is the observed time, Xi is the gene expression profile, and δi indicates whether the sample is censored.
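
The random censoring in Step 4 can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the function name `simulate_censoring`, the `censor_rate` parameter, the way censored samples are chosen, and the convention δi = 1 for an observed event and δi = 0 for a censored sample are all assumptions made here. Only the rule y′i = rand(1) * yi comes from the text.

```python
import numpy as np

def simulate_censoring(y_true, censor_rate, seed=0):
    """Censor a random subset of samples using y'_i = u_i * y_i, u_i ~ Uniform[0, 1).

    Hypothetical helper: censor_rate and the choice of which samples to
    censor are assumptions, not taken from the source.
    """
    rng = np.random.default_rng(seed)
    y_true = np.asarray(y_true, dtype=float)
    n = y_true.shape[0]
    n_cens = int(round(censor_rate * n))

    # Pick which samples become censored (assumed: uniformly at random).
    censored_idx = rng.choice(n, size=n_cens, replace=False)

    # Assumed convention: delta = 1 means event observed, 0 means censored.
    delta = np.ones(n, dtype=int)
    delta[censored_idx] = 0

    # Observed time equals the true time unless the sample is censored.
    y_obs = y_true.copy()
    y_obs[censored_idx] = rng.random(n_cens) * y_true[censored_idx]
    return y_obs, delta
```

Because the multiplier is drawn from [0, 1), every censored observation falls before the true survival time, which is the defining property of right-censored data.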

Summary

Introduction

Disease-related gene selection has great potential in outcome prediction for cancer research. The high dimensionality and low sample size of biological data greatly increase the difficulty of cancer survival analysis. The problem is statistically challenging because the number of genes is far larger than the number of labeled samples. The lack of sufficient information in the labeled dataset tends to make predictions inaccurate. To address this, the AFT model is employed to estimate the true survival time for the censored data, so that more disease information from the censored data can be delivered to the Cox model, helping it produce better predictions. In the Cox-AFT model, the Cox model is used to classify disease data with similar phenotypes into ‘low risk’ and ‘high risk’ subgroups, and each subgroup is passed to its own AFT model to obtain an approximate estimate of the survival time for the censored data. These pseudo-labeled censored data are then fed into the Cox model as labeled data.
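
The pseudo-labeling step of the Cox-AFT data flow described above might look roughly like the following sketch. Several pieces are assumptions introduced for illustration: the risk scores are taken as given from an already fitted Cox model, the low/high split uses the median risk, and the AFT fit is replaced by a simple ridge-stabilized least-squares regression on log survival time. The helper names (`fit_log_linear_aft`, `predict_time`, `pseudo_label_censored`) are hypothetical, and the self-paced sample selection is not shown.

```python
import numpy as np

def fit_log_linear_aft(X, t, ridge=1e-3):
    """Least-squares fit of log(t) = b0 + X @ b (a simple stand-in for an AFT fit).

    Requires t > 0. The small ridge term is an assumption added to keep the
    system solvable when genes outnumber samples.
    """
    A = np.column_stack([np.ones(len(X)), X])
    reg = ridge * np.eye(A.shape[1])
    reg[0, 0] = 0.0  # leave the intercept unpenalized
    return np.linalg.solve(A.T @ A + reg, A.T @ np.log(t))

def predict_time(coef, X):
    """Predicted survival time from the log-linear coefficients."""
    A = np.column_stack([np.ones(len(X)), X])
    return np.exp(A @ coef)

def pseudo_label_censored(X, y_obs, delta, risk):
    """Replace censored times with per-subgroup AFT estimates.

    risk: scores assumed to come from a previously fitted Cox model.
    delta: 1 = event observed, 0 = censored (assumed convention).
    """
    X = np.asarray(X, dtype=float)
    y_obs = np.asarray(y_obs, dtype=float)
    delta = np.asarray(delta)
    risk = np.asarray(risk, dtype=float)

    y_pseudo = y_obs.copy()
    high = risk > np.median(risk)  # assumed split rule: median risk
    for group in (high, ~high):
        labeled = group & (delta == 1)
        censored = group & (delta == 0)
        if labeled.sum() == 0 or censored.sum() == 0:
            continue  # nothing to fit on, or nothing to estimate
        coef = fit_log_linear_aft(X[labeled], y_obs[labeled])
        est = predict_time(coef, X[censored])
        # Under right censoring the true time is at least the observed time.
        y_pseudo[censored] = np.maximum(est, y_obs[censored])
    return y_pseudo
```

The returned times can then be treated as labeled data when the Cox model is refitted. Clipping each estimate at the observed censoring time is a design choice of this sketch; it keeps a pseudo label from contradicting what was actually observed.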

