Abstract
Background: Numerous publications attempt to predict cancer survival outcome from gene expression data using machine-learning methods. A direct comparison of these works is challenging for two reasons: (1) inconsistent measures are used to evaluate the performance of different models, and (2) critical stages in the process of knowledge discovery are incompletely specified. There is a need for a platform that would allow researchers to replicate previous works and to test the impact of changes in the knowledge discovery process on the accuracy of the induced models.
Results: We developed the PCM-SABRE platform, which supports the entire knowledge discovery process for cancer outcome analysis. PCM-SABRE was developed using KNIME. By using PCM-SABRE to reproduce the results of previously published works on breast cancer survival, we define a baseline for evaluating future attempts to predict cancer outcome with machine learning. We used PCM-SABRE to replicate previous works that describe predictive models of breast cancer recurrence, and tested the performance of all possible combinations of feature selection methods and data mining algorithms that were used in either of the works. We reconstructed the work of Chou et al., observing similar trends: superior performance of the Probabilistic Neural Network (PNN) and logistic regression (LR) algorithms, and an inconclusive impact of feature pre-selection with the decision tree algorithm on subsequent analysis.
Conclusions: PCM-SABRE is a software tool that provides an intuitive environment for rapid development of predictive models in cancer precision medicine.
Highlights
Numerous publications attempt to predict cancer survival outcome from gene expression data using machine-learning methods
We use KNIME rather than the original software (Clementine 10.1) and we use as input data a more current compendium of expression data [7]
We reconstructed the work of Chou et al., observing the superior performance of the Probabilistic Neural Network (PNN) and logistic regression (LR) algorithms over the decision tree (DT) algorithm, but the impact of feature pre-selection with the DT algorithm on subsequent analysis was inconclusive
Summary
We developed PCM-SABRE (available as Additional file 1) as a software system that allows users to compare and improve expression-based predictive models for cancer patients. We used PCM-SABRE to replicate previous works that describe predictive models of breast cancer recurrence, and evaluated the performance of all possible combinations of feature selection methods and data mining algorithms that were used in either of the works. A preprocessing step was added that reproduces the preprocessing performed in the original paper. This step was conducted with a specialized R script written for this purpose. In contrast to the original work, PCM-SABRE reports that LR has the best performance. The two analyses also show different trends when the DT feature selection method is added. It is worth noting that the estimated accuracy reported by PCM-SABRE is higher than in the original work. This may be because a different dataset was used for the analysis. Random forest (RF) performed better when combined with the ANOVA feature selection method and achieved the highest accuracy (77.70%).
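The idea of evaluating every combination of feature selection method and algorithm can be illustrated with a minimal Python sketch. This is not the PCM-SABRE workflow, which is built in KNIME. The toy expression matrix is made up for illustration, and the nearest-centroid and majority-class classifiers are simple stand-ins chosen to keep the example dependency-free; they are not the PNN, LR, DT, or RF algorithms used in the paper. All function names here are hypothetical.

```python
from itertools import product
from statistics import mean

# Toy data, fabricated for illustration only: rows are patients, columns are genes.
X = [
    [5.1, 0.2, 3.3],
    [4.9, 0.1, 3.1],
    [5.0, 0.3, 3.4],
    [1.2, 0.2, 3.2],
    [1.0, 0.1, 3.5],
    [1.1, 0.3, 3.0],
]
y = [1, 1, 1, 0, 0, 0]  # 1 = recurrence, 0 = no recurrence


def anova_f(values, labels):
    """One-way ANOVA F statistic of a single feature across the class labels."""
    groups = {}
    for v, t in zip(values, labels):
        groups.setdefault(t, []).append(v)
    grand = mean(values)
    between = sum(len(g) * (mean(g) - grand) ** 2 for g in groups.values())
    between /= len(groups) - 1
    within = sum((v - mean(g)) ** 2 for g in groups.values() for v in g)
    within /= len(values) - len(groups)
    return between / within if within > 0 else float("inf")


def select_all(X, y):
    """No feature selection: keep every column."""
    return list(range(len(X[0])))


def select_top_f(X, y, k=1):
    """Keep the k columns with the highest ANOVA F statistic."""
    scores = [anova_f(col, y) for col in zip(*X)]
    return sorted(range(len(scores)), key=lambda j: -scores[j])[:k]


def nearest_centroid(train_X, train_y, x):
    """Stand-in classifier: predict the class whose mean profile is closest."""
    centroids = {}
    for label in set(train_y):
        rows = [r for r, t in zip(train_X, train_y) if t == label]
        centroids[label] = [mean(col) for col in zip(*rows)]
    return min(
        centroids,
        key=lambda c: sum((a - b) ** 2 for a, b in zip(x, centroids[c])),
    )


def majority_class(train_X, train_y, x):
    """Baseline classifier: always predict the most frequent training label."""
    return max(set(train_y), key=train_y.count)


def loo_accuracy(select, classify, X, y):
    """Leave-one-out accuracy; features are re-selected inside each fold."""
    hits = 0
    for i in range(len(X)):
        train_X, train_y = X[:i] + X[i + 1:], y[:i] + y[i + 1:]
        cols = select(train_X, train_y)
        train_proj = [[row[c] for c in cols] for row in train_X]
        test_proj = [X[i][c] for c in cols]
        hits += classify(train_proj, train_y, test_proj) == y[i]
    return hits / len(X)


selectors = {"none": select_all, "anova_top1": select_top_f}
classifiers = {"nearest_centroid": nearest_centroid, "majority": majority_class}

# Evaluate every (feature selection, classifier) pair with the same measure.
results = {
    (s, c): loo_accuracy(selectors[s], classifiers[c], X, y)
    for s, c in product(selectors, classifiers)
}

for (s, c), acc in results.items():
    print(f"{s:>10} + {c:<16} accuracy = {acc:.2f}")
```

Selecting features inside each fold, rather than once on the full dataset, keeps the held-out sample from influencing which genes are chosen. Using one shared evaluation function for all pairs is what makes the resulting accuracies directly comparable.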