Abstract

Machine learning models for predicting disease development and prognosis have become an integral part of cancer studies aimed at improving the subsequent therapy and management of patients. The main objective of the presented study is to apply machine learning models to accurately predict survival time in breast cancer on the basis of clinical data. The paper discusses an approach in which the main factor used to predict survival time is an originally developed tumor-integrated clinical feature, which combines tumor stage, tumor size, and age at diagnosis. Two datasets from corresponding breast cancer studies are united through a data integration approach based on horizontal and vertical integration, using document-oriented and graph databases that show good performance and no data loss. Aside from data normalization and classification, the applied machine learning methods provide promising results in terms of the accuracy of survival time prediction. The analysis of our experiments shows an advantage for linear Support Vector Regression, Lasso regression, Kernel Ridge regression, K-neighborhood regression, and Decision Tree regression; these models achieve the most accurate survival prognosis results. Cross-validation for accuracy demonstrates the best performance of the same models on the studied breast cancer data. In support of the proposed approach, a Python-based workflow has been developed, and plans for its further improvement are discussed at the end of the paper.
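The cross-validation step mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' workflow: it assumes a single numeric input feature, uses a hand-written k-nearest-neighbors regressor in place of a library model, scores with mean absolute error (the abstract does not say which metric was used), and runs on synthetic data generated only for illustration. The names `knn_predict` and `k_fold_mae` are hypothetical.

```python
import random
from statistics import fmean


def knn_predict(train_x, train_y, x, k=3):
    """Predict y for x as the mean target of the k nearest training points."""
    nearest = sorted(range(len(train_x)), key=lambda i: abs(train_x[i] - x))[:k]
    return fmean(train_y[i] for i in nearest)


def k_fold_mae(xs, ys, n_folds=5, k=3, seed=0):
    """Mean absolute error of the k-NN regressor, averaged over n_folds folds."""
    idx = list(range(len(xs)))
    random.Random(seed).shuffle(idx)
    # Split the shuffled indices into n_folds disjoint held-out sets.
    folds = [idx[f::n_folds] for f in range(n_folds)]
    fold_errors = []
    for held_out in folds:
        held = set(held_out)
        train = [i for i in idx if i not in held]
        tx = [xs[i] for i in train]
        ty = [ys[i] for i in train]
        errs = [abs(knn_predict(tx, ty, xs[i], k) - ys[i]) for i in held_out]
        fold_errors.append(fmean(errs))
    return fmean(fold_errors)


# Synthetic data (assumption): a standardized feature and a noisy
# "survival time in months" that decreases as the feature grows.
rng = random.Random(42)
feature = [rng.uniform(-2, 2) for _ in range(100)]
survival = [60 - 10 * x + rng.gauss(0, 2) for x in feature]

mae = k_fold_mae(feature, survival)
print(f"5-fold cross-validated MAE: {mae:.2f} months")
```

Each sample is held out exactly once, so the averaged error estimates how the model would perform on data it has not seen. Comparing several regressors would amount to repeating the same loop with a different prediction function.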

Highlights

  • In the last decade, high-throughput technologies have been massively used alongside clinical tests to study various diseases to decipher the underlying biological mechanisms and devise novel therapeutic strategies

  • We focus on a few machine learning (ML) techniques, such as Support Vector Regression, Kernel Ridge, K-neighborhood regression, Decision Tree, and Multi-layer perceptron regression, for analyzing existing count data

  • For the purposes of survival time prognosis, we normalize both datasets based on the Tumor-Integrated Clinical Feature (TICF) by removing the mean and scaling to unit variance
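The normalization named in the last highlight, removing the mean and scaling to unit variance, can be sketched as follows. This is a minimal illustration using only the Python standard library, not the authors' implementation. The TICF values are made up, since the source does not give the formula for TICF or any data. The sketch divides by the population standard deviation, which is also what scikit-learn's `StandardScaler` does.

```python
from statistics import fmean, pstdev


def standardize(values):
    """Return z-scores: subtract the mean, then divide by the population std."""
    mu = fmean(values)
    sigma = pstdev(values, mu)
    if sigma == 0:
        # A constant feature carries no information; map it to zeros.
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


# Hypothetical TICF values (illustrative only, not taken from the study).
ticf = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
scaled = standardize(ticf)
print(scaled)
```

After this transformation the feature has zero mean and unit variance, so both datasets are on a common scale before they are passed to the regression models.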


Summary

Introduction

High-throughput technologies have been widely used alongside clinical tests to study various diseases, decipher the underlying biological mechanisms, and devise novel therapeutic strategies. The generated high-throughput data often correspond to measurements of different biological entities (e.g., gene expression, RNA transcripts, proteins), represent various views on the same entity (e.g., genetic, epigenetic), and are created through different technologies (e.g., microarrays, next-generation sequencing) [1,2]. It is still very difficult, even for experts, to distinguish tumors using modern methods such as immunohistochemistry or DNA or RNA hybridization. New knowledge-based diagnostic methods for tumor detection are being developed intensively and rapidly, with extended use of tools from bioinformatics, computer science, statistics, and machine learning. However, many of these methods are difficult to integrate and combine into a meaningful workflow. With the

Information 2019, 10, 93; doi:10.3390/info10030093 www.mdpi.com/journal/information
