Abstract

Analysing breast cancer data is a long-established research topic from both medical diagnosis and data modeling perspectives. Numerous predictive models have been employed in modeling breast cancer data, e.g., predicting a patient's survival rate given certain medical circumstances and the patient's demographics. However, these predictive models tend to take a black-box approach and therefore can hardly provide explainable results that could be applied for diagnostic purposes, particularly when neural network-based models are utilised. On the other hand, identifying diagnostically influential factors with exploratory descriptive models has proven difficult due to the high dimensionality of the breast cancer data under consideration. For instance, the breast cancer data provided by the Surveillance, Epidemiology, and End Results (SEER) Program typically has more than 100 dimensions of numeric and categorical data types and could expand to about 1,000 dimensions for analysis if orthogonal (one-hot) encoding is applied. Hence, effectively interpreting and understanding high-dimensional data becomes crucial in modelling cancer data. For this reason, dimensionality reduction and manifold learning algorithms have been studied intensively, and many relevant algorithms are available, each with its own pros and cons. In this chapter, a comparative study is presented that aims to provide visualized, explainable insights into breast cancer survival rate analysis and to identify critical influential factors that strongly determine the likelihood of a patient's survival. Two dimensionality reduction algorithms are considered in this study for comparison: one is the popular t-distributed stochastic neighbor embedding (t-SNE) algorithm and the other is a relatively new same degree distribution (SDD) algorithm.
The experiments demonstrate that, based on the same embedding performance assessment metrics, the SDD algorithm achieves much better data embedding results, which could be difficult or impossible to obtain with t-SNE. Furthermore, using the reliable embedding results from SDD, meaningful and explainable factors have been identified that crucially reflect the similarities among the patients who survived and the diversity among the patients who, unfortunately, died. Clusters of patients who survived are clearly recognizable in a two-dimensional embedding space, whereas the embedded points of patients who died are widely scattered in the space. The entire package of code used for the analysis is available for replication.
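
To illustrate why one-hot encoding inflates the dimensionality mentioned above, the following is a minimal sketch in plain Python. It is not the chapter's own code: the helper `one_hot_encode`, the field names, and the toy records are hypothetical and do not correspond to real SEER columns. Each categorical field is replaced by one indicator column per distinct level, so a field with many levels contributes many dimensions.

```python
from typing import Dict, List, Tuple

Record = Dict[str, object]


def one_hot_encode(
    records: List[Record], categorical: List[str]
) -> Tuple[List[str], List[List[float]]]:
    """Expand each categorical field into one 0/1 column per distinct level.

    Hypothetical helper for illustration only; numeric fields are passed
    through unchanged and placed before the indicator columns.
    """
    # Distinct levels observed for every categorical field, in sorted order.
    levels = {col: sorted({str(r[col]) for r in records}) for col in categorical}
    numeric = [col for col in records[0] if col not in categorical]

    columns = list(numeric)
    for col in categorical:
        columns += [f"{col}={lvl}" for lvl in levels[col]]

    rows = []
    for r in records:
        row = [float(r[col]) for col in numeric]
        for col in categorical:
            row += [1.0 if str(r[col]) == lvl else 0.0 for lvl in levels[col]]
        rows.append(row)
    return columns, rows


# Toy records with made-up fields and values (assumption, not SEER data).
patients = [
    {"age": 52, "tumor_size_mm": 18, "grade": "II", "laterality": "left"},
    {"age": 67, "tumor_size_mm": 35, "grade": "III", "laterality": "right"},
    {"age": 45, "tumor_size_mm": 9, "grade": "I", "laterality": "left"},
]

columns, matrix = one_hot_encode(patients, ["grade", "laterality"])
print(f"{len(patients[0])} original fields -> {len(columns)} encoded dimensions")
print(columns)
```

In this toy case four fields become seven dimensions. With roughly 100 fields, many of them categorical with numerous levels, the same mechanism plausibly accounts for growth towards the order of 1,000 dimensions described above.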
