Abstract
We apply our statistically deterministic machine learning/clustering algorithm *K-means (recently developed in https://ssrn.com/abstract=2908286) to 10,656 published exome samples for 32 cancer types. A majority of cancer types exhibit a mutation clustering structure. Our results are in-sample stable. They are also out-of-sample stable when applied to 1,389 published genome samples across 14 cancer types. In contrast, we find in- and out-of-sample instabilities in cancer signatures extracted from exome samples via nonnegative matrix factorization (NMF), a computationally costly and non-deterministic method. Extracting stable mutation structures from exome data could have important implications for speed and cost, which are critical for early-stage cancer diagnostics, such as novel blood-test methods currently in development.
Highlights
Unless humanity finds a cure, about a billion people alive today will die of cancer.
Apart from applying *K-means to exome data, we perform out-of-sample stability analysis of our results here. We use data consisting of 10,656 published exome samples aggregated by 32 cancer types listed in Table 1, which summarizes total occurrence counts, numbers of samples and data sources.
We use iter.max = 100, num.try = 1000, and num.runs = 30,000. Here iter.max is the maximum number of iterations used in the built-in R function kmeans(), which produces a warning if it does not converge within iter.max. There was not a single instance in our 30 million runs of kmeans() (i.e., num.try × num.runs) where more iterations were required.
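The iteration cap and the convergence check in these settings can be illustrated with a minimal sketch. It is not *K-means itself: how num.try and num.runs enter *K-means is specified in the cited paper and is not reproduced here. The source uses R's built-in kmeans(), whose default algorithm is Hartigan-Wong; the pure-Python stand-in below uses the simpler Lloyd iteration. The names `kmeans_lloyd` and `count_nonconverged` and the toy data are hypothetical, chosen only for illustration.

```python
import random


def kmeans_lloyd(points, k, iter_max=100, seed=None):
    """Plain Lloyd-type k-means on a list of numeric tuples.

    Returns (labels, centers, converged); converged is False when the
    assignments were still changing on the last allowed iteration.
    """
    rng = random.Random(seed)
    # Start from k distinct data points chosen at random.
    centers = [list(p) for p in rng.sample(points, k)]
    labels = [None] * len(points)
    for _ in range(iter_max):
        # Assign every point to its nearest center (squared Euclidean distance).
        new_labels = [
            min(range(k),
                key=lambda j: sum((a - b) ** 2 for a, b in zip(p, centers[j])))
            for p in points
        ]
        if new_labels == labels:
            return labels, centers, True
        labels = new_labels
        # Move each center to the mean of its members; an empty cluster keeps its center.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                centers[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centers, False


def count_nonconverged(points, k, num_try, iter_max=100):
    """Count runs, over seeds 0..num_try-1, that hit the iteration cap."""
    return sum(
        1 for s in range(num_try)
        if not kmeans_lloyd(points, k, iter_max, seed=s)[2]
    )


# Toy data (hypothetical): two well-separated groups of 2-D points.
toy = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
labels, centers, converged = kmeans_lloyd(toy, k=2, iter_max=100, seed=0)
print(labels, converged)
print(count_nonconverged(toy, k=2, num_try=50))
```

A count of zero from `count_nonconverged` plays the same role as the absence of convergence warnings reported above: it suggests that the chosen cap was never the binding constraint in any run.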
Summary
Unless humanity finds a cure, about a billion people alive today will die of cancer. Unlike other diseases, cancer occurs at the DNA level via somatic alterations in the genome. Considering that various signatures may be somatic mutational noise artifacts in the first instance and that statistical error bars are large, it is natural to wonder whether there are some robust underlying clustering structures present in the data, with the understanding that such structures may not be present for all cancer types. Even if they are present only for a substantial number of cancer types, unveiling them would amount to a major step forward in understanding cancer signature structure. We discuss how the input data (i.e., matrices of somatic mutation counts for cancer exomes) are used in the context of *K-means in Section 3.2 (see [16] for technical details of *K-means).