Abstract

We apply our statistically deterministic machine learning/clustering algorithm *K-means (recently developed in https://ssrn.com/abstract=2908286) to 10,656 published exome samples for 32 cancer types. A majority of cancer types exhibit a mutation clustering structure. Our results are in-sample stable. They are also out-of-sample stable when applied to 1,389 published genome samples across 14 cancer types. In contrast, we find in- and out-of-sample instabilities in cancer signatures extracted from exome samples via nonnegative matrix factorization (NMF), a computationally costly and non-deterministic method. Extracting stable mutation structures from exome data could have important implications for speed and cost, which are critical for early-stage cancer diagnostics, such as novel blood-test methods currently in development.

Highlights

  • Unless humanity finds a cure, about a billion people alive today will die of cancer.

  • Apart from applying *K-means to exome data, we perform an out-of-sample stability analysis of our results here. We use data consisting of 10,656 published exome samples aggregated by 32 cancer types listed in Table 1, which summarizes total occurrence counts, numbers of samples, and data sources.

  • We use iter.max = 100 (the maximum number of iterations in the built-in R function kmeans(); not a single one of our 30 million runs of kmeans() required more iterations, and kmeans() produces a warning if it does not converge within iter.max), num.try = 1000, and num.runs = 30,000.
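To make the role of the iteration cap concrete, the following is a minimal sketch, not the *K-means algorithm of [16] and not R's kmeans() (whose default algorithm is Hartigan-Wong). It assumes a plain Lloyd-style k-means in pure Python that reports whether it converged within iter_max, and then counts non-converged runs across many random restarts. The names kmeans_lloyd, count_nonconverged, and toy_points are hypothetical and chosen for illustration only. Note that num.try × num.runs = 1000 × 30,000 = 30,000,000, which is consistent with the 30 million runs quoted above; how *K-means aggregates those runs is described in [16] and is not reproduced here.

```python
import random


def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(p, q))


def kmeans_lloyd(points, k, iter_max=100, seed=0):
    """Plain Lloyd-style k-means (illustrative assumption, not *K-means).

    Returns (labels, converged); converged is False when the assignments
    are still changing after iter_max iterations, the analogue of the
    warning that R's kmeans() produces.
    """
    rng = random.Random(seed)
    # Initialize the centers with k distinct data points chosen at random.
    centers = [list(p) for p in rng.sample(points, k)]
    labels = [-1] * len(points)
    for _ in range(iter_max):
        # Assignment step: each point goes to its nearest center.
        new_labels = [
            min(range(k), key=lambda c: sq_dist(p, centers[c])) for p in points
        ]
        if new_labels == labels:
            return labels, True  # assignments stable: converged
        labels = new_labels
        # Update step: move each center to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # an empty cluster keeps its previous center
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, False


def count_nonconverged(points, k, num_try, iter_max=100):
    """Count how many of num_try random restarts hit the iteration cap."""
    return sum(
        1
        for s in range(num_try)
        if not kmeans_lloyd(points, k, iter_max=iter_max, seed=s)[1]
    )


# Toy data (hypothetical): two small, well-separated groups in the plane.
toy_points = [
    (0.0, 0.0), (0.3, 0.1), (0.1, 0.4),
    (5.0, 5.2), (5.3, 4.9), (4.8, 5.1),
]

if __name__ == "__main__":
    print("non-converged runs:", count_nonconverged(toy_points, k=2, num_try=50))
    print("total kmeans() calls implied by the settings:", 1000 * 30_000)
```

A count of zero non-converged runs is what the statement above reports for the actual data: iter.max = 100 was never the binding constraint.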


Summary

Introduction and Summary

Unless humanity finds a cure, about a billion people alive today will die of cancer. Unlike other diseases, cancer occurs at the DNA level via somatic alterations in the genome. Considering that various signatures may be somatic mutational noise artifacts in the first instance and that statistical error bars are large, it is natural to wonder whether there are some robust underlying clustering structures present in the data, with the understanding that such structures may not be present for all cancer types. Even if they are present only for a substantial number of cancer types, unveiling them would amount to a major step forward in understanding cancer signature structure. We discuss how the input data (i.e., matrices of somatic mutation counts for cancer exomes) are used in the context of *K-means in Section 3.2 (see [16] for technical details of *K-means).

Data Summary
Structure of the Data
Exome Data Results
Within-Cluster Correlations
Overall Correlations
Interpretation
Concluding Remarks
