Abstract

Integrative analysis of diverse high-dimensional molecular, histopathological and clinical data provides an effective way to identify biologically and clinically relevant subclasses across multi-level data. However, identifying distinctive features and group structure in such data with unbiased integrative methods remains problematic. We propose a machine learning clustering technique based on random forest methods that enables unbiased integration. Using a permutation-based framework for tree construction and for measuring feature importance, robust and pure clusters can be produced. The performance of standard, regularised, and conditional inference random forest methods was evaluated using the adjusted Rand index, the Calinski-Harabasz index, and cluster and feature purity. In simulation studies, random forest clustering techniques were able to identify clusters of high purity. As a proof of concept on datasets from the UCI Machine Learning Repository, all three techniques identified clusters in mixed data, with the conditional inference method producing clusters with the highest feature purity. Next, we applied our clustering techniques to high-dimensional data obtained from two independent breast cancer studies: (i) International Cancer Genome Consortium (ICGC), consisting of 560 cases and 147 features, and (ii) Sweden Cancerome Analysis Network - Breast (SCAN-B), incorporating 241 cases and 53 features. Features included rearrangement and mutational signatures, somatic mutations in cancer drivers, germline mutations in BRCA1/2, genomic instability measures, intrinsic molecular breast cancer subtypes, and clinico-pathological characteristics. Despite dissimilarities in the breast cancer subtype composition between these two datasets, the conditional inference random forest method was able to identify concordant subgroups between the studies, supported by molecular and histopathological characteristics.
Moreover, novel relationships amongst molecular features with potential clinical relevance were revealed. For example, one cluster was enriched for BRCA2-deficient breast cancer cases with MYC amplifications, while another predominantly consisted of non-basal-like triple-negative breast cancers with PIK3CA mutations. Together, these results support the use of our machine learning clustering technique based on random forest methods to identify robust and biologically relevant group structures using complex high-dimensional mixed data.

Citation Format: Jelmar Quist, Lawson Taylor, Johan Staaf, Anita Grigoriadis. Application of random forest machine learning techniques on mixed data from breast cancer studies [abstract]. In: Proceedings of the Annual Meeting of the American Association for Cancer Research 2020; 2020 Apr 27-28 and Jun 22-24. Philadelphia (PA): AACR; Cancer Res 2020;80(16 Suppl):Abstract nr 2108.
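
The abstract evaluates clusterings with the adjusted Rand index and with cluster purity but does not spell out either measure. The following is a minimal sketch of both, using only the Python standard library. It is not the authors' implementation: the function names `adjusted_rand_index` and `cluster_purity` are hypothetical, and purity is assumed here to mean the fraction of samples that belong to the majority reference class of their assigned cluster, which may differ from the definition used in the study.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_true, labels_pred):
    """Adjusted Rand index between two labelings of the same samples."""
    if len(labels_true) != len(labels_pred):
        raise ValueError("labelings must have the same length")
    total_pairs = comb(len(labels_true), 2)
    if total_pairs == 0:
        # Fewer than two samples: no pairs to compare.
        return 1.0

    # Contingency counts: samples sharing each (reference, predicted) pair.
    pair_counts = Counter(zip(labels_true, labels_pred))
    true_counts = Counter(labels_true)
    pred_counts = Counter(labels_pred)

    # Number of sample pairs placed together in both, or in each, labeling.
    sum_pairs = sum(comb(c, 2) for c in pair_counts.values())
    sum_true = sum(comb(c, 2) for c in true_counts.values())
    sum_pred = sum(comb(c, 2) for c in pred_counts.values())

    expected = sum_true * sum_pred / total_pairs  # agreement expected by chance
    maximum = (sum_true + sum_pred) / 2           # largest attainable agreement
    if maximum == expected:
        # Degenerate case, e.g. a single cluster in both labelings.
        return 1.0
    return (sum_pairs - expected) / (maximum - expected)


def cluster_purity(labels_true, labels_pred):
    """Fraction of samples in the majority reference class of their cluster.

    Assumed definition; the abstract does not define purity.
    """
    members = {}
    for true_label, cluster in zip(labels_true, labels_pred):
        members.setdefault(cluster, Counter())[true_label] += 1
    majority_total = sum(c.most_common(1)[0][1] for c in members.values())
    return majority_total / len(labels_true)


if __name__ == "__main__":
    reference = [0, 0, 1, 2]
    predicted = [0, 0, 1, 1]
    print(round(adjusted_rand_index(reference, predicted), 4))
    print(cluster_purity(reference, predicted))
```

The adjusted Rand index corrects the raw pair agreement for chance, so a clustering identical to the reference up to relabeling scores 1 and a random assignment scores close to 0. Purity, by contrast, rewards splitting data into many small clusters, which is one reason to report the two together.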
