Abstract
In the study of complex genetic diseases, identifying subgroups of patients who share similar genetic characteristics is a challenging task that can, for example, improve treatment decisions. One type of genetic lesion frequently investigated in such disorders is the change of the DNA copy number (CN) at specific genomic traits. Non-negative Matrix Factorization (NMF) is a standard technique to reduce the dimensionality of a data set and to cluster data samples, while keeping its most relevant information in meaningful components. Thus, it can be used to discover subgroups of patients from CN profiles. It is, however, computationally impractical for very high-dimensional data, such as CN microarray data. Deciding the most suitable number of subgroups is also a challenging problem. The aim of this work is to derive a procedure to compact high-dimensional data, in order to improve NMF applicability without compromising the quality of the clustering. This is particularly important for analyzing high-resolution microarray data. Many commonly used quality measures, as well as our own measures, are employed to decide the number of subgroups and to assess the quality of the results. Our measures are based on the idea of identifying robust subgroups, inspired by biological/clinical relevance instead of simply aiming at well-separated clusters. We evaluate our procedure using four real, independent data sets. In these data sets, our method was able to find accurate subgroups with individual molecular and clinical features and outperformed standard NMF in terms of accuracy in the factorization fitness function. Hence, it can be useful for the discovery of subgroups of patients with similar CN profiles in the study of heterogeneous diseases.
Highlights
Discovery of disease subtypes or of subgroups of patients sharing common characteristics is a challenging task in biomedical research, especially in the study of complex and heterogeneous genetic disorders
We show the performance of our procedure in the analysis of two data sets of patients with diffuse large B-cell lymphoma (DLBCL), one with breast cancer and one with medulloblastoma
In this work we presented the Compact-NMF (Non-negative Matrix Factorization) procedure, which specifically targets the factorization of high-dimensional data sets, providing higher-quality clustering results than the direct application of NMF
Summary
Discovery of disease subtypes or of subgroups of patients sharing common characteristics is a challenging task in biomedical research, especially in the study of complex and heterogeneous genetic disorders. CN changes are defined as lesions in which the number of copies is different from two. We can classify these CN aberrations into the following four categories: homozygous deletion (loss of two copies), heterozygous loss (loss of one copy), gain (number of copies equal to three or four), and amplification (number of copies greater than four); see Table 1. The identification of these types of lesions is important, for example, in cancer studies. SNP microarrays are able to measure both the CN and the loss of heterozygosity (LOH) at hundreds of thousands or even millions of SNPs along the genome [1].
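The four categories above can be illustrated with a minimal Python sketch that maps an integer copy number to its category. The function name `classify_copy_number` and the label "normal" for a copy number of two are assumptions made for illustration; they are not taken from the paper, and real microarray data would first need to be segmented and rounded to integer copy numbers.

```python
def classify_copy_number(cn: int) -> str:
    """Map an integer DNA copy number to a CN aberration category.

    Hypothetical helper: the thresholds follow the definitions in the
    text; the "normal" label for cn == 2 is an assumption.
    """
    if cn < 0:
        raise ValueError("copy number cannot be negative")
    if cn == 0:
        return "homozygous deletion"  # loss of two copies
    if cn == 1:
        return "heterozygous loss"    # loss of one copy
    if cn == 2:
        return "normal"               # not a CN aberration
    if cn <= 4:
        return "gain"                 # three or four copies
    return "amplification"            # more than four copies


# Example: classify a short, made-up profile of copy numbers.
profile = [2, 1, 0, 3, 6]
categories = [classify_copy_number(cn) for cn in profile]
```

Encoding the categories as a single function keeps the thresholds in one place, so a different convention (for example, treating four copies as amplification) would require changing only one line.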