Knowledge driven decomposition of tumor expression profiles

Martin H Van Vliet, Lodewyk FA Wessels, Marcel JT Reinders

doi:10.1186/1471-2105-10-s1-s20

Abstract

Background: Tumors have been hypothesized to be the result of a mixture of oncogenic events, some of which will be reflected in the gene expression of the tumor. Based on this hypothesis, a variety of data-driven methods have been employed to decompose tumor expression profiles into component profiles that are hypothetically linked to these events. The resulting data-driven components are often interpreted by post-hoc comparison to, for instance, functional groupings of genes into gene sets. None of the data-driven methods allows that type of knowledge to be incorporated directly into the decomposition.

Results: We present a linear model that uses knowledge-driven, pre-defined components to perform the decomposition. We solve this decomposition model in a constrained linear least squares fashion. From a variety of options, a lasso-based solution to the model performs best in linking single gene perturbation data to mouse data. Moreover, we show the decomposition of expression profiles from human breast cancer samples into single gene perturbation profiles and gene sets that are linked to the hallmarks of cancer. For these breast cancer samples we were able to discern several links between clinical parameters and the decomposition weights, providing new insights into the biology of these tumors. Lastly, we show that the order in which the lasso regularization shrinks the weights unveils consensus patterns within clinical subgroups of the breast cancer samples.

Conclusion: The proposed lasso-based constrained least squares decomposition provides a stable and relevant relation between samples and knowledge-based components, and is thus a viable alternative to data-driven methods. In addition, the consensus order of component importance within clinical subgroups provides a better molecular characterization of the subtypes.

Highlights

Tumors have been hypothesized to be the result of a mixture of oncogenic events, some of which will be reflected in the gene expression of the tumor.
We construct a C matrix in which each column consists of the class means of one of the five perturbation classes represented in the human mammary epithelial cell (HMEC) culture samples (Myc, Ras, E2F3, Src, and β-catenin).
A perturbation is unlikely to affect all genes, so many genes are irrelevant with respect to a specific perturbation and contribute only noise to the modeling problem.
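
As a rough illustration of the ideas above, the following sketch builds a component matrix from class means and then decomposes one sample profile with an L1 penalty. It is not the authors' implementation: the helper names (`class_means`, `lasso_decompose`), the toy expression values, the penalty value `lam`, and the choice to solve the penalized (Lagrangian) form of the lasso by coordinate descent are all assumptions made for illustration.

```python
# Illustrative sketch only: toy data and hypothetical helper names.

def class_means(samples, labels, classes):
    """Build C (genes x classes): column k is the mean profile of class k."""
    n_genes = len(samples[0])
    C = [[0.0] * len(classes) for _ in range(n_genes)]
    for k, cls in enumerate(classes):
        members = [s for s, lab in zip(samples, labels) if lab == cls]
        for g in range(n_genes):
            C[g][k] = sum(m[g] for m in members) / len(members)
    return C


def soft_threshold(z, lam):
    """Shrink z toward zero by lam; values within [-lam, lam] become 0."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0


def lasso_decompose(x, C, lam, n_iter=200):
    """Minimise 0.5 * ||x - C w||^2 + lam * ||w||_1 by coordinate descent."""
    n_genes, n_comp = len(C), len(C[0])
    w = [0.0] * n_comp
    residual = list(x)  # equals x - C w while w is all zeros
    col_sq = [sum(C[g][k] ** 2 for g in range(n_genes)) for k in range(n_comp)]
    for _ in range(n_iter):
        for k in range(n_comp):
            if col_sq[k] == 0.0:
                continue  # an all-zero component cannot explain anything
            # Correlation of column k with the residual that excludes
            # component k's own current contribution.
            rho = sum(C[g][k] * (residual[g] + C[g][k] * w[k])
                      for g in range(n_genes))
            new_w = soft_threshold(rho, lam) / col_sq[k]
            delta = new_w - w[k]
            if delta != 0.0:
                for g in range(n_genes):
                    residual[g] -= C[g][k] * delta
                w[k] = new_w
    return w


# Toy perturbation samples over four genes (made-up values, two classes).
samples = [[2, 0, 0, 0], [4, 0, 0, 0], [0, 1, 0, 0], [0, 3, 0, 0]]
labels = ["Myc", "Myc", "Ras", "Ras"]
C = class_means(samples, labels, ["Myc", "Ras"])

# A made-up sample profile: mostly Myc-like, plus small noise on other genes.
x = [6.0, 0.2, 0.5, -0.5]
w = lasso_decompose(x, C, lam=1.0)
print([round(v, 3) for v in w])  # prints [1.889, 0.0]
```

In this toy case the last two genes respond to neither class mean, so they cannot influence the weights, and the penalty drives the small Ras weight to exactly zero. With `lam=0.0` the same routine returns the ordinary least squares weights.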

Summary

Introduction

Tumors have been hypothesized to be the result of a mixture of oncogenic events, some of which will be reflected in the gene expression of the tumor. Based on this hypothesis, a variety of data-driven methods have been employed to decompose tumor expression profiles into component profiles that are hypothetically linked to these events. Teschendorff et al. [8] have used Independent Component Analysis (ICA) and Principal Component Analysis (PCA) to decompose gene expression data from breast cancer samples. These methods are purely data-driven and have the disadvantage that they do not employ any prior knowledge. For this type of decomposition, the relation between the components is pre-defined, e.g. they are required to be orthogonal or independent. The number of components is typically chosen based on the cumulative amount of variance explained by a set of components, a choice that is largely arbitrary.
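
To make the point about the number of components concrete, the short sketch below picks the smallest number of leading components whose cumulative share of variance reaches a threshold. The function name `components_needed`, the per-component variances, and the thresholds are hypothetical values chosen for illustration, not figures from the paper or from [8].

```python
def components_needed(variances, threshold):
    """Smallest number of leading components whose cumulative share of the
    total variance reaches `threshold` (a fraction between 0 and 1)."""
    total = sum(variances)
    cumulative = 0.0
    for k, v in enumerate(sorted(variances, reverse=True), start=1):
        cumulative += v
        if cumulative / total >= threshold:
            return k
    return len(variances)


# Hypothetical per-component variances (e.g. PCA eigenvalues); made up.
variances = [42.0, 24.0, 13.0, 9.0, 6.0, 4.0, 2.0]

for threshold in (0.80, 0.90, 0.95):
    print(threshold, components_needed(variances, threshold))
```

For these made-up variances, thresholds of 0.80, 0.90 and 0.95 select 4, 5 and 6 components respectively, so the resulting decomposition depends directly on a cut-off that the data do not dictate.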

Methods

Results

Conclusion