Abstract

e21016 Background: Advances in high-throughput measurement technologies (-omics data) have made it possible to generate high-complexity, high-volume data for oncology research. Researchers are often confronted with many more measurements than samples (p >>> n), which poses challenges both for modeling the complexity of disease at the level of molecular mechanism and for avoiding overfitting when generating predictive models from complex data. Here, we applied a prior knowledge-driven approach to characterize and classify heavy versus light smokers with lung cancer from The Cancer Genome Atlas (TCGA), an open-source repository that catalogs, harmonizes, and hosts -omics data collected from samples generously donated by cancer patients. Methods: We applied a reverse inferencing approach to systematically interrogate RNAseq measurements from tumor and control biopsies against a knowledgebase of directed gene networks curated from published experiments. If the patterns observed in the data are significantly similar to those in a network, an inference can be made about the directional activity of that network, e.g., increased transcriptional activity of NFKB. Our library was nucleated from an open-source knowledge graph and enhanced with updated and relevant knowledge using the open-source Biological Expression Language (BEL) framework. Directed networks were either qualitatively scored and used to build disease models, or semi-quantitatively scored and used as classification features. Results: In lung adenocarcinoma (LUAD) tumors, we detected a pattern of gene signatures indicating a tumor stem cell-like phenotype, characterized by predicted decreases in the activity of pro-differentiation factors and an increased response to hypoxia. Analysis of patients with heavy (> 40) versus light (< 10) pack-year burden suggested an augmented dedifferentiation profile in heavy smokers.
In this example, classification improved when features were compressed through directed network scoring, compared with using individual RNA measurements selected by filtration methods. Conclusions: In silico analysis of lung cancer patient biopsies generated hypotheses implicating stem cell signaling in tumors, and a further stratification of this signal by patient pack-year burden. Mechanistic modeling may be a useful approach to the overfitting problem often encountered with -omics data in translational studies. Data from other TCGA indications can be used to evaluate the consistency of this type of approach.
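
The directed network scoring described in the Methods can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes a simple sign-concordance scheme in which each network lists the expected direction of change for its downstream genes, and an exact two-sided binomial test (null probability 0.5) judges whether the observed directions agree with the network more often than chance. The function names, the input layout, and the gene labels are hypothetical.

```python
from math import comb


def binomial_two_sided_p(k, n):
    """Exact two-sided binomial p-value for k successes in n trials at p = 0.5."""
    if n == 0:
        return 1.0
    tail = min(k, n - k)
    # The null distribution is symmetric, so double the smaller tail and cap at 1.
    p = 2 * sum(comb(n, i) for i in range(tail + 1)) / 2 ** n
    return min(p, 1.0)


def score_network(expected, observed):
    """Score one directed network against observed expression changes.

    expected: gene -> +1 / -1, the direction predicted if the network's
              upstream regulator is MORE active (hypothetical layout).
    observed: gene -> +1 / -1 / 0, the measured direction of differential
              expression (0 or missing means no significant change).
    """
    concordant = discordant = 0
    for gene, sign in expected.items():
        obs = observed.get(gene, 0)
        if obs == 0:
            continue  # unchanged or unmeasured genes carry no evidence
        if obs == sign:
            concordant += 1
        else:
            discordant += 1

    n = concordant + discordant
    # Semi-quantitative score in [-1, 1]; its sign gives the inferred direction.
    score = (concordant - discordant) / n if n else 0.0
    if score > 0:
        direction = "increased"
    elif score < 0:
        direction = "decreased"
    else:
        direction = "undetermined"
    return {
        "score": score,
        "direction": direction,
        "p_value": binomial_two_sided_p(concordant, n),
        "n": n,
    }


# Toy example with placeholder gene names (not real data).
network = {"GENE_A": 1, "GENE_B": 1, "GENE_C": -1, "GENE_D": -1, "GENE_E": 1}
sample = {"GENE_A": 1, "GENE_B": 1, "GENE_C": -1, "GENE_D": -1, "GENE_E": 1}
result = score_network(network, sample)
```

Under this assumed scheme, each network contributes one score per sample, so a classifier would see one feature per network rather than one per transcript. That is the sense in which network scoring compresses the feature space when p >>> n.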
