Abstract

The Gene Expression Omnibus (GEO) database is an excellent public source of whole-transcriptome profiles of multiple cancers. The main challenge is the limited accessibility of such large-scale genomic data to people without a background in bioinformatics or computer science, which creates difficulties in data analysis, sharing and visualization. Here, we present an integrated bioinformatics pipeline and a normalized dataset that has been preprocessed using a robust statistical methodology, allowing others to perform large-scale meta-analysis without having to conduct time-consuming data mining and statistical correction. Comprising 1,118 patient-derived samples, the normalized dataset includes primary non-small cell lung cancer (NSCLC) tumors and paired normal lung tissues from ten independent GEO datasets, facilitating differential expression analysis. The data have been merged, normalized, batch effect-corrected and filtered for genes with low variance via multiple open-source R packages integrated into our workflow. Overall, this dataset (with associated clinical metadata) better represents the diseased population and serves as a powerful tool for early predictive biomarker discovery.
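The preprocessing steps named in the abstract (merging independent series on shared genes, then filtering low-variance genes) can be sketched conceptually. The snippet below is an illustration only, with made-up gene and sample identifiers and an arbitrary variance cutoff; the actual pipeline uses R packages such as fRMA, not this Python code.

```python
# Conceptual sketch (NOT the authors' R pipeline): merge two hypothetical
# expression matrices on their shared genes, then drop low-variance genes.
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Two hypothetical GEO series: rows = genes, columns = samples.
gse_a = pd.DataFrame(rng.normal(8, 1, (5, 3)),
                     index=["G1", "G2", "G3", "G4", "G5"],
                     columns=["A1", "A2", "A3"])
gse_b = pd.DataFrame(rng.normal(8, 1, (4, 2)),
                     index=["G1", "G2", "G3", "G4"],
                     columns=["B1", "B2"])

# Merge on the intersection of gene identifiers across series.
merged = gse_a.join(gse_b, how="inner")

# Make one gene nearly uninformative to show the variance filter working.
merged.loc["G4"] = 8.0

# Keep only genes whose variance across samples exceeds an arbitrary cutoff.
variances = merged.var(axis=1)
filtered = merged[variances > 0.1]
print(filtered.index.tolist())
```

The inner join keeps only genes measured on every platform, which is why merging heterogeneous series typically shrinks the gene set before filtering even begins.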

Highlights

  • The big data boom heralds a new era of precision medicine – access to large pools of ‘omics’ data has driven breakthroughs in this emerging field

  • The Gene Expression Omnibus (GEO) database at the National Center for Biotechnology Information (NCBI) was launched in 2000 to support public use of such genomic resources provided by the scientific communities[3,4]

  • The frozen Robust Multiarray Analysis (fRMA) method was chosen in this study for its use in the InSilico DB package[9] implemented in our developed framework

Background & Summary

The big data boom heralds a new era of precision medicine – access to large pools of ‘omics’ data has driven breakthroughs in this emerging field. To date, 94,577 series probed with 18,138 platforms, representing over 2 million samples, have been submitted to the GEO database. The challenge with these vast datasets is that exploring such a breadth of data is not straightforward – from effectively querying the correct dataset to utilizing the right pipelines for extracting true significance from such high-dimensional data. Merging multiple genomic datasets into a single matrix for large-scale meta-analysis introduces another source of variation termed the batch effect. Such bias arises as a consequence of systematic technical or non-biological differences between independent laboratories[10]. RNA-Seq data from The Cancer Genome Atlas (TCGA) were used in the present pipeline for multi-platform assessment and validation of differentially expressed genes. This normalized dataset serves as an excellent large-scale ‘discovery cohort’ for the identification of clinically relevant NSCLC biomarkers.
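To make the batch effect concrete: samples processed in different laboratories can share a systematic shift that has nothing to do with biology. The toy example below uses a much simpler per-batch mean-centering adjustment than the model-based corrections (e.g., fRMA) used in the real pipeline; all values and batch labels are invented for illustration.

```python
# Illustrative sketch only: a mean-centering batch adjustment, a simplified
# stand-in for the model-based corrections used in real expression pipelines.
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)

# One gene measured across two hypothetical batches; batch B is shifted up,
# mimicking a systematic technical difference between laboratories.
expr = pd.Series(np.concatenate([rng.normal(8, 0.5, 4),
                                 rng.normal(10, 0.5, 4)]))
batch = pd.Series(["A"] * 4 + ["B"] * 4)

# Subtract each batch's own mean, then add back the grand mean so the
# adjusted values stay on the original expression scale.
grand_mean = expr.mean()
adjusted = expr - expr.groupby(batch).transform("mean") + grand_mean

# After adjustment, the two batch means coincide at the grand mean.
print(adjusted.groupby(batch).mean().round(3))
```

Mean-centering removes additive batch shifts but not batch-specific variance; that is why dedicated methods that also model variance are preferred in practice.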

