Abstract
PURPOSE: The analysis of cancer biology data involves extremely heterogeneous data sets, including information from RNA sequencing, genome-wide copy number, DNA methylation data reporting on epigenetic regulation, somatic mutations from whole-exome or whole-genome analyses, pathology estimates from imaging sections or subtyping, drug response or other treatment outcomes, and various other clinical and phenotypic measurements. Bringing these different resources into a common framework, with a data model that allows for complex relationships as well as dense vectors of features, will unlock integrated data set analysis.
METHODS: We introduce the BioMedical Evidence Graph (BMEG), a graph database and query engine for discovery and analysis of cancer biology. The BMEG differs from other biologic data graphs in that sample-level molecular and clinical information is connected to reference knowledge bases. It combines gene expression and mutation data with drug-response experiments, pathway information databases, and literature-derived associations.
RESULTS: The construction of the BMEG has resulted in a graph containing more than 41 million vertices and 57 million edges. The BMEG system provides a graph query-based application programming interface to enable analysis, with client code available for Python, JavaScript, and R, and a server online at bmeg.io. Using this system, we have demonstrated several forms of cross-data set analysis to show the utility of the system.
CONCLUSION: The BMEG is an evolving resource dedicated to enabling integrative analysis. We have demonstrated queries on the system that illustrate mutation significance analysis, drug-response machine learning, patient-level knowledge-base queries, and pathway-level analysis. We have compared the resulting graph to other available integrated graph systems and demonstrated that the BMEG is unique in the scale of the graph and the type of data it makes available.
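To make the idea of linking sample-level data to reference knowledge concrete, the following is a minimal sketch of a property graph traversal in plain Python. It does not use the BMEG client, API, or schema: the `TinyGraph` class, the vertex labels, the edge labels (`mutated_in`, `response`), and all identifiers and values are hypothetical, chosen only to illustrate how a query can walk from a gene to samples carrying a mutation and on to their drug-response records.

```python
from collections import defaultdict


class TinyGraph:
    """Hypothetical in-memory property graph; NOT the BMEG client or schema."""

    def __init__(self):
        self.vertices = {}                  # vertex id -> (label, properties)
        self.out_edges = defaultdict(list)  # vertex id -> [(edge label, target id)]

    def add_vertex(self, vid, label, **props):
        self.vertices[vid] = (label, props)

    def add_edge(self, src, label, dst):
        self.out_edges[src].append((label, dst))

    def out(self, vids, label):
        """Follow outgoing edges with the given label from each vertex id."""
        return [dst for v in vids for (lbl, dst) in self.out_edges[v] if lbl == label]


# Toy data: every identifier and value below is invented for illustration.
g = TinyGraph()
g.add_vertex("gene:G1", "Gene", symbol="G1")
g.add_vertex("sample:S1", "Sample")
g.add_vertex("sample:S2", "Sample")
g.add_vertex("resp:R1", "DrugResponse", drug="drugA", auc=0.42)
g.add_vertex("resp:R2", "DrugResponse", drug="drugA", auc=0.87)
g.add_edge("gene:G1", "mutated_in", "sample:S1")
g.add_edge("sample:S1", "response", "resp:R1")
g.add_edge("sample:S2", "response", "resp:R2")


def responses_for_mutated_gene(graph, gene_id):
    """Walk gene -> mutated samples -> drug-response records."""
    samples = graph.out([gene_id], "mutated_in")
    response_ids = graph.out(samples, "response")
    return [graph.vertices[r][1] for r in response_ids]


result = responses_for_mutated_gene(g, "gene:G1")
print(result)
```

Each hop in the traversal corresponds to following one edge type, which is the general pattern a graph query language expresses; a production system would add indexing, filtering, and aggregation on top of this.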
Highlights
Biological data produced by large-scale projects routinely reaches petabyte levels thanks to major advances in sequencing and imaging
The analysis of cancer biology data involves extremely heterogeneous datasets. Bringing these different resources into a common framework, with a data model that allows for complex relationships as well as dense vectors of features, will unlock integrative analysis
We introduce a graph database and query engine for discovery and analysis of cancer biology, called the BioMedical Evidence Graph (BMEG)
Summary
Biological data produced by large-scale projects routinely reaches petabyte levels thanks to major advances in sequencing and imaging. This exponential growth in size is well documented and is being addressed by multiple big-data initiatives. The sheer volume of heterogeneous data makes it difficult to normalize and integrate the data, as well as to perform integrative analysis across disparate experiments. Faced with these challenges and with substantial labor and computation costs, researchers may use only a fraction of the publicly available data for their analysis and may not update their data or analysis as new data become available.