Abstract Making data and materials used in scientific research available to others is not only a prerequisite for enabling reproducible science and maximizing the return on funding, it is also the very basis of scientific progress, allowing other scientists to build on the previous work of their colleagues (1). In the life sciences this principle has been well established since the 1980s, when scientific journals started requiring that X-ray crystallography and DNA sequence data supporting publications be deposited in appropriate databases. In 1996, as part of the Human Genome Project, it was agreed that all sequence data would be released in publicly accessible databases within twenty-four hours of generation, an agreement known as the Bermuda Principles (2). With the advent of new types of data, such as microarray data, it was soon recognized that not only the sequence or microarray data themselves are important, but also the standards for representing these data and the information about the samples and experiments (3). However, to enable data sharing, a properly built and funded infrastructure is needed (4). Without public data resources, such as Ensembl, UniProt, and Expression Atlas, that add value to molecular data archives, modern life sciences research would be hard to imagine (5-7). The data-sharing mentality is now so firmly embedded in the ethos of the life sciences that scientists working in the field struggle to imagine that the mentality may be different in other scientific disciplines. Data sharing in medical research is a more complex and difficult problem for multiple reasons, including the need for data security and patient confidentiality, the complexity of representing health records, and the diversity of national legislation. However, there is also an increasing realization that sharing biomedical research data, which is now facilitated by the use of electronic health records, can be an important accelerator of biomedical research (7).
Cancer researchers are at the forefront of the data-sharing approach in biomedical research. The International Cancer Genome Consortium is completing the sequencing of genomes, transcriptomes, and epigenomes of over 20,000 patients, making all the data, alongside the essential clinical information, available to researchers (8). About 10% of these genomes and transcriptomes have been reanalyzed in a standardized way by the Pan-Cancer Analysis of Whole Genomes (PCAWG) group. The PCAWG project provides an important demonstration of how data integration can accelerate biomedical research. In this talk I will concentrate particularly on lessons learned from the integration of genome, transcriptome, and clinical data in this project. Although some of the PCAWG transcriptome analysis is still being finalized at the time of writing this abstract, it is already clear that integrating heterogeneous types of data from heterogeneous cancer types provides new insights into this disease, and also that such integrative analysis is challenging (9-13). I will also describe some of the experience gained in building infrastructure for data integration and sharing at the European Bioinformatics Institute (EMBL-EBI) and some of the data resources provided by EMBL-EBI (14,15), concentrating particularly on cancer genomics data and the benefits that data sharing brings to cancer research.