The emergence of high-throughput technologies that produce gigabyte-scale omics datasets has led to revolutionary changes in molecular biology and functional genomics. Despite the incorporation of increasingly quantitative technologies, the field suffers from serious reproducibility problems. Some causes have been identified: they include poor quality management; competition for publications, funding and jobs; and problems in the experimental and statistical design of assays. The consequences include, among others, delays in the implementation of efficient and specific anti-cancer treatments, the unnecessary duplication or validation of improperly conducted studies, and the waste of public funding. Here we wish to discuss another cause of poor reproducibility, which will become increasingly important with the advent of personalized medicine: the generation of poor-quality datasets from Next Generation Sequencing (NGS) technologies, specifically those that involve enrichment assays such as ChIP-sequencing. NGS-derived applications are becoming increasingly popular, a trend supported by decreasing sequencing costs, the rapid development of novel sequencing-based technologies, and the power of genome-wide data interpretation by functional genomics and systems biology approaches. However, the complexity and sensitivity of these technologies bear the risk of introducing various types of bias. It is therefore rather surprising that only very few quality indicators have been developed to date. The public availability of omics data in large repositories, such as GEO, is no doubt an enormously valuable resource. However, by working extensively with such datasets, we realized that the lack of universal quality control indicators in publications and data repositories seriously limits the use of existing data and can contribute to irreproducibility issues.
Here we provide examples that illustrate the problems generated by the use of poor-quality datasets and propose solutions that would ultimately enhance reproducibility and encourage scientists to use existing datasets in the design and interpretation of their own research projects. Our goal is to increase awareness in the scientific community of the need to link quality assessment to datasets, and to initiate a discussion on the quality control of big data.