Population size estimation for quality control of ChIP-Seq datasets.

Semyon K Kolmykov, Fedor A Kolpakov, Yury V Kondrakhin, Ivan S Yevshin, Ruslan N Sharipov, Anna S Ryabova, Li Chen

doi:10.1371/journal.pone.0221760

Abstract

Chromatin immunoprecipitation followed by sequencing (ChIP-Seq) is a widely used experimental technology for identifying functional protein-DNA interactions. Databases such as ENCODE, GTRD, ChIP-Atlas and ReMap systematically collect and annotate large numbers of ChIP-Seq datasets. Comprehensive quality control of these datasets is indispensable for selecting the most reliable data for further analysis. In addition to existing quality control metrics, we have developed two novel metrics that make it possible to control false positives and false negatives in ChIP-Seq datasets. For this purpose, we adapted a well-known population size estimate to determine the unknown number of genuine transcription factor binding regions. The proposed metrics are determined by overlapping the distinct binding sites obtained by processing one ChIP-Seq experiment with different peak callers. The metrics can also be useful for assessing the quality of datasets obtained by processing distinct ChIP-Seq experiments with a given peak caller. We also show that these metrics appear to be useful not only for dataset selection but also for comparing peak callers and identifying site motifs based on ChIP-Seq datasets. The algorithm developed for determining the false positive control metric and the false negative control metric for ChIP-Seq datasets was implemented as a plugin for the BioUML platform: https://ict.biouml.org/bioumlweb/chipseq_analysis.html.
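The idea of estimating an unknown number of genuine binding regions from the overlap of two peak sets can be illustrated with a capture-recapture calculation. The sketch below is only an illustration under stated assumptions: this section does not say which population size estimator was adapted, so Chapman's bias-corrected form of the Lincoln-Petersen estimator is used as a stand-in, and the function names, the interval representation and the toy coordinates are all hypothetical. It does not compute the false positive or false negative control metrics themselves.

```python
from typing import List, Tuple

# Hypothetical representation: half-open [start, end) intervals on one chromosome.
Interval = Tuple[int, int]


def count_overlapping(a: List[Interval], b: List[Interval]) -> int:
    """Count intervals in `a` that overlap at least one interval in `b`.

    Quadratic scan, adequate for a toy example only.
    """
    return sum(
        any(s1 < e2 and s2 < e1 for s2, e2 in b)
        for s1, e1 in a
    )


def chapman_estimate(n1: int, n2: int, m: int) -> float:
    """Chapman's bias-corrected Lincoln-Petersen population size estimate.

    n1, n2: number of regions reported by each peak caller
    m:      number of regions reported by both (the "recaptures")
    Assumption: used here for illustration, not taken from the paper.
    """
    return (n1 + 1) * (n2 + 1) / (m + 1) - 1


# Toy peak sets standing in for two peak callers run on one experiment.
caller_a = [(100, 200), (300, 400), (500, 600), (700, 800)]
caller_b = [(150, 250), (350, 450), (900, 1000)]

shared = count_overlapping(caller_a, caller_b)
estimated_total = chapman_estimate(len(caller_a), len(caller_b), shared)
print(f"shared regions: {shared}, estimated total regions: {estimated_total:.2f}")
```

The smaller the overlap relative to the sizes of the two sets, the larger the estimated total, which is what makes the overlap informative about regions that a given caller missed.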

Highlights

Understanding the basic mechanisms of transcription regulation remains a great challenge in modern biology
In this study we developed two novel metrics, the False Positive Control Metric (FPCM) and the False Negative Control Metric (FNCM), which make it possible to control the false positive (FP) and false negative (FN) rates of peak callers when assessing the quality of TF binding region (TFBR) datasets
The absence of an input control resulted in an increase in the FP rate and a decrease in the FN rate of the peak callers MACS, PICS, and SISSRs

Summary

Introduction

Understanding the basic mechanisms of transcription regulation remains a great challenge in modern biology. Regulation of transcription is a complex process in which transcription factors (TFs) play the key role. TFs recognize and bind to their corresponding TF binding sites (TFBSs) in the genome. The in silico recognition of TFBSs in whole genomes remains one of the most complex problems in bioinformatics. Chromatin immunoprecipitation followed by sequencing (ChIP-Seq) is a widely used experimental technology for the identification of TF binding regions (TFBRs) containing TFBSs. Tens of thousands of ChIP-Seq experiments have been conducted, and it is reasonable to assume that this number will increase rapidly year by year.

Objectives

Methods

Results

Conclusion