GTRD: a database on gene transcription regulation-2019 update.

Ivan Yevshin, Yury Kondrakhin, Semyon Kolmykov, Ruslan Sharipov, Fedor Kolpakov

doi:10.1093/nar/gky1128

Abstract

The current version of the Gene Transcription Regulation Database (GTRD; http://gtrd.biouml.org) contains information about: (i) transcription factor binding sites (TFBSs) and transcription coactivators identified by ChIP-seq experiments for Homo sapiens, Mus musculus, Rattus norvegicus, Danio rerio, Caenorhabditis elegans, Drosophila melanogaster, Saccharomyces cerevisiae, Schizosaccharomyces pombe and Arabidopsis thaliana; (ii) regions of open chromatin and TFBSs (DNase footprints) identified by DNase-seq; (iii) unmappable regions where TFBSs cannot be identified due to repeats; (iv) potential TFBSs for both human and mouse predicted using position weight matrices from the HOCOMOCO database. Raw ChIP-seq and DNase-seq data were obtained from ENCODE and SRA and uniformly processed. ChIP-seq peaks were called using four different methods: MACS, SISSRs, GEM and PICS. Peaks for the same factor and peak calling method, but obtained under different experimental conditions (cell line, treatment, etc.), were merged into clusters. To reduce noise, the clusters produced by different peak calling methods were then merged into meta-clusters; these were considered to be non-redundant TFBS sets. In addition, extended quality control was applied to all ChIP-seq data. The web interface to GTRD was developed using the BioUML platform. It supports browsing and displaying information, advanced search and an integrated genome browser.
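The cluster and meta-cluster construction described in the abstract can be illustrated with a minimal sketch. It assumes that "merging" means taking the union of overlapping intervals on the same chromosome; the exact merging rule used by GTRD is not stated here, and the names `Peak` and `merge_overlapping` are hypothetical.

```python
from typing import List, Tuple

# Hypothetical peak representation: (chromosome, start, end), half-open interval.
Peak = Tuple[str, int, int]


def merge_overlapping(peaks: List[Peak]) -> List[Peak]:
    """Merge peaks that overlap on the same chromosome into single intervals.

    Assumption: overlap-based union; this is an illustration, not GTRD's rule.
    """
    merged: List[Peak] = []
    for chrom, start, end in sorted(peaks):
        if merged and merged[-1][0] == chrom and start < merged[-1][2]:
            # Overlaps the previous interval: extend it.
            last_chrom, last_start, last_end = merged[-1]
            merged[-1] = (last_chrom, last_start, max(last_end, end))
        else:
            merged.append((chrom, start, end))
    return merged


# Step 1: peaks from one caller, pooled across experiments, become clusters.
macs_clusters = merge_overlapping([("chr1", 100, 200), ("chr1", 150, 250)])
gem_clusters = merge_overlapping([("chr1", 240, 300), ("chr2", 10, 50)])

# Step 2: clusters from different callers become meta-clusters.
meta_clusters = merge_overlapping(macs_clusters + gem_clusters)
```

Reusing the same function for both steps keeps the sketch short; a real pipeline would likely also track which experiments and callers support each interval.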

Highlights

Metrics such as NRF, PBC1, PBC2, NSC and RSC measure the quality of the alignment of reads to individual genomes.
To estimate the quality of peak caller output, the fraction of reads falling within the called peaks is analysed and metrics such as FRiP and IDR are determined. However, these metrics do not allow the researcher to control the number of false positive and false negative peaks generated by different peak callers.
We proposed two quality control metrics, FPCM (False Positive Control Metric) and FNCM (False Negative Control Metric). Both are based on well-known capture-recapture approaches, which are commonly used, for example, in ecology to estimate the abundance of individuals of a particular species or the total number of species present in a given area.
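The capture-recapture idea behind these metrics can be illustrated with the classical two-sample estimators from ecology. The sketch below shows the Lincoln-Petersen and Chapman estimators only; it is not the definition of FPCM or FNCM, which is not given in this section. Treating the peak sets of two peak callers as the two "captures" is an assumption made purely for illustration.

```python
def lincoln_petersen(n1: int, n2: int, m: int) -> float:
    """Lincoln-Petersen estimate of total population size.

    n1: items found in the first sample (e.g. sites called by caller A)
    n2: items found in the second sample (e.g. sites called by caller B)
    m:  items found in both samples (sites called by both)
    """
    if m <= 0:
        raise ValueError("at least one recaptured item is required")
    return n1 * n2 / m


def chapman(n1: int, n2: int, m: int) -> float:
    """Chapman's bias-corrected variant; defined even when m == 0."""
    return (n1 + 1) * (n2 + 1) / (m + 1) - 1


# Hypothetical counts: 100 sites from caller A, 80 from caller B, 20 shared.
estimated_total = lincoln_petersen(100, 80, 20)  # 400.0
```

With an estimate of the total number of sites, one can reason about how many sites each caller missed, which is the kind of question a false-negative control metric addresses.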

Summary

Introduction

[Figure: processing stages: ChIP-seq reads, aligned reads, ChIP-seq peaks, clusters, meta-clusters]

The common practice to assess the quality of ChIP-seq datasets is to apply well-known quality metrics developed within the ENCODE project.
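As an illustration of such metrics, the sketch below computes the three library-complexity measures named in the Highlights (NRF, PBC1, PBC2) following their commonly cited ENCODE definitions. The input format, a list of positions of uniquely mapped reads, and the function name `library_complexity` are assumptions for this example; real pipelines derive these counts from alignment files.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple

# Hypothetical read key: (chromosome, 5' position, strand) of a uniquely mapped read.
Position = Tuple[str, int, str]


def library_complexity(positions: Iterable[Position]) -> Dict[str, float]:
    """Compute NRF, PBC1 and PBC2 from positions of uniquely mapped reads.

    NRF  = distinct positions / total reads
    PBC1 = positions with exactly one read / distinct positions
    PBC2 = positions with exactly one read / positions with exactly two reads
    """
    counts = Counter(positions)
    total = sum(counts.values())
    distinct = len(counts)
    m1 = sum(1 for c in counts.values() if c == 1)
    m2 = sum(1 for c in counts.values() if c == 2)
    return {
        "NRF": distinct / total if total else float("nan"),
        "PBC1": m1 / distinct if distinct else float("nan"),
        "PBC2": m1 / m2 if m2 else float("inf"),
    }


reads = [
    ("chr1", 100, "+"), ("chr1", 100, "+"),  # duplicated position
    ("chr1", 250, "-"),
    ("chr2", 40, "+"),
    ("chr2", 90, "-"), ("chr2", 90, "-"),    # duplicated position
]
metrics = library_complexity(reads)
```

Values close to 1 for NRF and PBC1 indicate that few reads are duplicates of the same position, that is, a complex library.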

Results

Conclusion