OLOGRAM-MODL: mining enriched n-wise combinations of genomic features with Monte Carlo and dictionary learning.

Quentin Ferré, Cécile Capponi, Denis Puthier

doi:10.1093/nargab/lqab114

Abstract

Most epigenetic marks, such as transcriptional regulators or histone marks, are biological objects known to work together in n-wise complexes. A suitable way to infer such functional associations between them is to study the overlaps of the corresponding genomic regions. However, the statistical significance of n-wise overlaps of genomic features is seldom addressed, which prevents rigorous studies of n-wise interactions. We introduce OLOGRAM-MODL, which considers overlaps between n ≥ 2 sets of genomic regions and computes their statistical mutual enrichment by Monte Carlo fitting of a Negative Binomial distribution, resulting in P-values with better resolution. An optional machine learning method is proposed to find complexes of interest, using a new itemset mining algorithm based on dictionary learning, which is resistant to the noise inherent to biological assays. The overall approach is implemented through an easy-to-use command line interface for workflow integration, and a visual tree-based representation of the results suited for explicability. The viability of the method is experimentally studied using both artificial and biological data. This approach is accessible through the command line interface of the pygtftk toolkit, available on Bioconda and from https://github.com/dputhier/pygtftk

Highlights

Modern genomic analysis methods can localize many different types of genomic features, such as histone modifications, transcriptional regulator binding sites or gene promoters
The Multiple Overlap Dictionary Learning (MODL) algorithm can be used to restrict the combinations for which enrichment is calculated to combinations of interest
The first goal of the experiments presented is to validate our statistical model and our itemset mining algorithm

Summary

Introduction

Modern genomic analysis methods can localize many different types of genomic features, such as histone modifications, transcriptional regulator binding sites or gene promoters. A typical approach is to represent such features as regions, or intervals (as ‘Browser Extensible Data’ or BED files), and to look for significant co-localization by testing the amount of overlap between them against the null hypothesis (H0) that they overlap no more than expected by chance. This is especially important since co-localization of genomic elements is often associated with functional association (1). Pairwise overlaps between two sets can be analyzed with methods such as GenometriCorr, BEDTOOLS fisher (2), GREAT and Genomic HyperBrowser (3), most of which are available in the coloc-stats interface (4). Those methods are usually based on shuffles or on a statistical model. Pairwise overlaps are sometimes used to build association networks (8), but this can be misleading: an association of a regulator A with B and of B with C does not necessarily mean that A and C will be found in the same complex in real conditions.
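To make the idea of testing an overlap against H0 concrete, the following is a minimal sketch of a shuffle-based test combined with a Negative Binomial fit, in the spirit of the approach named in the abstract. It is not the OLOGRAM implementation: the uniform re-placement of regions, the method-of-moments fit, the toy region sets and all function names are assumptions made purely for illustration.

```python
import math
import random

def count_overlaps(a, b):
    """Count (region in a, region in b) pairs sharing at least one base.
    Regions are half-open (start, end) tuples, as in BED files."""
    return sum(1 for (s1, e1) in a for (s2, e2) in b if s1 < e2 and s2 < e1)

def shuffle_regions(regions, chrom_len, rng):
    """Assumed null model: re-place each region uniformly, keeping its length."""
    out = []
    for s, e in regions:
        length = e - s
        new_s = rng.randrange(0, chrom_len - length + 1)
        out.append((new_s, new_s + length))
    return out

def fit_negative_binomial(samples):
    """Method-of-moments fit. Returns (r, p) such that mean = r * (1 - p) / p."""
    n = len(samples)
    mean = sum(samples) / n
    var = sum((x - mean) ** 2 for x in samples) / (n - 1)
    if mean <= 0:
        raise ValueError("no overlap observed in any shuffle")
    # A Negative Binomial needs var > mean; if the shuffles are not
    # over-dispersed, this clamp makes the fit degenerate toward a Poisson.
    var = max(var, mean * (1 + 1e-6))
    return mean * mean / (var - mean), mean / var

def nb_log_pmf(k, r, p):
    return (math.lgamma(k + r) - math.lgamma(r) - math.lgamma(k + 1)
            + r * math.log(p) + k * math.log1p(-p))

def nb_upper_tail(k_obs, r, p, max_terms=100_000):
    """P(X >= k_obs), summed directly so that small tails keep their resolution."""
    mean = r * (1 - p) / p
    total = 0.0
    for k in range(k_obs, k_obs + max_terms):
        term = math.exp(nb_log_pmf(k, r, p))
        total += term
        if k > mean and term <= total * 1e-16:
            break
    return min(total, 1.0)

rng = random.Random(0)
chrom_len = 100_000
# Hypothetical toy data: set B is built to co-localize with half of set A.
a = [(i * 1000, i * 1000 + 200) for i in range(100)]
b = [(s + 50, e + 50) for (s, e) in a[:50]]

observed = count_overlaps(a, b)
null = [count_overlaps(a, shuffle_regions(b, chrom_len, rng)) for _ in range(200)]
r, p = fit_negative_binomial(null)
p_value = nb_upper_tail(observed, r, p)
print(observed, sum(null) / len(null), p_value)
```

With only 200 shuffles, an empirical P-value could not go below roughly 1/200, whereas the tail of the fitted distribution can. This is one way to read the abstract's claim about P-value resolution, under the assumption that the fitted model describes the null adequately.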

Methods

Results

Discussion

Conclusion