THiCweed: fast, sensitive detection of sequence features by clustering big datasets.

Ankit Agrawal, Leelavati Narlikar, Snehal V Sambare, Rahul Siddharthan

doi:10.1093/nar/gkx1251

Open Access


Abstract

We present THiCweed, a new approach to analyzing transcription factor binding data from high-throughput chromatin immunoprecipitation-sequencing (ChIP-seq) experiments. THiCweed clusters bound regions using a divisive hierarchical approach based on sequence similarity within sliding windows, while exploring both strands. THiCweed is specially geared toward data containing mixtures of motifs, which present a challenge to traditional motif-finders. Our implementation is significantly faster than standard motif-finding programs, able to process 30 000 peaks in 1–2 h on a single CPU core of a desktop computer. On synthetic data containing mixtures of motifs it is as accurate as or more accurate than all other tested programs. THiCweed performs best with large ‘window’ sizes (≥50 bp), much longer than typical binding sites (7–15 bp). On real data it successfully recovers literature motifs, but also uncovers complex sequence characteristics in flanking DNA, variant motifs and secondary motifs even when they occur in <5% of the input, all of which appear biologically relevant. We also find recurring sequence patterns across diverse ChIP-seq datasets, possibly related to chromatin architecture and looping. THiCweed thus goes beyond traditional motif finding to give new insights into genomic transcription factor-binding complexity.
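The phrase "sliding windows, while exploring both strands" can be made concrete with a small sketch. The code below is only an illustration of how candidate windows might be enumerated for one peak sequence; the function names `reverse_complement` and `candidate_windows` are hypothetical and are not taken from the THiCweed implementation, and the clustering step itself is not shown.

```python
from typing import Iterator, Tuple

# Translation table for complementing DNA (upper- and lower-case).
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def candidate_windows(seq: str, width: int) -> Iterator[Tuple[int, str, str]]:
    """Yield (offset, strand, window) for every window of `width` bp.

    Hypothetical helper: offsets for the "-" strand are measured along
    the reverse-complemented sequence, not the original one.
    """
    rc = reverse_complement(seq)
    for offset in range(len(seq) - width + 1):
        yield offset, "+", seq[offset:offset + width]
        yield offset, "-", rc[offset:offset + width]


if __name__ == "__main__":
    for offset, strand, window in candidate_windows("ACGTT", 4):
        print(offset, strand, window)
```

A clustering procedure could then pick, for each sequence, the offset and strand whose window best matches a cluster; that selection rule is left open here because the abstract does not specify it.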

Highlights

Significance criteria are based on the likelihood that the windows in a cluster are all sampled from the same position weight matrix (PWM).
When a cluster C is split into two clusters C1 and C2, a measure of the quality of the split is the log-likelihood ratio log(P(C1)P(C2)/P(C)) of the sequences being sampled from different PWMs versus their being sampled from the same PWM.
A split may instead be possible on a pair of positions j and k: after the split, the nucleotides at these positions are always A and T in one cluster and C and G in the other.

Summary

Introduction

Significance criteria. The likelihood P(C) that the windows in a cluster C are all sampled from the same position weight matrix (PWM) is computed from the following quantities: W is the length of the window, α is one of the nucleotides A, C, G or T, n_iα is the number of occurrences of nucleotide α at position i in the cluster, and c is a pseudocount (0.5 here). When C is split into two clusters C1 and C2, a measure of the quality of the split is the log-likelihood ratio log(P(C1)P(C2)/P(C)) of the sequences being sampled from different PWMs versus their being sampled from the same PWM.
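The split criterion can be illustrated numerically. The formula for P(C) is not reproduced in this text, so the sketch below assumes the standard Dirichlet-multinomial marginal likelihood with a symmetric pseudocount c, i.e. P(C) = ∏_i [Γ(4c)/Γ(N+4c)] ∏_α [Γ(n_iα+c)/Γ(c)], where N is the number of windows in the cluster. That form is consistent with the symbols defined above but is an assumption, and the function names `log_likelihood` and `split_score` are hypothetical.

```python
from math import lgamma
from typing import Sequence

ALPHABET = "ACGT"


def log_likelihood(windows: Sequence[str], c: float = 0.5) -> float:
    """log P(C) for equal-length windows, one independent column per position.

    ASSUMPTION: Dirichlet-multinomial marginal likelihood with symmetric
    pseudocount c; the source text does not show the exact formula.
    """
    if not windows:
        return 0.0
    n = len(windows)          # number of windows in the cluster (N)
    k = len(ALPHABET)         # four nucleotides
    total = 0.0
    for i in range(len(windows[0])):
        column = [w[i] for w in windows]
        total += lgamma(k * c) - lgamma(n + k * c)
        for alpha in ALPHABET:
            # column.count(alpha) plays the role of n_i_alpha
            total += lgamma(column.count(alpha) + c) - lgamma(c)
    return total


def split_score(c1: Sequence[str], c2: Sequence[str], c: float = 0.5) -> float:
    """Log-likelihood ratio log(P(C1) P(C2) / P(C)) for splitting C = C1 + C2."""
    merged = list(c1) + list(c2)
    return log_likelihood(c1, c) + log_likelihood(c2, c) - log_likelihood(merged, c)


if __name__ == "__main__":
    at, cg = ["AT"] * 10, ["CG"] * 10
    print("distinct clusters:", split_score(at, cg))
    print("identical clusters:", split_score(at, at))
```

Under this assumed likelihood, splitting windows that always read AT from windows that always read CG gives a positive score, while splitting a homogeneous cluster into two identical halves gives a negative one, which matches the intended reading of the ratio as evidence for different PWMs versus a single PWM.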
