Abstract
Transcription factors (TFs) are proteins that promote or reduce the expression of genes by binding short genomic DNA sequences known as transcription factor binding sites (TFBS). While several tools have been developed to scan for potential occurrences of TFBS in linear DNA sequences or reference genomes, no tool exists to find them in pangenome variation graphs (VGs). VGs are sequence-labelled graphs that can efficiently encode collections of genomes and their variants in a single, compact data structure. Because VGs can losslessly compress large pangenomes, TFBS scanning in VGs can efficiently capture how genomic variation affects the potential binding landscape of TFs in a population of individuals. Here we present GRAFIMO (GRAph-based Finding of Individual Motif Occurrences), a command-line tool for the scanning of known TF DNA motifs represented as Position Weight Matrices (PWMs) in VGs. GRAFIMO extends the standard PWM scanning procedure by considering variations and alternative haplotypes encoded in a VG. Using GRAFIMO on a VG based on individuals from the 1000 Genomes Project, we recover several potential binding sites that are enhanced, weakened or missed when scanning only the reference genome, and which could constitute individual-specific binding events. GRAFIMO is available as an open-source tool, under the MIT license, at https://github.com/pinellolab/GRAFIMO and https://github.com/InfOmics/GRAFIMO.
Highlights
Transcription factors (TFs) are fundamental proteins that regulate transcriptional processes
To search for potential transcription factor binding sites (TFBS), GRAFIMO slides a window of length k along the paths of the variation graphs (VGs) corresponding to the genomic sequences encoded in it (Fig 1B)
We show that several potential and private TFBS are found in individual haplotype sequences and that genomic variants significantly affect the binding affinity of several motif occurrence candidates found in the reference genome sequence
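The sliding-window idea described in the highlights can be illustrated with a minimal Python sketch. This is not GRAFIMO's implementation: the count matrix, the uniform background, the pseudocount, the two path sequences, and the function names (`log_odds_pwm`, `scan`, `scan_paths`) are all hypothetical, and the graph is simplified to a dictionary of already-extracted path sequences. The sketch only shows how a window of length k scored against a log-odds PWM can reveal a candidate site that is present on one haplotype but absent from the reference.

```python
import math

# Hypothetical toy count matrix for a motif of width k = 4 (not a real TF motif).
COUNTS = {
    "A": [8, 0, 0, 1],
    "C": [0, 9, 0, 1],
    "G": [1, 0, 9, 0],
    "T": [0, 0, 0, 7],
}


def log_odds_pwm(counts, background=0.25, pseudocount=0.5):
    """Turn per-position base counts into log2-odds scores against a uniform background."""
    k = len(next(iter(counts.values())))
    pwm = []
    for i in range(k):
        total = sum(counts[b][i] for b in "ACGT") + 4 * pseudocount
        pwm.append({
            b: math.log2(((counts[b][i] + pseudocount) / total) / background)
            for b in "ACGT"
        })
    return pwm


def scan(sequence, pwm):
    """Slide a window of length k along one sequence and score every window."""
    k = len(pwm)
    for start in range(len(sequence) - k + 1):
        window = sequence[start:start + k]
        score = sum(pwm[i][base] for i, base in enumerate(window))
        yield start, window, score


def scan_paths(paths, pwm):
    """Scan every path and record which paths carry each (position, k-mer) pair."""
    hits = {}
    for name, seq in paths.items():
        for start, window, score in scan(seq, pwm):
            entry = hits.setdefault((start, window), {"score": score, "paths": []})
            entry["paths"].append(name)
    return hits


# Assumed stand-ins for sequences spelled by two paths through a variation graph;
# they differ by a single substitution (A -> T at position 5).
PATHS = {
    "reference": "TTACGAAT",
    "haplotype_1": "TTACGTAT",
}

pwm = log_odds_pwm(COUNTS)
hits = scan_paths(PATHS, pwm)
best = max(hits.items(), key=lambda kv: kv[1]["score"])
print(best)
```

In this toy case the highest-scoring window is `ACGT` at position 2, which occurs only on `haplotype_1`; the corresponding reference window `ACGA` scores lower. Grouping hits by position and k-mer means a window shared by several paths is scored once and annotated with every path that carries it.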
Summary
Transcription factors (TFs) are key regulatory proteins, and mutations occurring in their binding sites can alter the normal transcriptional landscape of a cell and lead to disease states. Pangenome variation graphs (VGs) efficiently encode genomes from a population of individuals and their genetic variations. GRAFIMO makes it possible to study how genetic variation affects the binding landscape of known TFs within a population of individuals. This is a PLOS Computational Biology Software paper.