Motto: Representing Motifs in Consensus Sequences with Minimum Information Loss.

Mengchi Wang, Kai Zhang, Shicai Fan, Wei Wang, David Wang, Vu Ngo

doi:10.1534/genetics.120.303597

Abstract

Sequence analysis frequently requires intuitive understanding and convenient representation of motifs. Typically, motifs are represented as position weight matrices (PWMs) and visualized using sequence logos. However, in many scenarios, such as interpreting motif information or searching for motif matches, it is more compact, and sufficient, to represent motifs by wildcard-style consensus sequences (such as [GC][AT]GATAAG[GAC]). Based on mutual information theory and Jensen-Shannon divergence, we propose a mathematical framework to minimize the information loss in converting PWMs to consensus sequences. We name this representation sequence Motto and have implemented an efficient algorithm with flexible options for converting motif PWMs of nucleotides, amino acids, and customized characters into Motto. We show that this representation provides a simple and efficient way to identify the binding sites of 1156 common transcription factors (TFs) in the human genome. The effectiveness of the method was benchmarked by comparing sequence matches found by Motto with PWM scanning results found by FIMO. On average, our method achieves a 0.81 area under the precision-recall curve, significantly (P-value < 0.01) outperforming all existing methods, including maximal positional weight, Cavener’s method, and minimal mean square error. We believe this representation provides a distilled summary of a motif, together with its statistical justification.
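The idea of choosing a consensus character by minimizing a divergence can be sketched as follows. This is a minimal illustration, not the published Motto algorithm: it assumes that, at each position, the candidate representations are the top-k most frequent letters treated as equally likely, and it keeps the k whose uniform distribution has the smallest Jensen-Shannon divergence from the PWM column. The function names (`jsd`, `column_to_symbol`, `pwm_to_consensus`), the use of `N` for a fully ambiguous position, and the example matrix are all assumptions made for illustration; the actual package may use different options and tie-breaking.

```python
import math

def jsd(p, q):
    """Jensen-Shannon divergence (in bits) between two discrete distributions."""
    def kl(a, b):
        # Terms with a == 0 contribute nothing; b is never 0 where a > 0 here.
        return sum(x * math.log2(x / y) for x, y in zip(a, b) if x > 0)
    m = [(x + y) / 2 for x, y in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def column_to_symbol(column, alphabet="ACGT"):
    """Pick the top-k letters whose uniform distribution is closest to the column."""
    # Letter indices ordered from most to least frequent (stable for ties).
    ranked = sorted(range(len(alphabet)), key=lambda i: -column[i])
    best_k, best_d = 1, float("inf")
    for k in range(1, len(alphabet) + 1):
        ideal = [0.0] * len(alphabet)
        for i in ranked[:k]:
            ideal[i] = 1.0 / k
        d = jsd(column, ideal)
        if d < best_d:
            best_k, best_d = k, d
    letters = "".join(alphabet[i] for i in ranked[:best_k])
    if best_k == 1:
        return letters
    if best_k == len(alphabet):
        return "N"  # assumed wildcard for "any letter"
    return "[" + letters + "]"

def pwm_to_consensus(pwm, alphabet="ACGT"):
    """Convert a PWM (list of per-position frequency columns) to a consensus string."""
    return "".join(column_to_symbol(col, alphabet) for col in pwm)

# Made-up three-position PWM; columns are frequencies of A, C, G, T.
example_pwm = [
    [0.05, 0.45, 0.45, 0.05],
    [0.97, 0.01, 0.01, 0.01],
    [0.25, 0.25, 0.25, 0.25],
]
print(pwm_to_consensus(example_pwm))
```

Letters inside a bracket are listed from most to least frequent, which mirrors the ordering seen in a pattern such as [GAC], although whether the original tool orders them this way is not stated in the text.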

Highlights

Motif analysis is crucial for uncovering sequence patterns, such as protein-binding sites on nucleic acids, splicing sites, epigenetic modification markers and structural elements [1]. A motif is typically represented as a Position Weight Matrix (PWM), in which each entry shows the occurrence frequency of a certain type of nucleotide at each position of the motif.
Motifs can be sufficiently represented by regular expressions of the consensus sequences, such as [GC][AT]GATAAG[GAC] for the GATA2 motif.
In the GATA2 motif example, the GATAAG consensus in the center is the most prominent pattern that would be read off the PWM or sequence logo.

Summary

INTRODUCTION

Motif analysis is crucial for uncovering sequence patterns, such as protein-binding sites on nucleic acids, splicing sites, epigenetic modification markers and structural elements [1]. A motif is typically represented as a Position Weight Matrix (PWM), in which each entry shows the occurrence frequency of a certain type of nucleotide at each position of the motif. Motifs can be sufficiently represented by regular expressions of the consensus sequences, such as [GC][AT]GATAAG[GAC] for the GATA2 motif. This representation is the most compact and intuitive way to delineate a motif. In the GATA2 motif example, the GATAAG consensus in the center is the most prominent pattern that would be read off the PWM or sequence logo. For this reason, consensus sequences are still widely used by the scientific community. We have implemented a lightweight and easy-to-use Python package with versatile options for biologists.
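Because a consensus sequence of this kind is already a valid regular expression, motif matches can be searched for with a standard regex engine. The sketch below uses Python's built-in `re` module with the GATA2 pattern quoted above; the helper name `find_matches` and the test sequence are made up for illustration, and only the forward strand is scanned.

```python
import re

# Consensus sequence for the GATA2 motif, as quoted in the text.
GATA2_CONSENSUS = "[GC][AT]GATAAG[GAC]"

def find_matches(consensus, sequence):
    """Return (start, matched_text) for every match, including overlapping ones."""
    # A lookahead with a capture group lets finditer report overlapping hits.
    pattern = re.compile(f"(?=({consensus}))")
    return [(m.start(), m.group(1)) for m in pattern.finditer(sequence)]

# Made-up DNA sequence: the first GATAAG site has matching flanks, the second does not.
sequence = "TTCAGATAAGGCCGTGATAAGTAA"
for start, hit in find_matches(GATA2_CONSENSUS, sequence):
    print(start, hit)
```

Scanning the reverse strand would require running the same search on the reverse complement of the sequence, which is left out here to keep the example short.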

METHODS

FEATURES AND EXAMPLES

Journal: Genetics	Publication Date: Oct 1, 2020
Citations: 4	License type: CC BY


