Abstract

Sequence analysis frequently requires an intuitive understanding and a convenient representation of motifs. Typically, motifs are represented as position weight matrices (PWMs) and visualized using sequence logos. However, in many scenarios, such as interpreting motif information or searching for motif matches, a wildcard-style consensus sequence (such as [GC][AT]GATAAG[GAC]) is a compact and sufficient representation. Based on mutual information theory and Jensen-Shannon divergence, we propose a mathematical framework that minimizes the information loss in converting PWMs to consensus sequences. We name this representation sequence Motto and have implemented an efficient algorithm with flexible options for converting motif PWMs into Motto for motifs composed of nucleotides, amino acids, or customized characters. We show that this representation provides a simple and efficient way to identify the binding sites of 1156 common transcription factors (TFs) in the human genome. The effectiveness of the method was benchmarked by comparing sequence matches found by Motto with PWM scanning results found by FIMO. On average, our method achieves an area under the precision-recall curve of 0.81, significantly (P-value < 0.01) outperforming all existing methods, including maximal positional weight, Cavener’s method, and minimal mean square error. We believe this representation provides a distilled summary of a motif together with its statistical justification.
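To make the idea of minimizing information loss concrete, the following Python sketch picks, for each PWM column, the character set whose uniform distribution has the smallest Jensen-Shannon divergence from that column. This is only an illustration under stated assumptions: the objective (uniform distribution over the chosen set, no penalty term) and the function names jsd, best_subset and to_consensus are hypothetical and are not taken from the Motto package, whose actual objective and options may differ.

```python
import math
from itertools import combinations


def kl(p, q):
    """Kullback-Leibler divergence in bits; terms with p_i == 0 contribute 0."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def jsd(p, q):
    """Jensen-Shannon divergence between two discrete distributions."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def best_subset(column, alphabet="ACGT"):
    """Assumed objective: choose the subset S of the alphabet whose uniform
    distribution over S is closest (in JSD) to the PWM column."""
    n = len(alphabet)
    best = None
    for k in range(1, n + 1):
        for idx in combinations(range(n), k):
            q = [1.0 / k if i in idx else 0.0 for i in range(n)]
            d = jsd(column, q)
            if best is None or d < best[0]:
                best = (d, idx)
    # Order the chosen characters by descending frequency in the column.
    ordered = sorted(best[1], key=lambda i: -column[i])
    return "".join(alphabet[i] for i in ordered)


def to_consensus(pwm, alphabet="ACGT"):
    """Convert a PWM (list of columns of frequencies) to a wildcard-style string."""
    parts = []
    for column in pwm:
        chars = best_subset(column, alphabet)
        parts.append(chars if len(chars) == 1 else f"[{chars}]")
    return "".join(parts)


# Toy PWM (made-up numbers, columns ordered A, C, G, T).
toy_pwm = [
    [0.05, 0.45, 0.50, 0.00],
    [1.00, 0.00, 0.00, 0.00],
    [0.00, 0.00, 0.00, 1.00],
]
print(to_consensus(toy_pwm))
```

The exhaustive search over subsets is affordable for nucleotides (15 subsets per column) but grows exponentially with alphabet size, so a larger alphabet such as amino acids would need a smarter search than this sketch uses.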

Highlights

  • Motif analysis is crucial for uncovering sequence patterns, such as protein-binding sites on nucleic acids, splicing sites, epigenetic modification markers and structural elements [1]. A motif is typically represented as a Position Weight Matrix (PWM), in which each entry shows the occurrence frequency of a certain type of nucleic acid at each position of the motif.

  • Motifs can be sufficiently represented by regular expressions of the consensus sequences, such as [GC][AT]GATAAG[GAC] for the GATA2 motif

  • In the GATA2 motif example, the GATAAG consensus in the center is the most prominent pattern that would be read off the PWM or sequence logo

INTRODUCTION

Motif analysis is crucial for uncovering sequence patterns, such as protein-binding sites on nucleic acids, splicing sites, epigenetic modification markers and structural elements [1]. A motif is typically represented as a Position Weight Matrix (PWM), in which each entry shows the occurrence frequency of a certain type of nucleic acid at each position of the motif. Motifs can often be sufficiently represented by regular expressions of the consensus sequences, such as [GC][AT]GATAAG[GAC] for the GATA2 motif. This representation is the most compact and intuitive way to delineate a motif. In the GATA2 motif example, the GATAAG consensus in the center is the most prominent pattern that would be read off the PWM or sequence logo. For this reason, consensus sequences are still widely used by the scientific community. We have implemented a lightweight and easy-to-use Python package with versatile options for biologists.
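Because a wildcard-style consensus is already a valid regular expression, searching for motif matches needs nothing beyond a standard regex engine. The sketch below uses Python's built-in re module to scan a sequence with the GATA2 consensus quoted above. The helper name find_matches and the example sequence are made up for illustration, and only the forward strand is scanned; a real analysis would also need to consider the reverse complement.

```python
import re

# GATA2 consensus as given in the text.
GATA2 = "[GC][AT]GATAAG[GAC]"


def find_matches(consensus, sequence):
    """Return (start, matched_text) for every forward-strand match.

    The consensus is wrapped in a lookahead so that overlapping
    matches are also reported.
    """
    pattern = re.compile(f"(?=({consensus}))")
    return [(m.start(), m.group(1)) for m in pattern.finditer(sequence.upper())]


# Hypothetical sequence, chosen only to contain one match.
example = "ttGAGATAAGGcc"
for start, hit in find_matches(GATA2, example):
    print(start, hit)
```

Start positions are 0-based offsets into the input sequence, following Python's indexing convention.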

METHODS
FEATURES AND EXAMPLES