Fast and exact quantification of motif occurrences in biological sequences

Mattia Prosperi, Simone Marini, Christina Boucher

doi:10.1186/s12859-021-04355-6

Open Access

Journal: BMC Bioinformatics
Publication Date: Sep 18, 2021
Citations: 3
License type: open-access

Affiliation: University of Florida

Abstract

Background: Identification of motifs and quantification of their occurrences are important for the study of genetic diseases, gene evolution, transcription sites, and other biological mechanisms. Exact formulae for estimating count distributions of motifs under Markovian assumptions have high computational complexity and are impractical for large motif sets. Approximated formulae, e.g. based on compound Poisson, are faster, but reliable p-value calculation remains challenging. Here, we introduce 'motif_prob', a fast implementation of an exact formula for motif count distribution through progressive approximation with arbitrary precision. Our implementation speeds up the exact calculation, which is usually impractical, making it feasible and positioning it as a substitute for currently employed heuristics.

Results: We implement motif_prob in both Perl and C++, using an efficient error-bounded iterative process for the exact formula. We compare it with state-of-the-art tools (e.g. MoSDi) in terms of precision and run time benchmarks, and present a real-world use case on bacterial motif characterization. Our software is able to process one million motifs (13–31 bases) over genome lengths of 5 million bases within a minute on a regular laptop, and the run times for both the Perl and C++ code are several orders of magnitude smaller (50–1000× faster) than MoSDi, even when using their fast compound Poisson approximation (60–120× faster). In the real-world use cases, we first show the consistency of motif_prob with MoSDi, and then how p-value quantification is crucial for enrichment quantification when bacteria have different GC content, using motifs found in antimicrobial resistance genes. The software and the source code are available under the MIT license at https://github.com/DataIntellSystLab/motif_prob.

Conclusions: The motif_prob software is a multi-platform and efficient open source solution for calculating exact frequency distributions of motifs. It can be integrated with motif discovery/characterization tools for quantifying enrichment and deviation from expected frequency ranges with exact p-values, without loss in data processing efficiency.
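To make the notion of an enrichment p-value concrete, the following is a minimal sketch of an upper-tail p-value for an observed motif count. It is not the exact formula implemented by motif_prob: it assumes a plain Poisson model with a known expected count and ignores motif self-overlap, which is precisely the kind of simplification the abstract describes as fast but less reliable. The function name `poisson_upper_tail` and the example numbers are hypothetical.

```python
from math import exp


def poisson_upper_tail(observed, expected):
    """Return P(X >= observed) for X ~ Poisson(expected).

    Illustrative approximation only: a plain Poisson model that ignores
    motif self-overlap. It is NOT the exact formula used by motif_prob.
    """
    if observed <= 0:
        return 1.0
    term = exp(-expected)          # P(X = 0)
    cdf = 0.0
    for i in range(observed):      # accumulate P(X = 0) ... P(X = observed - 1)
        cdf += term
        term *= expected / (i + 1)  # P(X = i + 1) from P(X = i)
    return max(0.0, 1.0 - cdf)


# Hypothetical example: a 13-base motif in a 5-million-base genome with
# uniform nucleotide frequencies, observed 3 times.
genome_length = 5_000_000
motif_length = 13
expected = (genome_length - motif_length + 1) * 0.25 ** motif_length
p_value = poisson_upper_tail(3, expected)
print(f"expected count = {expected:.4f}, upper-tail p-value = {p_value:.3e}")
```

Because the tail is computed as one minus a cumulative sum, very small p-values lose precision to floating-point cancellation, and `exp(-expected)` underflows for very large expected counts. Both limitations are reasons to prefer an exact, error-controlled calculation of the kind the abstract describes.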

Highlights

Identification of motifs and quantification of their occurrences are important for the study of genetic diseases, gene evolution, transcription sites, and other biological mechanisms
Motif discovery and characterization are important for the study of gene evolution, duplication, transcription sites, and protein identification [1], as well as of genetic diseases caused by unstable repeat expansion [2, 3]
Exact formulae for estimating count distributions of motifs under Markovian assumptions have high computational complexity and are impractical to be used on large data sets

Summary

Results

Both the Perl and the C++ programs exhibit run times several orders of magnitude smaller than MoSDi, even when the latter is executed with the fast compound Poisson approximation. This variability is due to: the individual nucleotide content, which can differ even when the GC content is the same and which directly affects the distribution (see Fig. 2); the genome length; and the nucleotide content of the query motifs.
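The point that individual nucleotide content matters even at equal GC content can be illustrated with a small sketch. It assumes an independent, identically distributed (zero-order) nucleotide model, which is simpler than the Markovian setting the paper addresses, and all names (`motif_probability`, `expected_count`, the two frequency tables, the example motif) are hypothetical.

```python
from math import prod


def motif_probability(motif, freqs):
    """Probability of seeing `motif` at one fixed position under an
    i.i.d. nucleotide model (a simplifying assumption, not the paper's model)."""
    return prod(freqs[base] for base in motif)


def expected_count(motif, freqs, genome_length):
    """Expected number of occurrences over all start positions."""
    positions = max(genome_length - len(motif) + 1, 0)
    return positions * motif_probability(motif, freqs)


# Two hypothetical genomes with the same GC content (50%) but a
# different split between A and T.
balanced = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
skewed = {"A": 0.35, "C": 0.25, "G": 0.25, "T": 0.15}

motif = "AAAAACGAAAAAA"  # hypothetical A-rich 13-base motif
genome_length = 5_000_000

for name, freqs in (("balanced", balanced), ("skewed", skewed)):
    count = expected_count(motif, freqs, genome_length)
    print(f"{name}: expected count = {count:.4f}")
```

Under these assumptions the A-rich motif is expected far more often in the skewed genome, although both genomes share the same GC content, so a raw count that looks enriched in one genome may be unremarkable in the other.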

