Abstract

The existence of over- and under-represented sequence motifs in genomes provides evidence of selective evolutionary pressures on biological mechanisms such as transcription, translation, ligand-substrate binding, and host immunity. To accurately identify motifs and other genome-scale patterns of interest, it is essential to generate null models that are appropriate for the sequences under study. While many tools have been developed to create random nucleotide sequences, protein-coding sequences are subject to a unique set of constraints that complicates the process of generating appropriate null models. There are currently no tools available that allow users to create random coding sequences with specified amino acid composition and GC content for the purpose of hypothesis testing. Using the principle of maximum entropy, we developed a method that generates unbiased random sequences with pre-specified amino acid and GC content, and we have implemented it as a Python package. Our method is the simplest way to obtain maximally unbiased random sequences that are subject to GC usage and primary amino acid sequence constraints. Furthermore, this approach can easily be expanded to create unbiased random sequences that incorporate more complicated constraints such as individual nucleotide usage or even dinucleotide frequencies. The ability to generate correctly specified null models will allow researchers to accurately identify sequence motifs, which will lead to a better understanding of biological processes as well as more effective engineering of biological systems.
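To make the idea in the abstract concrete, here is a minimal sketch of how a maximum-entropy constraint on GC content can be imposed on a fixed amino acid sequence. It is not the code of the package described here. It assumes the standard genetic code and the textbook result that, under a constraint on a mean, the maximum-entropy distribution is exponential in the constrained quantity, so each synonymous codon is weighted by exp(lam * GC count). The function names, the multiplier `lam`, and the bisection bounds are all hypothetical choices made for illustration.

```python
import math
import random
from collections import defaultdict

# Standard genetic code, codons enumerated in TCAG order ('*' marks stop codons).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

SYNONYMS = defaultdict(list)  # amino acid -> list of its codons
for i, aa in enumerate(AMINO):
    codon = BASES[i // 16] + BASES[(i // 4) % 4] + BASES[i % 4]
    SYNONYMS[aa].append(codon)


def gc_count(codon):
    """Number of G or C bases in a codon (0 to 3)."""
    return sum(base in "GC" for base in codon)


def codon_probs(aa, lam):
    """Assumed maximum-entropy form: P(codon | aa) proportional to exp(lam * GC count)."""
    weights = [math.exp(lam * gc_count(c)) for c in SYNONYMS[aa]]
    total = sum(weights)
    return [w / total for w in weights]


def expected_gc(protein, lam):
    """Expected GC fraction of a coding sequence for `protein` at multiplier `lam`."""
    total = 0.0
    for aa in protein:
        probs = codon_probs(aa, lam)
        total += sum(p * gc_count(c) for p, c in zip(probs, SYNONYMS[aa]))
    return total / (3 * len(protein))


def solve_lambda(protein, target_gc, lo=-20.0, hi=20.0, iters=60):
    """Bisection on lam; expected_gc is non-decreasing in lam.

    If target_gc is not attainable for this protein, the result simply
    ends up near one of the (arbitrary) bounds.
    """
    for _ in range(iters):
        mid = (lo + hi) / 2
        if expected_gc(protein, mid) < target_gc:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


def random_cds(protein, target_gc, seed=None):
    """Sample one coding sequence whose GC content matches target_gc in expectation."""
    rng = random.Random(seed)
    lam = solve_lambda(protein, target_gc)
    return "".join(
        rng.choices(SYNONYMS[aa], weights=codon_probs(aa, lam))[0] for aa in protein
    )
```

In this sketch the GC target is met only in expectation: individual samples fluctuate around it, and the amino acid sequence is preserved exactly because every draw is restricted to synonymous codons.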

Highlights

  • Genome sequencing costs continue to decline, resulting in a vast increase in the number of publicly available genome sequences spanning a wide range of diverse species

  • We develop an algorithm based on maximum entropy that consistently generates fully random nucleotide sequences with the desired amino acid composition and GC content

  • Creating random amino acid sequences with a specified amino acid composition can, in principle, be accomplished if we imagine a bag of marbles in which the number of marbles of a given color in the bag is proportional to the frequency of the corresponding amino acid in a desired sequence
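The marble-bag picture in the last highlight can be sketched in a few lines of Python. This is an illustration only, not the authors' implementation; the function names and the example compositions are hypothetical. Drawing with replacement matches the desired composition on average, while filling the bag with exact counts and shuffling it reproduces the composition exactly.

```python
import random


def draw_with_replacement(frequencies, length, seed=None):
    """Draw `length` residues, each with probability proportional to its frequency."""
    rng = random.Random(seed)
    residues = list(frequencies)
    weights = [frequencies[r] for r in residues]
    return "".join(rng.choices(residues, weights=weights, k=length))


def shuffle_bag(counts, seed=None):
    """Fill a bag with exactly counts[aa] marbles per amino acid, then shuffle it."""
    rng = random.Random(seed)
    bag = [aa for aa, n in counts.items() for _ in range(n)]
    rng.shuffle(bag)
    return "".join(bag)
```

For example, `shuffle_bag({"A": 3, "G": 2, "K": 1})` returns a six-residue string that always contains three A, two G, and one K, in random order.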

Introduction

Genome sequencing costs continue to decline, resulting in a vast increase in the number of publicly available genome sequences spanning a wide range of diverse species. The exponential increase of sequencing data has led to the development of computational pipelines, software tools, and algorithms to perform tasks such as genome annotation [1,2,3,4], phylogenetic inference [5, 6], and protein structure analysis [7,8,9] using only nucleotide sequences. All of these tools must be able to separate patterns within genome sequences from random noise. It was recently shown that protospacer adjacent motifs (PAM) are depleted in phages that infect hosts with the CRISPR system because evolution selects against phages with motifs that are recognized by the bacteria’s innate defense [13].
