Abstract

Proteins are responsible for many cellular tasks, including growth and metabolism. Transcription, the process in which genes serve as templates for a messenger RNA intermediate that is then used in protein synthesis, is regulated so that the cell responds appropriately to its current needs. An essential step in transcription is the binding of a group of proteins, collectively known as RNA polymerase, to short promoter sequences upstream of the genes to be transcribed. Automated identification of promoters and nearby regulatory sequences can help to predict which genes are likely to be active under a given set of conditions. However, promoters are short, highly variable, and belong to subclasses that sometimes overlap, which makes their recognition a very difficult problem. Several tools have been developed to identify promoters in DNA, but these methods are generally tested on small, balanced subsets of genomic sequence, and the results may not reflect their expected performance on genomes that are millions of base pairs long and in which only ~1% of the sequence is expected to correspond to promoters. Here we introduce Expositor, a neural-network-based method that uses different types of DNA encodings and tunable sensitivity and specificity parameters. Although the performance of Expositor on balanced datasets was comparable to that of other approaches, at the genome scale our approach finds the highest number of promoters (70% against 46%) with the smallest number of false positives. We also examined the accuracy of Expositor in distinguishing different classes of promoters and found that misclassification between classes was consistent with the biological similarity between promoters. The Expositor source code, the pretrained model, and the datasets used for training and testing are available at https://github.com/beiko-lab/Expositor.
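
The gap between balanced-test and genome-scale performance can be illustrated with a short calculation. The sketch below is not part of Expositor; the function name and the 90% sensitivity and specificity figures are hypothetical values chosen only for illustration. It computes the fraction of predicted promoters that are real (precision) when promoters make up 50% of the test data versus roughly 1% of a genome.

```python
def precision(sensitivity: float, specificity: float, prevalence: float) -> float:
    """Fraction of positive predictions that are true positives."""
    true_pos = sensitivity * prevalence                    # promoters correctly flagged
    false_pos = (1.0 - specificity) * (1.0 - prevalence)   # non-promoters wrongly flagged
    return true_pos / (true_pos + false_pos)


# Hypothetical classifier: 90% sensitivity and 90% specificity (illustrative only).
balanced = precision(0.9, 0.9, 0.5)       # balanced benchmark: half the windows are promoters
genome_scale = precision(0.9, 0.9, 0.01)  # genome scale: about 1% of windows are promoters

print(f"balanced: {balanced:.3f}, genome scale: {genome_scale:.3f}")
```

Under these assumed values, the same classifier that is right nine times out of ten on a balanced benchmark would be right in only about one of every twelve promoter calls at 1% prevalence, which is why false positives dominate at the genome scale.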
