Abstract

Background: Detecting the borders between coding and non-coding regions is an essential step in genome annotation, and information entropy measures are useful for describing the signals in genome sequences. However, the accuracy of previous border-finding methods based on entropic segmentation still needs to be improved.

Methods: In this study, we first applied a new recursive entropic segmentation method to DNA sequences to obtain preliminary significant cuts. A 22-symbol alphabet is used to capture the differential composition of nucleotide doublets and stop codon patterns along the three phases in both DNA strands. This process requires no prior training datasets.

Results: Compared with previous segmentation methods, the experimental results on three bacterial genomes, Rickettsia prowazekii, Borrelia burgdorferi and E. coli, show that our approach improves the accuracy of finding the borders between coding and non-coding regions in DNA sequences.

Conclusions: This paper presents a new segmentation method for prokaryotes based on Jensen-Rényi divergence with a 22-symbol alphabet. For the three bacterial genomes, our method improved on the A12_JR method in the accuracy of finding the borders between protein-coding and non-coding regions in DNA sequences.

Highlights

  • Detecting the borders between coding and non-coding regions is an essential step in genome annotation

  • We introduced a 22-symbol alphabet that takes into account the non-uniform distribution of di-nucleotides and stop codon patterns (SCPs) in both DNA strands (Table 1 and Table 2)

  • The DNA segment was randomly chosen from the genomes of the bacteria Borrelia burgdorferi and Rickettsia prowazekii


Summary

Introduction

Detecting the borders between coding and non-coding regions is an essential step in genome annotation. Information entropy measures are useful for describing the signals in genome sequences, but the accuracy of previous border-finding methods based on entropic segmentation still needs to be improved. Many methods for finding probable borders rely on strong signals that distinguish coding regions from non-coding ones [7,8]. Staden [9] used the intersection method to detect the borders between coding and non-coding regions. Information entropy measures of these signals are useful for identifying homogeneous regions. Based on entropy theory, we used recursive segmentation to detect the borders between coding and non-coding DNA regions. The results show that this approach improves accuracy.
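To make the idea of recursive entropic segmentation concrete, the following is a minimal Python sketch, not the authors' implementation. It rests on several assumptions that are not taken from the paper: it uses the plain four-letter nucleotide alphabet rather than the 22-symbol alphabet (whose definition is not given here), it accepts a cut when the divergence exceeds a fixed threshold instead of applying a statistical significance test, and the names `renyi_entropy`, `jensen_renyi`, `best_cut`, `segment`, as well as the parameters `alpha`, `threshold` and `min_len`, are hypothetical. The input is assumed to contain only the letters A, C, G and T.

```python
import math
from collections import Counter

ALPHABET = "ACGT"  # assumption: plain nucleotides, not the paper's 22-symbol alphabet


def renyi_entropy(counts, alpha):
    """Renyi entropy in bits of order alpha; alpha == 1 falls back to Shannon entropy."""
    total = sum(counts)
    probs = [c / total for c in counts if c > 0]
    if alpha == 1.0:
        return -sum(p * math.log2(p) for p in probs)
    return math.log2(sum(p ** alpha for p in probs)) / (1.0 - alpha)


def jensen_renyi(left, right, alpha=0.5):
    """Length-weighted Jensen-Renyi divergence between two subsequences."""
    n_left, n_right = len(left), len(right)
    n = n_left + n_right
    c_left, c_right = Counter(left), Counter(right)
    merged = [c_left[s] + c_right[s] for s in ALPHABET]
    h_all = renyi_entropy(merged, alpha)
    h_left = renyi_entropy([c_left[s] for s in ALPHABET], alpha)
    h_right = renyi_entropy([c_right[s] for s in ALPHABET], alpha)
    return h_all - (n_left / n) * h_left - (n_right / n) * h_right


def best_cut(seq, alpha=0.5, min_len=10):
    """Return (position, divergence) of the cut that maximizes the divergence."""
    best_pos, best_div = None, 0.0
    # Brute-force scan; quadratic in len(seq), which is fine for a toy example.
    for pos in range(min_len, len(seq) - min_len + 1):
        d = jensen_renyi(seq[:pos], seq[pos:], alpha)
        if d > best_div:
            best_pos, best_div = pos, d
    return best_pos, best_div


def segment(seq, alpha=0.5, threshold=0.1, min_len=10, offset=0):
    """Recursively split seq; return sorted cut positions in original coordinates."""
    pos, div = best_cut(seq, alpha, min_len)
    # Assumption: a fixed threshold stands in for a proper significance test.
    if pos is None or div < threshold:
        return []
    left = segment(seq[:pos], alpha, threshold, min_len, offset)
    right = segment(seq[pos:], alpha, threshold, min_len, offset + pos)
    return left + [offset + pos] + right
```

On a toy input made of two compositionally pure blocks, such as `"A" * 40 + "C" * 40`, calling `segment` on it finds a single cut at position 40. Real genomic data would need the richer alphabet and a significance criterion to separate genuine borders from compositional noise.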
