Abstract

CpG dinucleotide clusters also referred to as CpG islands (CGIs) are usually located in the promoter regions of genes in a deoxyribonucleic acid (DNA) sequence. CGIs play a crucial role in gene expression and cell differentiation, as such, they are normally used as gene markers. The earlier CGI identification methods used the rich CpG dinucleotide content in CGIs, as a characteristic measure to identify the locations of CGIs. The fact, that the probability of nucleotide G following nucleotide C in a CGI is greater as compared to a non-CGI, is employed by some of the recent methods. These methods use the difference in transition probabilities between subsequent nucleotides to distinguish between a CGI from a non-CGI. These transition probabilities vary with the data being analyzed and several of them have been reported in the literature sometimes leading to contradictory results. In this article, we propose a new and efficient scheme for identification of CGIs using statistically optimal null filters. We formulate a new CGI identification characteristic to reliably and efficiently identify CGIs in a given DNA sequence which is devoid of any ambiguities. Our proposed scheme combines maximum signal-to-noise ratio and least squares optimization criteria to estimate the CGI identification characteristic in the DNA sequence. The proposed scheme is tested on a number of DNA sequences taken from human chromosomes 21 and 22, and proved to be highly reliable as well as efficient in identifying the CGIs.

Highlights

  • In the recent years, computational methods for processing and interpreting vast amount of genomic data, generated from genome sequencing, have gained a lot of scientific interest

  • In this article, a new digital signal processing (DSP)-based technique using Statistically optimal null filters (SONF) is proposed for the prediction of CpG islands (CGIs) in deoxyribonucleic acid (DNA) sequences

  • A novel CPG identification characteristic is presented in the form of a binary basis sequence which is shown to identify CGIs reliably

Read more

Summary

Introduction

Computational methods for processing and interpreting vast amount of genomic data, generated from genome sequencing, have gained a lot of scientific interest Genomic sequences such as deoxyribonucleic acid (DNA) consist of biological instructions which are crucial for the development and normal functioning of almost all living organisms [1]. A DNA molecule has a complex double helix structure that involves two strands, consisting of alternating sugar and phosphate groups. Attached to these sugar groups of each DNA strand are one of the four chemical bases, namely, adenine (A), thymine (T), guanine (G), and cytosine (C). Nucleotides in a DNA strand exhibit short, recurring patterns (

Objectives
Methods
Results
Conclusion
Full Text
Published version (Free)

Talk to us

Join us for a 30 min session where you can share your feedback and ask us any queries you have

Schedule a call