How to Enable Index Scheme for Reducing the Writing Cost of DNA Storage on Insertion and Deletion

Abstract

Recently, the demand for storing digital data has been growing rapidly; however, conventional storage media cannot satisfy this huge demand. Fortunately, thanks to developments in biological technology, storing digital data in deoxyribonucleic acid (DNA) has become possible in recent years. Furthermore, because of its attractive features (e.g., high storage density, long-term durability, and stability), DNA storage has been regarded as a potential alternative medium for storing massive amounts of digital data in the future. Nevertheless, reading and writing digital data on DNA require a series of extremely time-consuming processes (i.e., DNA sequencing and DNA synthesis); of the two, the writing cost is the predominant cost of a DNA data storage system. Therefore, to enable efficient DNA storage, this article proposes an index management scheme for reducing the number of accesses to DNA storage. Additionally, this article introduces a new DNA data encoding format with VERA (Version Editing Recovery Approach) to reduce the total number of written bits when inserting and deleting data. To the best of our knowledge, this is the first work to provide a total data management solution for DNA storage. According to the experimental results, the proposed design with VERA reduces cost by 77% and improves performance by 71% compared to append-only methods.
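The abstract does not detail the VERA format or the index layout, but the cost-accounting idea it rests on, namely an index that maps logical records to oligo addresses so an edit rewrites only the affected strand instead of appending a fresh copy of everything, can be sketched. All names (`OligoIndex`) and parameters (the 200-nt strand size) below are illustrative assumptions, not the authors' design.

```python
# Hypothetical sketch: nucleotides synthesized per update under an
# append-only policy vs. an indexed in-place policy.
STRAND_NT = 200  # nucleotides per synthesized oligo (assumed)

class OligoIndex:
    """Maps logical record ids to oligo addresses; rewriting one record
    costs one strand synthesis instead of re-appending the whole object."""
    def __init__(self):
        self.addr = {}           # record id -> oligo address
        self.synthesized_nt = 0  # total writing cost in nucleotides

    def write(self, rec_id):
        self.addr[rec_id] = len(self.addr)  # assign the next free address
        self.synthesized_nt += STRAND_NT    # synthesize one strand

def append_only_cost(n_records, n_edits):
    # Every edit appends a fresh copy of all records.
    return STRAND_NT * (n_records + n_edits * n_records)

def indexed_cost(n_records, n_edits):
    idx = OligoIndex()
    for r in range(n_records):
        idx.write(r)
    for _ in range(n_edits):
        idx.synthesized_nt += STRAND_NT  # rewrite only the edited strand
    return idx.synthesized_nt

print(append_only_cost(10, 5))  # 12000 nt
print(indexed_cost(10, 5))      # 3000 nt
```

Under these toy numbers the indexed policy synthesizes a quarter of the nucleotides, which is the flavor of saving the abstract reports, though the paper's actual figures come from its own encoding and workloads.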

Similar Papers
  • Research Article
  • Cited by 22
  • 10.1109/tase.2006.871483
In situ DNA synthesis on glass substrate for microarray fabrication using self-focusing acoustic transducer
  • Apr 1, 2006
  • IEEE Transactions on Automation Science and Engineering
  • Jae Wan Kwon + 2 more

This paper presents a droplet-ejection-based technique for synthesizing deoxyribonucleic acid (DNA) sequences on different substrates, such as glass, plastic, or silicon. Any DNA sequence can be synthesized by ejecting droplets of DNA bases with a self-focusing acoustic transducer (SFAT) that does not require any nozzles. An SFAT can eject liquid droplets around 3-5 µm in diameter, significantly smaller than those ejected by commercial ink-jet printers, which reduces the amount of reagents needed for the synthesis. An array of SFATs is integrated with microchannels and reservoirs for delivery of DNA bases to the SFATs. A poly-L-lysine-coated glass slide is patterned and used as a target substrate for in situ synthesis of multiple T bases. The significant advantage of this scheme over some of the existing commercial solutions is that it allows geneticists to synthesize any DNA sequence within hours, using a computer program, at an affordable cost in their own labs. This paper describes the concept and scheme of on-demand DNA synthesis (with an acoustic ejector integrated with microfluidic components) along with the results of an actual DNA synthesis by an SFAT. Note to Practitioners: Deoxyribonucleic acid (DNA) microarrays allow geneticists to monitor the interactions among thousands of genes simultaneously on a chip. There are commercial systems for producing DNA microarrays, but none of them offers the flexibility to synthesize DNA microarrays on demand in the geneticist's own lab. Affymetrix's GeneChip technology produces DNA probe sequences premade at Affymetrix with a set of 4n photomasks for n-mers. Other techniques transfer premade DNA sequences to a substrate (glass, plastic, or silicon) through ink-jet printing or contact dispensing. Agilent and Rosetta use their ink-jet printing technology to produce DNA probe sequences at their factories. The ink-jet print heads used for printing microarrays use either piezoelectric or thermal actuation and eject liquid droplets through nozzles. Thus, the smallest droplet size ejected from these devices depends on the size of the nozzle; small nozzles are difficult to construct with good uniformity and tend to get clogged. The idea presented in this paper is to develop a microelectromechanical-system (MEMS)-based portable system for synthesizing DNA on different substrates using nozzleless, heatless, lensless acoustic droplet ejectors. Future research will synthesize longer DNA sequences with a combination of different bases, using directional droplet ejectors.

  • Research Article
  • Cited by 4
  • 10.1109/tcbb.2024.3493203
Performance Comparison Between Deep Neural Network and Machine Learning Based Classifiers for Huntington Disease Prediction From Human DNA Sequence.
  • Jan 1, 2025
  • IEEE transactions on computational biology and bioinformatics
  • C Vishnuppriya + 1 more

Huntington Disease (HD) is a type of neurodegenerative disorder that causes problems such as psychiatric disturbances, movement problems, weight loss, and sleep disturbance. It needs to be addressed at an early stage of human life. Nowadays, a Deep Learning (DL) based system can help physicians by providing a second opinion when treating a patient's disease. In this work, human deoxyribonucleic acid (DNA) sequences are analyzed using a Deep Neural Network (DNN) algorithm to predict HD. The main objective of this work is to identify whether a human DNA sequence is affected by HD or not. Human DNA sequences are collected from the National Center for Biotechnology Information (NCBI), and synthetic human DNA data are also constructed for processing. Numerical conversion of the human DNA sequence data is then done by the Chaos Game Representation (CGR) method. After that, the numerical values of the DNA data are used for feature extraction: mean, median, standard deviation, entropy, contrast, correlation, energy, and homogeneity are extracted. Additionally, the counts of adenine, thymine, guanine, and cytosine are extracted from the DNA sequence data itself. The extracted features are used as input to the DNN classifier and to other machine-learning-based classifiers such as NN (Neural Network), Support Vector Machine (SVM), Random Forest (RF), and Classification Tree with Forward Pruning (CTWFP). Six performance measures are used: Accuracy, Sensitivity, Specificity, Precision, F1 score, and Matthews Correlation Coefficient (MCC). The study concludes that DNN, NN, SVM, and RF achieve 100% accuracy, while CTWFP achieves an accuracy of 87%.
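The base-count and entropy features mentioned in this abstract are straightforward to reproduce. The sketch below (function name and return shape are illustrative, not from the paper) extracts the A/T/G/C counts and a Shannon-entropy feature from a raw sequence string.

```python
from collections import Counter
from math import log2

def base_features(seq):
    """Count A/T/G/C and compute the Shannon entropy of the base
    distribution -- two of the feature types listed in the abstract."""
    counts = Counter(seq.upper())
    n = len(seq)
    probs = [counts[b] / n for b in "ATGC" if counts[b] > 0]
    entropy = -sum(p * log2(p) for p in probs)
    return {b: counts[b] for b in "ATGC"}, entropy

feats, h = base_features("ATGCATGC")
print(feats)  # {'A': 2, 'T': 2, 'G': 2, 'C': 2}
print(h)      # 2.0 -- uniform distribution over 4 bases gives 2 bits
```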

  • Book Chapter
  • Cited by 2
  • 10.1002/9780470015902.a0025334
Development and Role of the Human Reference Sequence in Personal Genomics
  • Jun 16, 2014
  • Encyclopedia of Life Sciences
  • Todd M Smith + 1 more

Genome maps, like geographical maps, need to be interpreted carefully. Although maps are essential to exploration and navigation, they cannot be completely accurate. Humans have been mapping the world for several millennia, but genomes have been mapped and explored for just a single century, with the greatest advancements in making a sequence reference map of the human genome possible in the past 30 years. After the deoxyribonucleic acid (DNA) sequence of the human genome was completed in 2003, the reference sequence underwent several improvements, and today it provides the underlying comparative resource for a multitude of genetic assays and biochemical measurements. However, the ability to simplify genetic analysis through a single comprehensive map remains an elusive goal. Key Concepts: Maps are incomplete and contain errors. DNA sequence data are interpreted through biochemical experiments or comparisons to other DNA sequences. A reference genome sequence is a map that provides the essential coordinate system for annotating the functional regions of the genome and comparing differences between individuals' genomes. The reference genome sequence is always a product of understanding at a set point in time and continues to evolve. DNA sequences evolve through duplication and mutation and, as a result, contain many repeated sequences of different sizes, which complicates data analysis. DNA sequence variation happens on large and small scales with respect to the lengths of the DNA differences, including single base changes, insertions, deletions, duplications, and rearrangements. DNA sequences within the human population undergo continual change and vary highly between individuals. The current reference genome sequence is a collection of sequences, an assembly, that includes sequences assembled into chromosomes, sequences that are part of structurally complex regions that cannot be assembled, patches (fixes) that cannot be included in the primary sequence, and highly variable sequences that are organised into alternate loci. Genetic analysis is error prone, and the data require validation, because the methods for collecting DNA sequences create artifacts and the reference sequence used for comparative analyses is incomplete.

  • Supplementary Content
  • Cited by 2
  • 10.13140/rg.2.2.14683.00806
Analysis of Compression Techniques for DNA Sequence Data
  • Jun 1, 2020
  • arXiv (Cornell University)
  • Shakeela Bibi + 3 more

Biological data mainly comprise deoxyribonucleic acid (DNA) and protein sequences. These are the biomolecules present in all cells of human beings. Due to its self-replicating property, DNA is a key constituent of the genetic material that exists in all living creatures. This biomolecule (DNA) contains the genetic material required for the functioning and development of all living organisms. Saving the DNA data of a single person requires about 10 CD-ROMs. Moreover, this size is increasing constantly, and more and more sequences are being added to the public databases. This abundant increase in sequence data raises challenges for precise information extraction, since many data-analysis and visualization tools do not support processing of such a huge amount of data. To reduce the size of DNA and protein sequences, many scientists have introduced various types of sequence compression algorithms, such as compress or gzip, Context Tree Weighting (CTW), Lempel-Ziv-Welch (LZW), arithmetic coding, run-length encoding, and substitution methods. These techniques have contributed substantially to minimizing the volume of biological datasets. On the other hand, traditional compression techniques are not particularly suitable for compressing these types of sequential data. In this paper, we have explored diverse types of techniques for compressing large amounts of DNA sequence data. The analysis reveals that efficient techniques not only reduce the size of the sequence but also avoid any information loss. The review of existing studies also shows that compression of a DNA sequence is significant for understanding the critical characteristics of DNA data, in addition to improving storage efficiency and data transmission. Compression of protein sequences remains a challenge for the research community. The major parameters for evaluating these compression algorithms include compression ratio and running time complexity.
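As a concrete instance of the compression-ratio metric mentioned above, the sketch below packs a DNA sequence at 2 bits per base, the simplest substitution-style code, and compares it with the 8-bit ASCII representation. Function and table names are illustrative.

```python
# Fixed 2-bit substitution code for the four nucleotides.
CODE = {"A": 0b00, "C": 0b01, "G": 0b10, "T": 0b11}

def pack_2bit(seq):
    """Pack 4 bases per byte: the baseline fixed-code DNA compressor."""
    out = bytearray()
    for i in range(0, len(seq), 4):
        byte = 0
        for base in seq[i:i + 4]:
            byte = (byte << 2) | CODE[base]
        out.append(byte)
    return bytes(out)

seq = "ACGTACGTACGTACGT"        # 16 bases = 16 ASCII bytes
packed = pack_2bit(seq)         # 4 bytes
ratio = len(seq) / len(packed)  # compression ratio vs. plain text
print(len(packed), ratio)       # 4 4.0
```

Specialized DNA compressors surveyed in the paper (CTW, LZW, arithmetic coding) aim to beat this 4:1 baseline by exploiting repeats and statistical structure in real sequences.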

  • Book Chapter
  • Cited by 10
  • 10.1007/978-3-030-95388-1_19
A Strategy-based Optimization Algorithm to Design Codes for DNA Data Storage System
  • Jan 1, 2022
  • Abdur Rasool + 3 more

The exponential increase of big data volumes demands large-capacity, high-density storage. Deoxyribonucleic acid (DNA) has recently emerged as a new research trend for data storage in various studies due to its high capacity and durability, in which primers and address sequences play a vital role. However, it is a critical biocomputing task to design DNA strands without errors. In the DNA synthesis and sequencing process, each nucleotide is repeated, which is prone to errors during the hybridization reactions. This decreases the lower bounds of DNA coding sets, which affects data storage stability. This study proposes a metaheuristic algorithm to improve the lower bounds of DNA data storage. The proposed algorithm is inspired by the moth-flame optimizer (MFO), which has exploration and exploitation capability in one dimension, and is enhanced by an opposition-based learning (OBL) strategy with a three-dimensional search space for the optimal solution; hereafter it is called the MFOL algorithm. The algorithm is programmed to construct DNA storage codes by reducing the error rates of DNA coding sets under GC-content, Hamming distance, and no-runlength constraints. In experiments, 13 benchmark functions and the Wilcoxon rank-sum test are implemented, and performance is compared with the original MFO and three other algorithms. The DNA codewords generated by MFOL are compared with those of the state-of-the-art Altruistic algorithm and the KMVO algorithm. The proposed algorithm improved DNA coding rates by 30% with shorter sequences, reducing errors during DNA synthesis and sequencing.
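The three codeword constraints named in this abstract (GC-content, minimum Hamming distance, and no-runlength, i.e. no two identical adjacent nucleotides) are easy to state directly in code. The thresholds below (balanced GC, distance at least 2) are illustrative choices, not the paper's parameters.

```python
def gc_content(w):
    """Fraction of G and C bases in codeword w."""
    return (w.count("G") + w.count("C")) / len(w)

def hamming(a, b):
    """Number of positions at which equal-length strands differ."""
    return sum(x != y for x, y in zip(a, b))

def no_runlength(w):
    """No-runlength constraint: no nucleotide repeats immediately."""
    return all(x != y for x, y in zip(w, w[1:]))

def valid_codeword(w, others, d_min=2):
    # Balanced GC, no immediate repeats, and Hamming distance >= d_min
    # from every codeword already in the set (thresholds assumed).
    return (gc_content(w) == 0.5
            and no_runlength(w)
            and all(hamming(w, o) >= d_min for o in others))

print(valid_codeword("ACGT", ["AGCT"]))  # True: distance 2, GC = 0.5
print(no_runlength("AACG"))              # False: "AA" is a run
```

Improving the "lower bounds" the abstract refers to means finding larger sets of codewords that all pass checks like these, which is the combinatorial search the MFOL optimizer performs.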

  • Conference Article
  • 10.1117/12.486714
DNA sequence similarity search through content-based retrieval technique
  • Aug 27, 2003
  • Proceedings of SPIE, the International Society for Optical Engineering/Proceedings of SPIE
  • Chia Hung Yeh + 3 more

It is difficult to analyze the similarity of deoxyribonucleic acid (DNA) sequences due to their length and complexity. The challenge lies in using digital signal processing (DSP) to solve highly relevant problems in DNA sequences. Here, we transfer a one-dimensional (1D) DNA sequence into a two-dimensional (2D) pattern by using the Peano scan algorithm. Four complex values are assigned to the characters "A", "C", "T", and "G", respectively. Then, a Fourier transform is employed to obtain the far-field amplitude distribution of the 2D pattern. At this point, a 1D DNA sequence has become a 2D image pattern. Features are extracted from the 2D image pattern with the Principal Component Analysis (PCA) method, so that a DNA sequence database can be established. Unfortunately, comparing features may take a long time when the database is large, since multi-dimensional features are often involved. This problem is solved by building an indexing structure that acts like a filter to discard non-relevant items and select a subset of candidate DNA sequences. Clustering algorithms can organize the multi-dimensional feature data into the indexing structure for effective retrieval. Accordingly, the query sequence need be compared only against candidate sequences rather than all sequences in the database. In effect, our algorithm provides a pre-processing method to accelerate the DNA sequence search process. Finally, experimental results further demonstrate the efficiency of our proposed algorithm for DNA sequence similarity retrieval.
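The complex-value assignment and Fourier step described above can be illustrated with a minimal 1D sketch. The particular base-to-complex mapping below (corners of the unit square in the complex plane) and the naive DFT are assumptions for illustration; the paper does not state its exact values, and it applies the transform to the 2D Peano-scanned pattern rather than the raw 1D sequence.

```python
import cmath

# Hypothetical base-to-complex mapping (the paper's values are not given).
MAP = {"A": 1 + 1j, "C": -1 + 1j, "T": -1 - 1j, "G": 1 - 1j}

def dft_magnitudes(seq):
    """Naive DFT of the complex-encoded sequence; returns |X[k]|."""
    x = [MAP[b] for b in seq]
    n = len(x)
    return [abs(sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n)
                    for t in range(n)))
            for k in range(n)]

# A constant sequence concentrates all energy at k = 0:
mags = dft_magnitudes("AAAA")
print([round(m, 6) for m in mags])  # [5.656854, 0.0, 0.0, 0.0]
```

The resulting magnitude spectrum is the kind of numeric representation from which PCA features can then be extracted for indexing.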

  • Research Article
  • Cited by 2
  • 10.1080/15257770.2021.1951755
Optimum model selection and statistical analysis for DNA sequences
  • Jul 8, 2021
  • Nucleosides, Nucleotides & Nucleic Acids
  • Ahmed M Dessouky + 3 more

In this article, we study the statistical characteristics and examine the performance of the original representation and mathematical modelling of deoxyribonucleic acid (DNA) sequences. The proposed mathematical modelling approach creates closed formulas for the original DNA data sequences with different methods. Accuracy of representation is studied based on evaluation metric values. The Root Mean Squared Error (RMSE) and correlation coefficient (R) are used for examining the accuracy of all mathematical models to select the optimum one for DNA representation. In addition, statistical parameters such as energy, entropy, standard deviation, variance, mean, range, Mean Absolute Deviation (MAD), skewness, and kurtosis are also used for the selection of the optimum model for DNA representation. Finally, spectral estimation methods are used for exon prediction, i.e., determination of the coding region (exon) for actual sequences and the selected mathematical models: Sum of Sinusoids (SoS) with 8 terms and Gaussian with 8 terms. The exon prediction results from the original DNA sequences and the mathematically modelled DNA sequences coincide and confirm the success of the proposed sum-of-sinusoids model for modelling DNA sequences, while the Gaussian model is not appropriate for this task.

  • Research Article
  • 10.1158/1940-6207.prev-09-a45
Abstract A45: Kupffer cell and EtOH DNA synthesis
  • Jan 7, 2010
  • Cancer Prevention Research
  • Solomon E Owumi

Kupffer cells (KC) are the resident hepatic macrophages located in the sinusoidal space of the liver. KC play an important role in hepatic homeostasis and in the response of the liver to xenobiotics. They phagocytize foreign bodies and old red blood cells and interact with endotoxin, resulting in their activation. Activated KC release cytokines such as tumor necrosis factor alpha (TNF-α) and reactive oxygen species (ROS) like superoxide. TNF-α and superoxide have been implicated in signal transduction involved in cell growth, gene expression, and apoptosis. Ethanol (EtOH) consumption has been causally linked to the etiology of primary hepatocellular cancer (HCC), although EtOH is not a known carcinogen. One mechanism of KC activation is via EtOH-induced endotoxemia, which leads to an increase in cytokine and ROS production by the KC, consequently impacting and potentially inducing hepatocyte growth. Here we propose that depletion of Kupffer cells will attenuate the effect of EtOH-induced endotoxemia, decreasing the hepatocyte deoxyribonucleic acid (DNA) synthesis required for cell growth. The rate of hepatocyte DNA synthesis was examined as a marker of cell growth in C57BL6 mice depleted of KC (KC−) with clodronate-liposome and in KC-competent mice (KC+). Both groups of mice were fed an isocaloric EtOH liquid diet (3% w/v EtOH). Control mice were fed a liquid diet without EtOH for 1 week. DNA synthesis was assessed by 5-bromo-2′-deoxyuridine (BrdU) incorporation into hepatocytes undergoing replicative DNA synthesis, detected by BrdU immunohistochemistry. Apoptotic body formation was examined by the TUNEL assay, and phosphorylation of Extracellular Regulated Kinases (ERK 1/2), believed to be involved in the growth signalling pathway, was evaluated by western blotting. TNF-α release was assessed from total mRNA transcript via RT-PCR. Toxicity was assessed by the presence of the liver transaminases aspartate aminotransferase (AST) and alanine aminotransferase (ALT) in serum. Serum AST and ALT were within the normal reference range in control and treated mice, indicative of insignificant EtOH- or clodronate-induced toxicity. In KC+ mice fed EtOH, hepatic DNA synthesis increased 183% compared to control KC+ mice. Hepatocyte DNA synthesis in KC− mice remained at the same level as in control KC+ mice. In KC− mice fed EtOH, there was a 50% decrease in hepatic DNA synthesis compared to control KC+ mice and a 74% decrease compared to KC+ mice fed the EtOH liquid diet. TNF-α release was 148% in KC+ mice fed EtOH, 69% in KC−, and 72% in KC− fed EtOH, compared to control KC+ mice. However, there was a slight increase in TNF-α release in KC− mice fed the EtOH liquid diet compared to KC− controls. An increase in the phosphorylation of protein kinases p42/44 (ERK1/2) in control KC+ mice fed EtOH was observed compared to KC− mice fed ethanol. There was a slight increase in apoptotic body accumulation in KC− mice fed the EtOH liquid diet and in KC− control mice. Taken together, these observations suggest that EtOH induces hepatocyte DNA synthesis in KC+ mice and, to a lesser extent, in KC− mice, indicating a role for the KC in hepatocyte DNA synthesis and possibly in the development of hepatocarcinogenesis. Depletion of KC attenuates the downstream effect of ethanol-induced endotoxemia, potentially by a mechanism involving reduced TNF-α and ROS production, with a concomitant effect on the ERK 1/2 signaling pathway and hepatocyte DNA synthesis, thereby supporting our hypothesis. Citation Information: Cancer Prev Res 2010;3(1 Suppl):A45.

  • Research Article
  • Cited by 3
  • 10.1051/e3sconf/202341201090
DNA technology for big data storage and error detection solutions: Hamming code vs Cyclic Redundancy Check (CRC)
  • Jan 1, 2023
  • E3S Web of Conferences
  • Manar Sais + 2 more

There is an increasing need for high-capacity, high-density storage media that can retain data for a long time, due to the exponential growth in the amount of information generated. The durability and high information density of synthetic deoxyribonucleic acid (DNA) make it an attractive and promising medium for data storage. DNA data storage technology is expected to revolutionize data storage in the coming years, replacing various Big Data storage technologies. As a medium that addresses the need for high-latency, immutable information storage, DNA has several potential advantages. One of the key advantages of DNA storage is its extraordinary density: theoretically, a gram of DNA can encode 455 exabytes, at 2 bits per nucleotide. Unlike other digital storage media, synthetic DNA enables large quantities of data to be stored in a biological medium. This reduces the need for traditional storage media such as hard disks, which consume energy, require materials such as plastic or metals, and often generate electronic waste when they become obsolete or damaged. Additionally, even after thousands of years under non-ideal conditions, degraded DNA generally remains readable. Furthermore, as DNA possesses natural reading and writing enzymes as part of its biological functions, data retrieval from DNA is expected to remain possible for the foreseeable future. However, the high error rate poses a significant challenge for DNA-based information coding strategies. Currently, it is impossible to execute DNA strand synthesis, amplification, or sequencing error-free. To utilize synthetic DNA as a storage medium for digital data, specialized systems and solutions for direct error detection and correction must be implemented. The goal of this paper is to introduce DNA storage technology, outline the benefits and added value of this approach, and present an experiment comparing the effectiveness of two error detection and correction codes (Hamming and CRC) used in the DNA data storage strategy.
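For readers unfamiliar with the two codes being compared: a Hamming code can locate and correct a single flipped bit from its syndrome, while a checksum such as CRC only detects corruption. The Hamming(7,4) layout below is the textbook construction, shown as an illustration of the mechanism rather than the paper's exact encoding.

```python
def hamming74_encode(d):
    """d = [d1, d2, d3, d4] -> 7-bit codeword [p1, p2, d1, p3, d2, d3, d4]."""
    d1, d2, d3, d4 = d
    p1 = d1 ^ d2 ^ d4
    p2 = d1 ^ d3 ^ d4
    p3 = d2 ^ d3 ^ d4
    return [p1, p2, d1, p3, d2, d3, d4]

def hamming74_decode(c):
    """Correct at most one flipped bit, then return the 4 data bits."""
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]   # parity over positions 1, 3, 5, 7
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]   # parity over positions 2, 3, 6, 7
    s3 = c[3] ^ c[4] ^ c[5] ^ c[6]   # parity over positions 4, 5, 6, 7
    syndrome = s1 + 2 * s2 + 4 * s3  # 1-based error position, 0 = no error
    if syndrome:
        c = c[:]
        c[syndrome - 1] ^= 1
    return [c[2], c[4], c[5], c[6]]

word = hamming74_encode([1, 0, 1, 1])
word[4] ^= 1                          # simulate one substitution error
print(hamming74_decode(word))         # [1, 0, 1, 1] -- corrected
```

In a DNA pipeline such parity bits would themselves be mapped onto nucleotides, so the code's overhead (3 parity bits per 4 data bits here) translates directly into extra synthesized bases, one axis of the Hamming-vs-CRC comparison the paper sets up.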

  • Research Article
  • Cited by 7
  • 10.1002/nano.202100275
Current and emerging opportunities in biological medium‐based computing and digital data storage
  • Nov 7, 2021
  • Nano Select
  • Devasier Bennet + 2 more

Traditional storage methods have limitations and concerns regarding capacity, decay, and sustainability. These drawbacks can be mitigated by developing long-term digital information storage systems using deoxyribonucleic acid (DNA), often referred to as DNA-based data storage. These advanced technologies for storing big data are enabled by DNA synthesis, DNA sequencing, and encoding and decoding algorithms that can pack information into DNA with extreme durability, environmental sustainability, energy conservation, eternal relevance, and higher density than conventional systems. This field has become a hot topic for researchers in the past decade, with significant breakthroughs along the way. This review provides a comprehensive overview of the latest advances in in vivo and in vitro DNA digital storage, with novel modalities, preservation techniques, applications, and practical and technical issues. It also summarizes the field of in vivo molecular writing, which records and stores data within cells' genomes and lies at the growing intersection of biocomputing and biotechnology.

  • Research Article
  • Cited by 35
  • 10.1016/j.ijinfomgt.2018.08.011
A lossless DNA data hiding approach for data authenticity in mobile cloud based healthcare systems
  • Sep 25, 2018
  • International Journal of Information Management
  • Mohammad Saidur Rahman + 2 more


  • Research Article
  • Cited by 271
  • 10.1111/1467-985x.00264
Inferences from DNA Data: Population Histories, Evolutionary Processes and Forensic Match Probabilities
  • May 6, 2003
  • Journal of the Royal Statistical Society Series A: Statistics in Society
  • Ian J Wilson + 2 more

Summary: We develop a flexible class of Metropolis–Hastings algorithms for drawing inferences about population histories and mutation rates from deoxyribonucleic acid (DNA) sequence data. Match probabilities for use in forensic identification are also obtained, which is particularly useful for mitochondrial DNA profiles. Our data augmentation approach, in which the ancestral DNA data are inferred at each node of the genealogical tree, simplifies likelihood calculations and permits a wide class of mutation models to be employed, so that many different types of DNA sequence data can be analysed within our framework. Moreover, simpler likelihood calculations imply greater freedom for generating tree proposals, so that algorithms with good mixing properties can be implemented. We incorporate the effects of demography by means of simple mechanisms for changes in population size and structure, and we estimate the corresponding demographic parameters, but we do not here allow for the effects of either recombination or selection. We illustrate our methods by application to four human DNA data sets, consisting of DNA sequences, short tandem repeat loci, single-nucleotide polymorphism sites and insertion sites. Two of the data sets are drawn from the male-specific Y-chromosome, one from maternally inherited mitochondrial DNA and one from the β-globin locus on chromosome 11.

  • Book Chapter
  • Cited by 2
  • 10.1002/9780470015902.a0001000.pub3
Mutagenesis: Site‐Specific
  • Jun 16, 2014
  • Encyclopedia of Life Sciences
  • Mari Walquist + 2 more

Site-specific mutagenesis techniques, also known as site-directed mutagenesis (SDM), aim to introduce precise alterations in any coding or noncoding deoxyribonucleic acid (DNA) sequence, usually in vitro. These modifications can be as small as a single nucleotide or as large as several hundred, at one site or at multiple sites in the same DNA sequence. Recently, such alterations have also been developed in vivo. SDM success depends on how changes are introduced and how mutant selection is done. DNA sequence analysis has to be performed to verify the change(s) before any biochemical or biological experiments are done. Recent methods for SDM and the most widely used commercial kits are discussed, and a list of companies offering SDM services is included. The authors also list software used for mutagenic oligonucleotide primer design. These techniques are revolutionising our understanding of genetic and molecular mechanisms, protein structure-function relationships, protein-protein interactions, and binding sites in any biological system. In addition to the academic benefits of SDM, SDM techniques have impacted biotechnology and applied fields such as engineering new enzymes, drug development, and optimisation of heterologous gene expression and secretion. Key Concepts: All site-specific alterations requiring the site-directed mutagenesis technique are done at the DNA level, making them heritable modifications; modifications done at the protein level are not heritable. The results of these alterations are reflected in the encoded amino acid sequence of the proteins or in any targeted binding site in the DNA sequence. Several simplified techniques are now available. Selection of the altered DNA molecules from the pool of unmodified parental molecules is essential. DNA sequencing to verify the DNA change is a fundamental part of the technique. The biological and biochemical ramifications of SDM are usually the purpose for which SDM is done in the first place.

  • Conference Article
  • Cited by 7
  • 10.1109/iced.2016.7804650
DNA sequence alignment: A review of hardware accelerators and a new core architecture
  • Aug 1, 2016
  • D S Nurdin + 2 more

Deoxyribonucleic acid (DNA) sequence alignment is essentially a way of comparing two or more DNA sequences with the aim of finding regions of similarity among them. The Smith-Waterman (SW) algorithm is a local alignment algorithm able to identify mutations in DNA sequences; however, it tends to be slow when computing long DNA sequences. For over a decade, Field Programmable Gate Arrays (FPGAs) have played an important role in DNA sequence alignment. Moreover, pipelining is a well-known technique used to speed up the performance of hardware designs. A systolic array (SA)-based DNA sequence alignment architecture reduces the execution time of alignment matrix computation from quadratic to linear time complexity. In this paper, existing FPGA-based sequence alignment core architectures are discussed, followed by a proposal for a new SA-based DNA sequence alignment core architecture. The design was synthesized on the Xilinx Spartan-3E XC3S1600E-FG3205. Results showed that the developed core architecture is 1.2× faster than other reported FPGA-based designs.
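The quadratic-time dynamic program that the systolic array accelerates is compact enough to state directly. A minimal Smith-Waterman scoring sketch follows; the scoring parameters are illustrative, not those of the reviewed designs.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-1):
    """Return the best local alignment score between sequences a and b."""
    rows, cols = len(a) + 1, len(b) + 1
    H = [[0] * cols for _ in range(rows)]  # DP matrix, clamped at 0
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            diag = H[i-1][j-1] + (match if a[i-1] == b[j-1] else mismatch)
            H[i][j] = max(0, diag, H[i-1][j] + gap, H[i][j-1] + gap)
            best = max(best, H[i][j])
    return best

print(smith_waterman("ACGT", "ACGT"))  # 8: four matches at +2 each
print(smith_waterman("ACGT", "TTTT"))  # 2: best local match is a single T
```

Each cell depends only on its left, upper, and upper-left neighbors, which is exactly the data-flow pattern a systolic array exploits to compute one anti-diagonal of the matrix per clock cycle, reducing the wall-clock complexity from quadratic to linear.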

  • Research Article
  • Cited by 2
  • 10.4015/s1016237221500526
MODIFIED P-SPECTRUM-BASED APPROACH TO ENHANCE SENSITIVITY FOR THE DETECTION OF CpG ISLANDS IN DNA SEQUENCES IN HUMAN SPECIES
  • Oct 20, 2021
  • Biomedical Engineering: Applications, Basis and Communications
  • Pardeep Garg + 1 more

The CpG Island (CGI) is considered one of the important segments of deoxyribonucleic acid (DNA) sequences. Among the various epigenetic events associated with CGIs: CGIs are useful in the prediction of promoter regions and subsequently for gene prediction, the contribution of CGIs to finding the epigenetic causes of cancer is of great importance, and CGIs can be used to identify chromosome inactivation. Therefore, the exact and maximum number of CGIs hidden in DNA sequences needs to be explored. Many computational, transform-based approaches have been developed and reported in the literature for the identification of CGIs in DNA sequences over the years. The problem associated with transform-based approaches is that the domain in which the algorithm functions must be changed, which can lead to biasing and result in the loss of important information in terms of CGIs. Hence, to address this issue, a modified P-spectrum-based approach is proposed here which does not suffer from the domain transformation issue. The performance of the proposed algorithm has been tested on a large data set of 100 DNA sequences of human species and compared with other recently reported methods of CGI identification in DNA sequences. The results obtained show that the proposed algorithm is better than the existing methods at identifying a greater number of CGIs in DNA sequences. Therefore, the proposed algorithm is considered an efficient approach to enhance the sensitivity of CGI identification.
