Abstract

This study extends the analysis of genomic data by computing a variety of complexity measures. We analyze the effect of window size and evaluate the precision and recall of gene-zone prediction, using a much larger dataset (full chromosomes). A technique based on separating two cases (gene-containing and non-gene-containing) has been developed as a basic gene predictor for automated DNA analysis. The predictor was tested on various human DNA sequences obtained from public databases, in a set of three experiments. The first covers window size and other parameters; the second analyzes a full human chromosome (198 million nucleotides); and the third tests subject variability across five individual subjects. All three experiments yield high recall and precision, indicating that the predictor is effective.

Highlights

  • The analysis of complexity measures of genomic sequences is one of the pre-processing techniques that can lead to better pattern recognition and pattern inference in DNA sequences

  • The present research is based on the premise that genomic data complexity and information content can distinguish gene-containing regions from regions that do not contain genes, using only quantifiable information from complexity measures

  • The experimental results, in particular the recall and precision ratios for the cases considered, show that data complexity offers a high-quality prediction of gene-coding zones in human DNA

Summary

Introduction

The analysis of complexity measures of genomic sequences is one of the pre-processing techniques that can lead to better pattern recognition and pattern inference in DNA (deoxyribonucleic acid) sequences. An initial study of the use of complexity metrics showed that certain statistical properties of the sequence of complexity measures differed significantly between subsequences that contained genes and subsequences that did not, even though it was tested on a relatively small dataset. These results suggest that such transformations are worth pursuing further, so that information is conveyed more precisely to the computational intelligence algorithms to be developed in future research.

Biological life, encoded by DNA, fulfills all six attributes in the following way: (1) DNA contains a large number of genes, and the relationship between genes is not yet fully understood; (2) most DNA and genetic expressions depend on previous events; (3) genes can adapt to different conditions; (4) the relationship between genes, DNA and the environment is only beginning to be understood; (5) genes and DNA can be influenced by, or can adapt themselves to, their environment; and (6) gene expressions are sensitive to initial conditions and environmental factors.
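To make the idea of a "sequence of complexity measures" concrete, the following is a minimal sketch of one such measure, Shannon entropy, computed over fixed-size windows of a DNA string. It is an illustration under stated assumptions and not the authors' implementation: the function names, the toy sequence, and the window size and step are all hypothetical.

```python
import math
from collections import Counter
from typing import List


def shannon_entropy(window: str) -> float:
    """Shannon entropy, in bits per symbol, of the symbol frequencies in `window`."""
    counts = Counter(window)
    total = len(window)
    # H = -sum(p * log2(p)) over the symbols that actually occur in the window.
    return sum(-(n / total) * math.log2(n / total) for n in counts.values())


def windowed_entropy(sequence: str, window_size: int, step: int) -> List[float]:
    """Entropy of each full window of `window_size` symbols, advancing by `step`."""
    return [
        shannon_entropy(sequence[i:i + window_size])
        for i in range(0, len(sequence) - window_size + 1, step)
    ]


# Toy example (hypothetical data): a repetitive window followed by a mixed one.
profile = windowed_entropy("AAAACGTA", window_size=4, step=4)
print(profile)
```

A window made of a single repeated nucleotide has zero entropy, while a window with all four nucleotides in equal proportion reaches the maximum of 2 bits per symbol. A predictor of the kind described here would compare statistics of such a profile between gene-containing and non-gene-containing regions.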

Previous Work in DNA Complexity Study
Measures for Data Complexity and Entropy
Shannon Entropy
Statistical Complexity
Kolmogorov Complexity
Characteristics of Human DNA
Predictor Proposal
Experimental Work
Description of the Experimental Processes
Analysis of Results
Experiment 1
Experiment 2
Experiment 3
Conclusions