Abstract

The fundamental step in genomic signal processing is to assign a mathematical descriptor to each nucleotide {A, T, G, C} of a DNA molecule so that the sequence has a discrete representation. That representation should preserve the biological information of the gene when it is analyzed with digital signal processing tools. To this end, a novel binary representation of DNA sequences that combines structural and chemical information of the original sequence is proposed for identifying protein coding regions in eukaryotes. The identification model comprises two stages: numerical encoding, followed by analysis of biological behavior through digital signal processing algorithms. In the first stage, a new numerical encoding method based on Walsh codes of order 4 produces a 1-D binary discrete sequence. In the second stage, the modified Gabor wavelet transform (MGWT) is applied to the discretized DNA sequence for spectrum analysis. Together, the optimal gene numerical encoding and the multiresolution approach of MGWT readily identify the structures of coding regions in unknown gene sequences. The proposed model is validated by analyzing prediction efficiency in terms of sensitivity, specificity, and accuracy at both the sequence and database levels. Furthermore, the results are compared with state-of-the-art encoding methods by plotting receiver operating characteristic (ROC) curves over all classification thresholds. An area under the curve (AUC) of 0.86 at the sequence level and 0.84 at the database level is achieved. These performance metrics indicate that the proposed encoding method performs relatively better than other numerical encoding methods.
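
The first-stage encoding can be illustrated with a minimal sketch. It builds the four Walsh codes of order 4 from a Sylvester-type Hadamard matrix and concatenates one 4-bit code per nucleotide into a 1-D binary sequence. The abstract does not state which code is assigned to which nucleotide, nor how +1/-1 entries map to bits, so the assignment in `ASSUMED_MAP` and the convention +1 -> 0, -1 -> 1 are assumptions made only for illustration; the function names are likewise hypothetical.

```python
def hadamard(order):
    """Sylvester construction of a Hadamard matrix; order must be a power of 2."""
    h = [[1]]
    while len(h) < order:
        # [[H, H], [H, -H]] doubling step
        h = [row + row for row in h] + [row + [-x for x in row] for row in h]
    return h


# Walsh codes of order 4 as bit patterns (assumed convention: +1 -> 0, -1 -> 1).
WALSH4 = [[(1 - x) // 2 for x in row] for row in hadamard(4)]

# Hypothetical nucleotide-to-code assignment; the paper's actual mapping,
# which draws on structural and chemical information, is not given here.
ASSUMED_MAP = dict(zip("ATGC", WALSH4))


def encode(seq):
    """Concatenate one 4-bit Walsh code per nucleotide into a 1-D binary list."""
    bits = []
    for base in seq.upper():
        bits.extend(ASSUMED_MAP[base])
    return bits


if __name__ == "__main__":
    print(encode("ATGC"))
```

Under these assumptions each nucleotide contributes four samples, so a sequence of N bases yields a binary signal of length 4N that a spectral method such as MGWT could then analyze.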
