Abstract

Identifying protein-coding regions in DNA sequences is one of the most fundamental tasks in bioinformatics. A number of model-independent approaches have been developed to distinguish protein-coding from non-protein-coding regions of DNA. However, these methods are often based on univariate analysis algorithms, which lose the joint information among the four nucleotides of DNA. In this article, we introduce a method based on noise-assisted multivariate empirical mode decomposition (NA-MEMD) and the modified Gabor-wavelet transform (MGWT). NA-MEMD, a multivariate analysis tool, is used to reconstruct the numerical sequence under analysis because it enables a matched-scale decomposition across all variables and eliminates mode mixing. By virtue of NA-MEMD, the MGWT method achieves a stable improvement in overall identification performance. We compare our method with other digital signal processing (DSP) methods on two representative DNA sequences and three benchmark datasets. The results show that our method enhances the spectra of the analyzed sequences and improves the robustness of MGWT across different DNA sequences, thereby achieving higher accuracy in identifying protein-coding regions than the other methods applied. In addition, a comparative experiment against the model-dependent method AUGUSTUS on the recently proposed benchmark dataset G3PO confirms the advantage of model-independent methods (especially NA-MEMD-MGWT) for identifying coding regions in poor-quality DNA sequences.
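
As background for the DSP setting described above, the following is a minimal sketch of how a DNA string can be turned into a four-channel numerical signal and scanned for spectral content. It is not the NA-MEMD-MGWT method: the abstract does not specify the numerical mapping or the transform parameters, so the binary indicator mapping, the plain windowed Fourier sum at frequency 1/3 (the period-3 component commonly associated with coding regions), and the names `indicator_sequences` and `period3_power` are all assumptions made for illustration.

```python
import numpy as np


def indicator_sequences(seq):
    """Map a DNA string to four binary indicator channels (A, C, G, T).

    Assumption: a simple binary indicator mapping; the article's own
    numerical mapping is not stated in the abstract.
    """
    seq = seq.upper()
    return np.array(
        [[1.0 if base == nt else 0.0 for base in seq] for nt in "ACGT"]
    )


def period3_power(seq, window=351, step=1):
    """Sliding-window spectral power at frequency 1/3, summed over channels.

    Assumption: a plain rectangular-window Fourier sum stands in for the
    modified Gabor-wavelet transform used in the article.
    """
    x = indicator_sequences(seq)              # shape (4, N)
    n = x.shape[1]
    k = np.arange(window)
    kernel = np.exp(-2j * np.pi * k / 3.0)    # complex exponential at f = 1/3
    powers = []
    for start in range(0, n - window + 1, step):
        segment = x[:, start:start + window]  # all four channels, one window
        coeff = segment @ kernel              # one Fourier coefficient per channel
        powers.append(np.sum(np.abs(coeff) ** 2))
    return np.array(powers)
```

Calling `period3_power` on a sequence returns one power value per window position; a strictly period-3 toy string such as `"ATG" * 50` gives high values, while a homopolymer gives values near zero. Each channel is processed independently here, which is exactly the univariate limitation the abstract argues a multivariate decomposition avoids.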
