Abstract
Advances in sequencing technology have made large amounts of biological data available, and evolutionary analysis of data such as DNA sequences is central to biological studies. Because alignment methods are too computationally expensive for large-scale data, alignment-free methods have recently attracted attention in bioinformatics. In this paper, we introduce a new positional correlation natural vector (PCNV) method that converts a DNA sequence into an 18-dimensional numerical feature vector. The PCNV of a DNA sequence represents its nucleotide distribution through frequency and positional correlation. The vector design uses six features to characterize the correlation among nucleotide positions in sequences. PCNV is also easy to compute and can be used for rapid genome comparison. To test the method, we performed phylogenetic analysis of several viral and bacterial genome datasets with PCNV. For comparison, one alignment-based method, Bayesian inference, and two alignment-free methods, feature frequency profiles and natural vector, were applied to the same datasets. We found that PCNV is fast and accurate when used for phylogenetic analysis and classification of viruses and bacteria.
Highlights
Predicting the structures, functions, and evolutionary relationships of genes is a fundamental and vital aspect of modern biological research
To demonstrate that positional correlation natural vector (PCNV) is effective, we applied it to different datasets: the genomes of hepatitis C virus (HCV), hepatitis B virus (HBV), human papillomavirus (HPV), dengue virus (DENV), and 59 bacterial species
We found that PCNV categorizes the dataset into the correct biological groups in 0.78 s (Figure 4a; Table 1); this is much faster than the feature frequency profiles (FFP) method, which takes 35 s (Table 1)
Summary
Predicting the structures, functions, and evolutionary relationships of genes is a fundamental and vital aspect of modern biological research. A notable common feature of alignment-free (AF) approaches is the analysis of special numerical properties of the sequences being compared. AF approaches include iterated-function systems [5], information theory [6], Fourier transformations [7], sequence representations based on chaos theory [8], and moments of the positions of the nucleotides [9,10]. The most widely used AF approach is the k-mer-based method, which has been applied in many published studies [11,12,13,14,15,16,17,18,19]. This method analyzes the frequency of strings of a specific length k within sequences [20]. Several k-mer-based methods have been developed and applied to the phylogenetic analysis of bacteria and viruses. A notable example is feature frequency profiles (FFP) [21].
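The k-mer idea described above can be illustrated with a minimal Python sketch. It counts every overlapping string of length k in a sequence and normalizes the counts to relative frequencies. The function name `kmer_frequencies` and the toy sequence are assumptions made for illustration; this is not the FFP implementation or the PCNV method, only the basic frequency counting that k-mer-based methods build on.

```python
from collections import Counter


def kmer_frequencies(sequence, k):
    """Return the relative frequency of each overlapping k-mer in `sequence`.

    Hypothetical helper for illustration only; not taken from the paper.
    """
    seq = sequence.upper()
    total = len(seq) - k + 1  # number of overlapping windows of length k
    if k <= 0 or total <= 0:
        return {}  # sequence shorter than k, or invalid k
    counts = Counter(seq[i:i + k] for i in range(total))
    return {kmer: n / total for kmer, n in counts.items()}


# Toy example: the 2-mers of ACGTACGT are AC, CG, GT, TA, AC, CG, GT.
profile = kmer_frequencies("ACGTACGT", 2)
```

Two sequences can then be compared by computing a distance between their frequency profiles over a shared set of k-mers. The choice of k and of the distance measure varies between methods and is not specified here.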