Abstract
A larger amount of sequence data in private and public databases produced by next-generation sequencing put new challenges due to limitation associated with the alignment-based method for sequence comparison. So, there is a high need for faster sequence analysis algorithms. In this study, we developed an alignment-free algorithm for faster sequence analysis. The novelty of our approach is the inclusion of fuzzy integral with Markov chain for sequence analysis in the alignment-free model. The method estimate the parameters of a Markov chain by considering the frequencies of occurrence of all possible nucleotide pairs from each DNA sequence. These estimated Markov chain parameters were used to calculate similarity among all pairwise combinations of DNA sequences based on a fuzzy integral algorithm. This matrix is used as an input for the neighbor program in the PHYLIP package for phylogenetic tree construction. Our method was tested on eight benchmark datasets and on in-house generated datasets (18 s rDNA sequences from 11 arbuscular mycorrhizal fungi (AMF) and 16 s rDNA sequences of 40 bacterial isolates from plant interior). The results indicate that the fuzzy integral algorithm is an efficient and feasible alignment-free method for sequence analysis on the genomic scale.
Highlights
Phylogenetic tree analysis and comparative studies of taxa are essential parts of modern molecular biology
Examples of match length methods are, k-mismatch average common substring[32], average common substring[28], Kr – method[28], etc. These methods are commonly used for string processing in computer science
We propose to use fuzzy integral[33] to analyze DNA sequences based on a Markov chain[34], which can be categorised as k-mer or word frequency method
Summary
Phylogenetic tree analysis and comparative studies of taxa are essential parts of modern molecular biology. Large amounts of sequence data produced by next-generation sequencing techniques have become available in private and public databases, which has created new challenges due to the limitations associated with alignment based approaches. This plethora of sequence information increases the computation and time requirements for genome comparisons in computational biology. We propose to use fuzzy integral[33] to analyze DNA sequences based on a Markov chain[34], which can be categorised as k-mer or word frequency method. The consistency can be seen from the statistical analysis such as AUC (area under the ROC) values, calculated from ROC curves provided in Supplementary Material
Talk to us
Join us for a 30 min session where you can share your feedback and ask us any queries you have
Disclaimer: All third-party content on this website/platform is and will remain the property of their respective owners and is provided on "as is" basis without any warranties, express or implied. Use of third-party content does not indicate any affiliation, sponsorship with or endorsement by them. Any references to third-party content is to identify the corresponding services and shall be considered fair use under The CopyrightLaw.