Abstract

AbstractClustering is one of the major operations to analyse genome sequence data. Sophisticated sequencing technologies generate huge DNA sequence data; consequently, the complexity of analysing sequences is also increased. So, there is an enormous need for faster sequence analysis algorithms. Most of the existing tools focused on alignment‐based approaches, which are slow‐paced for sequence comparison. Alignment‐free approaches are more successful for fast clustering. The state‐of‐the‐art methods have been applied to cluster small genome sequences of various species; however, they are sensitive to large size sequences. To subdue this limitation, we propose a novel alignment‐free method called DNA sequence clustering with map‐reduce (DCMR). Initially, MapReduce paradigm is used to speed up the process of extracting eight different types of repeats. Then, the frequency of each type of repeat in a sequence is considered as a feature for clustering. Finally, K‐means (DCMR‐Kmeans) and K‐median (DCMR‐Kmedian) algorithms are used to cluster large DNA sequences by using extracted features. The two variants of proposed method are evaluated to cluster large genome sequences of 21 different species and the results show that sequences are very well clustered. Our method is tested for different benchmark data sets like viral genome, influenza A virus, mtDNA, and COXI data sets. Proposed method is compared with MeshClust, UCLUST, STARS, and ClustalW. DCMR‐Kmeans outperforms MeshClust, UCLUST, and DCMR‐Kmedian with respect to purity and NMI on virus data sets. The computational time of DCMR‐Kmeans is less than STARS, DCMR‐Kmedian, and much less than UCLUST on COXI data set.

Full Text
Paper version not known

Talk to us

Join us for a 30 min session where you can share your feedback and ask us any queries you have

Schedule a call

Disclaimer: All third-party content on this website/platform is and will remain the property of their respective owners and is provided on "as is" basis without any warranties, express or implied. Use of third-party content does not indicate any affiliation, sponsorship with or endorsement by them. Any references to third-party content is to identify the corresponding services and shall be considered fair use under The CopyrightLaw.