Abstract

With the discovery of new DNA sequences, a fundamental problem is how to assign those sequences to the correct species. Identifying all data groups correctly is difficult, and the need to predefine the number of clusters k is a major drawback of many clustering methods, especially when the data have many dimensions and the number of clusters is large and hard to guess. Furthermore, finding a similarity measure that preserves functionality and represents both the composition and the distribution of the bases in a DNA sequence is one of the main challenges in computational biology. In this paper, a new soft computing metaheuristic framework is introduced for automatic clustering that generates the optimal cluster formation and determines the best estimate of the number of clusters. A pulse coupled neural network (PCNN) is used to calculate DNA sequence similarity or dissimilarity. The bat algorithm is hybridized with the well-known genetic algorithm to solve the automatic data clustering problem. Extensive computational experiments are conducted on the expanded human oral microbiome database (eHOMD). A comparison of the experimental results shows that the proposed hybrid algorithm outperforms the standard genetic algorithm and the bat algorithm. The hybrid is also compared with competing algorithms from the literature to confirm its superiority. The Mann-Whitney-Wilcoxon rank-sum test is conducted to statistically validate the obtained clusters.

Highlights

  • The clustering problem is an unsupervised problem that aims to group similar objects together, discovering similar structures in unlabeled data without any prior knowledge [1] [2]

  • We propose a new chromosome design that can identify the optimal number of clusters for variable-length chromosomes without any prior knowledge

  • The expanded human oral microbiome database (eHOMD) provides information about bacterial species found in the human aerodigestive tract (ADT), including the nasal passages, sinuses, throat, esophagus, mouth, and lower respiratory tract. It includes a total of 775 microbial species and more than 1,000 microbial DNAs
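The idea of a variable-length chromosome mentioned in the highlights can be illustrated with a small sketch. This is not the paper's chromosome design, which is not described on this page; it assumes, purely for illustration, that a chromosome is a list of medoid indices whose length is the number of clusters. The names `random_chromosome` and `decode` and the bounds `k_min` and `k_max` are hypothetical.

```python
import random

def random_chromosome(n_items, k_min=2, k_max=10, rng=random):
    """Pick a random cluster count k, then k distinct medoid indices.

    Assumes n_items >= k_min. The chromosome length is the cluster count,
    so k does not have to be fixed in advance.
    """
    k = rng.randint(k_min, min(k_max, n_items))
    return rng.sample(range(n_items), k)

def decode(chromosome, dist):
    """Assign every item to its nearest medoid, given a distance matrix."""
    return [min(chromosome, key=lambda m: dist[i][m]) for i in range(len(dist))]

# Toy pairwise distances for four items (illustrative values only).
dist = [
    [0, 1, 5, 6],
    [1, 0, 4, 5],
    [5, 4, 0, 1],
    [6, 5, 1, 0],
]
labels = decode([0, 3], dist)  # items 0-1 go to medoid 0, items 2-3 to medoid 3
```

Under this assumed encoding, search operators that add or drop a medoid change the number of clusters directly, which is one way an evolutionary search can estimate that number.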


Summary

INTRODUCTION

The clustering problem is an unsupervised problem that aims to group similar objects together, discovering similar structures in unlabeled data without any prior knowledge [1] [2]. A new method based on the pulse coupled neural network, introduced by Xin Jin et al. [18], is applied to find the similarity or dissimilarity of DNA sequences. In this method, DNA is transformed into a numerical sequence using four number mapping schemes that represent the DNA effectively without losing any genetic information. The method is adopted because it handles DNA sequences of different lengths and takes both local and global features into consideration.
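As a minimal sketch of the numeric mapping step, the function below converts a DNA string into a list of numbers. The mapping `EXAMPLE_SCHEME` is a hypothetical placeholder chosen for illustration; the four mapping schemes used in [18] are not given here, and the function name `dna_to_numeric` is likewise an assumption.

```python
# Hypothetical base-to-number mapping, for illustration only.
# The actual mapping schemes of the cited method are not reproduced here.
EXAMPLE_SCHEME = {"A": 1, "C": 2, "G": 3, "T": 4}

def dna_to_numeric(sequence, scheme=EXAMPLE_SCHEME):
    """Map each base to a number, keeping one value per base.

    Raises ValueError on symbols outside the scheme so that no base is
    silently dropped during the conversion.
    """
    result = []
    for base in sequence.upper():
        if base not in scheme:
            raise ValueError(f"unsupported base: {base!r}")
        result.append(scheme[base])
    return result

numeric = dna_to_numeric("acgtta")
```

Because the output has the same length as the input, sequences of different lengths stay distinguishable after conversion.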

SOFT COMPUTING TECHNIQUES
RELATED WORK
PROPOSED SYSTEM
ENTROPY OF DNA SEQUENCES
CLUSTERING WITH GENETIC ALGORITHM
CLUSTERING WITH BAT ALGORITHM
DATA SET DESCRIPTION
SYSTEM CONFIGURATION AND PARAMETER SETTING
CONCLUSION AND FUTURE WORK