Abstract
Cluster validation, which is the process of evaluating the quality of clustering results, plays an important role for practical machine learning systems. Categorical sequences, such as biological sequences in computational biology, have become common in real-world applications. Different from previous studies, which mainly focused on attribute-value data, in this paper, we work on the cluster validation problem for categorical sequences. The evaluation of sequences clustering is currently difficult due to the lack of an internal validation criterion defined with regard to the structural features hidden in sequences. To solve this problem, in this paper, a novel cluster validity index (CVI) is proposed as a function of clustering, with the intracluster structural compactness and intercluster structural separation linearly combined to measure the quality of sequence clusters. A partition-based algorithm for robust clustering of categorical sequences is also proposed, which provides the new measure with high-quality clustering results by the deterministic initialization and the elimination of noise clusters using an information theoretic method. The new clustering algorithm and the CVI are then assembled within the common model selection procedure to determine the number of clusters in categorical sequence sets. A case study on commonly used protein sequences and the experimental results on some real-world sequence sets from different domains are given to demonstrate the performance of the proposed method.
Talk to us
Join us for a 30 min session where you can share your feedback and ask us any queries you have
More From: IEEE Transactions on Neural Networks and Learning Systems
Disclaimer: All third-party content on this website/platform is and will remain the property of their respective owners and is provided on "as is" basis without any warranties, express or implied. Use of third-party content does not indicate any affiliation, sponsorship with or endorsement by them. Any references to third-party content is to identify the corresponding services and shall be considered fair use under The CopyrightLaw.