Sequence Clustering Algorithms Based on Global and Local Similarity

Dong-Bo Dai,Yun Xiong

doi:10.3724/sp.j.1001.2010.03609

Abstract

现有的很多序列聚类算法是基于“局部特征可以表征整个序列”的假设来进行的,即不区分实际应用中序列的整体相似性和局部相似性.这对存在保守子模式的序列,如DNA和蛋白质序列是适用的,但对一些注重整体序列相似性的应用领域,如:在交易数据库中用户购买行为的比较,时间序列数据中全局模式的匹配等,由于难以产生频繁子模式,用基于全局相似性的度量方法进行聚类显得更为合理.此外,在基于局部相似性的序列聚类算法中,选取的局部子模式表征序列的能力也有待进一步提高.由此,针对不同应用领域,分别提出基于整体相似性的序列聚类算法GSClu和基于局部相似性的序列聚类算法LSClu.GSClu和LSClu分别利用带剪枝策略的二分k均值算法和基于有gap约束的强区分度子模式方法对各自领域的序列数据进行聚类.实验采用交易序列数据和蛋白质序列数据,实验结果表明,GSClu和LSClu对各自领域的序列数据具有较快的处理速度和良好的聚类质量.;Many current sequence clustering algorithms are based on the hypothesis that sequence can be characterized by its local features, without differentiating between global similarity and local similarity of sequences in different applications, which is applicable to biological sequences such as DNA and protein with conserved sub-patterns. However, in some domains such as the comparison of customers’ purchase behaviors in retail transaction database and the pattern match in time series data, due to difficulties in forming frequent sub-pattern, it is more reasonable to cluster these sequence data based on global similarity. Besides, among sequence clustering algorithms based on local similarity, the ability that sub-patterns characterize sequence should be improved. So, this paper proposes two clustering algorithms, GSClu (global similarity clustering) and LSClu (local similarity clustering), for different application fields, based on global and local similarity respectively. GSClu uses bisecting k-means technique and CSClu adopts sub-patterns with gap constraint to cluster the sequence data of corresponding application field. Sequence data in the experiments include retail transaction data and protein data. The experimental results show that GSClu and LSClu are of fast processing rate and high clustering quality.

Full Text