Cluster Analysis to Find Sets of High-frequency Queries for Filtering in Similarity Join

Kamolwan Kunanusont, Jaruloj Chongstitvatana

doi:10.37936/ecti-cit.2016101.58158

Open Access

https://doi.org/10.37936/ecti-cit.2016101.58158

Abstract

Similarity search and similarity join are important operations in text databases. In some situations, similar queries, called high-frequency queries, are repeated over a period of time. A high-frequency-queries-based filter is used to facilitate this type of query. However, the performance of this method depends mostly on the chosen high-frequency queries. This paper proposes two methods, called DBRAN and DBSM, which are based on DBSCAN and an agglomerative hierarchical clustering algorithm, to find high-frequency queries for the filter. For evaluation, both DBRAN and DBSM are applied to various sets of queries to find high-frequency queries for three datasets. It is found that DBSM performs better than DBRAN when the variation among high-frequency queries is high. However, when the variation among high-frequency queries is low, the performance of DBRAN and DBSM is about the same.

Full Text