Cluster Splitting Based High Dimensional Metric Space Index B<sup>+</sup>-Tree

Jun-Qi Zhang

doi:10.3724/sp.j.1001.2008.01401

Abstract

To improve query efficiency, high-dimensional metric space indexes often use clustering techniques such as K-means to capture the data distribution. In previous work, however, the clustering parameters are chosen empirically, and the relationship between clustering and query performance lacks theoretical analysis. This paper presents a new high-dimensional metric space index, a cluster-splitting-based B<sup>+</sup>-tree. By splitting clusters, the data space is partitioned more finely, which reduces the amount of data accessed per query. The relationship between clustering and query cost is discussed, and, based on a query cost model, formulas are given for computing the clustering parameters, such as the number of cluster splits, that minimize the query cost in theory. Experimental results show that the proposed index is clearly more efficient than iDistance, M-Tree, and sequential scan, and that the estimated optimal number of splits is very close to the parameter value that actually yields the best query performance.
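For orientation, the following is a minimal sketch of the iDistance-style baseline that the abstract compares against, not the cluster-splitting method proposed in the paper. It assumes Euclidean distance, cluster centers that are already given (for example from K-means), and a sorted Python list standing in for the B<sup>+</sup>-tree. The class name `IDistanceSketch`, the helper `dist`, and the key spacing constant `c` are hypothetical names chosen for this illustration; `c` is assumed to be larger than any cluster radius.

```python
import bisect
import math


def dist(a, b):
    """Euclidean distance between two equal-length tuples (assumed metric)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


class IDistanceSketch:
    """One-dimensional key mapping in the style of iDistance.

    Each point is assigned to its nearest center i and stored under the
    key i * c + dist(point, center_i). A sorted list plays the role of
    the B+-tree; c is assumed to exceed every cluster radius.
    """

    def __init__(self, centers, points, c=1000.0):
        self.centers = centers
        self.c = c
        self.radii = [0.0] * len(centers)
        entries = []
        for p in points:
            # Nearest center; ties go to the lowest index.
            i = min(range(len(centers)), key=lambda j: dist(p, centers[j]))
            d = dist(p, centers[i])
            self.radii[i] = max(self.radii[i], d)
            entries.append((i * c + d, p))
        entries.sort(key=lambda e: e[0])
        self.keys = [k for k, _ in entries]
        self.points = [p for _, p in entries]

    def range_query(self, q, r):
        """Return all indexed points within distance r of q."""
        result = []
        for i, center in enumerate(self.centers):
            dq = dist(q, center)
            if dq - r > self.radii[i]:
                continue  # query ball cannot intersect this cluster
            # Triangle inequality: candidates lie in this key interval.
            lo = i * self.c + max(0.0, dq - r)
            hi = i * self.c + min(self.radii[i], dq + r)
            start = bisect.bisect_left(self.keys, lo)
            end = bisect.bisect_right(self.keys, hi)
            for p in self.points[start:end]:
                if dist(p, q) <= r:  # filter out false positives
                    result.append(p)
        return result
```

In this sketch, every point in a cluster whose key falls inside the scanned interval must be checked against the query, so a coarse cluster can force many unnecessary distance computations. Partitioning the data more finely, which is the motivation the abstract gives for cluster splitting, shrinks those intervals at the price of scanning more partitions.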

Full Text