An Efficient Method for Mining Frequent Itemsets in Sparse Data Sets

In-Chang Park, Joong-Hyuk Chang, Won-Suk Lee

doi:10.3745/kipstd.2005.12d.6.817

Abstract

The main research problems in frequent itemset mining are reducing the memory usage and the processing time of the mining process. Most previous algorithms for finding frequent itemsets are based on the Apriori property and scan the data set multiple times. Moreover, their processing time increases sharply as the length of the maximal frequent itemset grows. Various approaches have been proposed in previous research to reduce the processing time, but most of them remain rather inefficient on sparse data sets. This paper proposes an efficient mining algorithm for finding frequent itemsets. A novel tree structure, called an $L_2$-tree, is proposed, together with an $L_2$-traverse algorithm that finds frequent itemsets using the $L_2$-tree. An $L_2$-tree is constructed from $L_2$, i.e., the set of frequent itemsets of size 2, and is therefore very small. The $L_2$-traverse algorithm obtains the complete set of frequent itemsets in a short time by traversing the $L_2$-tree only once. To reduce the processing time further, this paper also proposes an optimized algorithm, $C_3$-traverse, which removes in advance the itemsets in $L_2$ that cannot become part of a frequent itemset of size 3 ($L_3$). Various experiments verify that the proposed algorithms outperform existing methods, particularly on sparse data sets in which $L_2$ is relatively small.
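The general idea of working from the frequent 2-itemsets can be illustrated with a short Python sketch. This is not the authors' $L_2$-tree or their traverse algorithms, whose details the abstract does not give. It is a minimal stand-in under stated assumptions: a plain dictionary of successor sets takes the place of the tree, candidate supports are verified by rescanning the transactions, and all function names (`frequent_pairs`, `prune_pairs_without_triangle`, `candidates_from_pairs`, `mine`) are hypothetical. The pruning step mirrors the $C_3$ idea only in spirit: a pair {a, b} is dropped from the traversal when no third item c has both {a, c} and {b, c} in $L_2$, since such a pair cannot belong to any frequent itemset of size 3.

```python
from collections import Counter
from itertools import combinations


def frequent_pairs(transactions, min_support):
    """Count every 2-itemset in one pass and keep those meeting min_support (L2)."""
    counts = Counter()
    for t in transactions:
        for pair in combinations(sorted(set(t)), 2):
            counts[pair] += 1
    return {pair: n for pair, n in counts.items() if n >= min_support}


def prune_pairs_without_triangle(l2):
    """Keep only pairs (a, b) for which some c has both {a, c} and {b, c} in l2.

    Assumption: a single pass of this necessary-condition filter, not the
    paper's exact C3-traverse procedure.
    """
    neighbours = {}
    for a, b in l2:
        neighbours.setdefault(a, set()).add(b)
        neighbours.setdefault(b, set()).add(a)
    return {(a, b) for (a, b) in l2 if neighbours[a] & neighbours[b]}


def candidates_from_pairs(pairs):
    """Depth-first enumeration of itemsets of size >= 3 whose 2-subsets are all in pairs."""
    successors = {}  # hypothetical stand-in for the tree: item -> larger paired items
    for a, b in pairs:
        successors.setdefault(a, set()).add(b)
    found = []

    def extend(itemset, allowed):
        for c in sorted(allowed):
            new = itemset + (c,)
            if len(new) >= 3:
                found.append(new)
            # Only items paired with every member of `new` may extend it further.
            extend(new, allowed & successors.get(c, set()))

    for a in sorted(successors):
        extend((a,), successors[a])
    return found


def mine(transactions, min_support):
    """Return frequent itemsets of size >= 2 with their supports (1-itemsets omitted)."""
    l2 = frequent_pairs(transactions, min_support)
    result = dict(l2)
    kept = prune_pairs_without_triangle(l2)
    tsets = [set(t) for t in transactions]
    for cand in candidates_from_pairs(kept):
        # Assumption: supports are verified by rescanning the transactions.
        support = sum(1 for t in tsets if t.issuperset(cand))
        if support >= min_support:
            result[cand] = support
    return result
```

Calling `mine(transactions, min_support)` on a list of item lists returns a dictionary mapping sorted item tuples to their supports. On sparse data the set of frequent pairs tends to be small, so few candidates survive the pair-based filter, which is consistent with the setting the abstract targets.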

Full Text