Tree-Structured Data Processing Platform for Large-Scale Data Mining

Kohsuke Yanai, Sagawa Nobutoshi, Ryoichi Ueda

doi:10.1527/tjsai.26.594

Open Access

https://doi.org/10.1527/tjsai.26.594

Abstract

We propose a data processing platform for analyzing large volumes of tree-structured data. The platform stores tree-structured data in separate files, one per attribute, and uses the MapReduce framework for distributed computing. Together, these methods reduce disk I/O load and avoid computationally intensive processing such as grouping or combining records. The early stage of data mining requires trial and error to find out how to analyze and utilize the data. Our platform speeds up the computations involved in this trial-and-error process, such as appending new attributes and calculating statistics of attributes. Experimental results show that the proposed methods process large-scale tree-structured data efficiently, and that our platform is comparable or superior to a traditional relational database system. With the proposed platform, 90 GB of data could be processed within 5 minutes on 6 benchmark tasks. We also describe a system architecture for the trial-and-error phase, which integrates the proposed platform with a few Web applications. The main contributions of this paper are: (1) a formulation of vertical partitioning for tree-structured data, (2) effective utilization of MapReduce, and (3) construction of a large-scale data mining system for the trial-and-error phase.
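The idea of vertical partitioning for tree-structured data can be illustrated with a small in-memory sketch. It is not the platform's actual storage format or API: the record layout, the path-style column names, and the helpers partition, mean, and append_attribute are all hypothetical, and Python lists stand in for the per-attribute files. The sketch only shows why a statistic over one attribute, or a newly appended attribute, needs to touch a single column rather than whole records.

```python
from collections import defaultdict


def partition(records):
    """Split nested records into one column per attribute path.

    Each column is a list of (key, value) pairs, where key is the tuple of
    positions leading to the value in the tree. Lists of dicts are treated
    as child nodes; every other value is treated as a leaf attribute.
    """
    columns = defaultdict(list)

    def walk(node, path, key):
        for name, value in node.items():
            if isinstance(value, list):
                for i, child in enumerate(value):
                    walk(child, path + (name,), key + (i,))
            else:
                columns["/".join(path + (name,))].append((key, value))

    for i, record in enumerate(records):
        walk(record, (), (i,))
    return dict(columns)


def mean(column):
    """Compute a statistic by scanning a single column only."""
    values = [value for _, value in column]
    return sum(values) / len(values)


def append_attribute(columns, source, target, fn):
    """Derive a new column from an existing one; other columns are untouched."""
    columns[target] = [(key, fn(value)) for key, value in columns[source]]


# Hypothetical tree-structured input: users -> visits -> items.
records = [
    {"user": "u1", "visits": [
        {"day": 1, "items": [{"price": 100}, {"price": 250}]},
        {"day": 3, "items": [{"price": 50}]},
    ]},
    {"user": "u2", "visits": [
        {"day": 2, "items": [{"price": 400}]},
    ]},
]

columns = partition(records)
average_price = mean(columns["visits/items/price"])
append_attribute(columns, "visits/items/price", "visits/items/expensive",
                 lambda price: price >= 200)
```

Because the derived column reuses the keys of its source column, values line up position by position without any grouping or joining of records. In a distributed setting each column could in principle be scanned by independent map tasks, but that part is not modeled here.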
