With the development of information technology and the internet, the volume of information of all kinds is growing explosively, and discovering knowledge from such massive information remains a challenge. As a pivotal technology for obtaining knowledge, data mining has attracted a large amount of research interest for several decades; however, when dealing with large-scale data, most previous work is still not as efficient as expected. Therefore, extending algorithms to handle large-scale data and improving their execution efficiency have become important issues in data mining, and cloud-computing-based data mining has recently become a hot topic. In this paper, we develop a parallel and distributed data mining toolkit platform (PDMiner) based on Hadoop, a large-scale data processing platform. In PDMiner, we implement various data mining operations, such as data preprocessing, association rule analysis, classification, and clustering, in a parallel manner. The experimental results show that these parallel algorithms 1) can handle large-scale data sets, up to the terabyte scale; 2) are highly efficient, achieving good speedup; 3) are easily extended to run on a cluster of commodity machines, making full use of the available computing resources; and 4) are efficient for practical data mining. Additionally, we develop a knowledge flow subsystem that helps users define data mining tasks in PDMiner. Furthermore, new parallel algorithms can be conveniently integrated into PDMiner through a flexible interface.