Seamless Integration of Data Mining with DBMS and Applications

Hongjun Lu

doi:10.1007/3-540-45357-1_3

Abstract

Data mining has been widely recognized as a powerful tool for exploring added value from data accumulated in the daily operations of an organization. A large number of data mining algorithms have been developed during the past decade. Those algorithms can be roughly divided into two groups. The first group of techniques, such as classification, clustering, prediction and deviation analysis, has been studied for a long time in machine learning, statistics, and other fields. The second group of techniques, such as association rule mining, mining in spatial-temporal databases and mining from the Web, addresses problems related to large amounts of data. Most classical algorithms in the first group assume that the data to be mined is somehow available in memory. Although initial efforts in data mining concentrated on making those algorithms scale to large volumes of data, most of the resulting scalable algorithms, even those developed by database researchers, are still stand-alone. It is often assumed that data is available in the desired form, without considering the fact that most organizations store their data in databases managed by database management systems (DBMS). As such, most data mining algorithms can only be loosely coupled with the data infrastructure of an organization and are difficult to infuse into existing mission-critical applications. Seamlessly integrating data mining techniques with database applications and database management systems remains an open problem. In this paper, we propose to tackle the problem of seamless integration of data mining with DBMS and applications from three directions. First, with the recent development of database technology, most database management systems have extended their functionality in data analysis. Such capability should be fully exploited to develop DBMS-aware data mining algorithms. Ideally, data mining algorithms can be fully implemented using DBMS-supported functions so that they become database applications themselves. Second, the major difficulties in integrating data mining with applications are algorithm selection and parameter setting. Reducing or eliminating mining parameters as much as possible and developing automatic or semi-automatic techniques for selecting mining algorithms will greatly increase the application friendliness of data mining systems. Lastly, standardizing the interfaces among databases, data mining algorithms and applications can also facilitate the integration to a certain extent.
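As an illustration of the first direction, the following minimal sketch shows one mining step, counting frequent item pairs for association rule mining, expressed entirely as a query evaluated by the DBMS instead of by a stand-alone program. It is not taken from the paper: the `purchases` table, its `tid` and `item` columns, the helper names `frequent_pairs` and `build_demo_db`, the sample rows, and the choice of SQLite through Python's standard `sqlite3` module are all assumptions made for the example.

```python
import sqlite3


def frequent_pairs(conn, min_support):
    """Return (item1, item2, support) for item pairs that co-occur in at
    least `min_support` transactions. All counting is done by the DBMS."""
    query = """
        SELECT a.item AS item1, b.item AS item2, COUNT(*) AS support
        FROM purchases AS a
        JOIN purchases AS b
          ON a.tid = b.tid AND a.item < b.item  -- each unordered pair once
        GROUP BY a.item, b.item
        HAVING COUNT(*) >= ?
        ORDER BY support DESC, item1, item2
    """
    return conn.execute(query, (min_support,)).fetchall()


def build_demo_db():
    """Create an in-memory database with a hypothetical purchases table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE purchases (tid INTEGER, item TEXT, PRIMARY KEY (tid, item))"
    )
    rows = [
        (1, "bread"), (1, "milk"),
        (2, "bread"), (2, "milk"), (2, "eggs"),
        (3, "milk"), (3, "eggs"),
        (4, "bread"), (4, "milk"),
    ]
    conn.executemany("INSERT INTO purchases VALUES (?, ?)", rows)
    return conn


if __name__ == "__main__":
    conn = build_demo_db()
    for item1, item2, support in frequent_pairs(conn, min_support=2):
        print(item1, item2, support)
```

Because the self-join, grouping and support threshold are all evaluated inside the database, the data never has to be exported into a separate in-memory format, which is the kind of tight coupling the abstract argues for. The support threshold remains a user-supplied parameter here, so the sketch addresses only the first of the three directions.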

Full Text