A framework for data-parallel knowledge discovery in databases

A.A. Freitas

doi:10.1049/ic:19961111

Abstract

Despite the great demand for KDD (knowledge discovery in databases) in large database and data warehouse systems, KDD algorithms have generally been applied only to relatively small data samples and are not integrated with relational DBMSs. Applying KDD algorithms to large databases raises serious scalability problems, in particular unacceptably long processing times. This paper proposes a framework for data-parallel KDD, aimed mainly at improving the efficiency and scalability of KDD algorithms. The approach is based on generic, context-free, set-oriented primitives. The primitives are generic in the sense that they capture the core operations underlying a number of KDD algorithms. This is important because no single algorithm can be expected to perform well across all domains. Moreover, the primitives are set-oriented, i.e. they operate on data elements independently of the order of those elements. This order independence allows data parallelism to be exploited efficiently on cost-effective parallel database servers through SQL queries. (4 pages)
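The abstract does not spell out the primitives themselves, so the following is only a minimal sketch of what a set-oriented primitive expressed as an SQL query might look like. It assumes a hypothetical "count by group" primitive (tuple counts per attribute value and class), a hypothetical `examples` table, and in-memory SQLite databases standing in for the partitions of a parallel database server. None of these names come from the paper. The point it illustrates is that, because the result does not depend on tuple order, each partition can be counted independently and the partial counts simply added.

```python
import sqlite3
from collections import Counter


def count_by_group(conn, table, attribute, class_column):
    """Hypothetical primitive: count tuples per (attribute value, class) pair.

    Identifiers are interpolated directly into the query, so they are
    assumed to come from trusted code, not from user input.
    """
    query = (
        f'SELECT "{attribute}", "{class_column}", COUNT(*) '
        f'FROM "{table}" GROUP BY "{attribute}", "{class_column}"'
    )
    return Counter({(value, cls): n for value, cls, n in conn.execute(query)})


def make_partition(rows):
    """Load rows into a fresh in-memory database (stand-in for one server node)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE examples (outlook TEXT, play TEXT)")
    conn.executemany("INSERT INTO examples VALUES (?, ?)", rows)
    return conn


# Toy data, invented purely for illustration.
rows = [
    ("sunny", "no"), ("sunny", "no"), ("overcast", "yes"),
    ("rain", "yes"), ("rain", "no"), ("sunny", "yes"),
]

# Counts computed over the whole table in one query.
whole = count_by_group(make_partition(rows), "examples", "outlook", "play")

# The same data split into two partitions, each stored in reversed order.
part_a = make_partition(rows[:3][::-1])
part_b = make_partition(rows[3:][::-1])

# Partial counts are merged by addition; tuple order and placement are irrelevant.
merged = (
    count_by_group(part_a, "examples", "outlook", "play")
    + count_by_group(part_b, "examples", "outlook", "play")
)

assert merged == whole
```

Counts of this kind underlie several families of algorithms (for example, evaluating candidate splits in decision-tree induction), which is one way a single primitive could serve more than one KDD algorithm. Whether the paper's primitives take this exact form is an assumption here.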
