Structural XML Classification in Concept Drifting Data Streams

Dariusz Brzezinski,Maciej Piernik

doi:10.1007/s00354-015-0401-5

Abstract

Classification of large, static collections of XML data has been intensively studied in the last several years. Recently however, the data processing paradigm is shifting from static to streaming data, where documents have to be processed online using limited memory and class definitions can change with time in an event called concept drift. As most existing XML classifiers are capable of processing only static data, there is a need to develop new approaches dedicated for streaming environments. In this paper, we propose a new classification algorithm for XML data streams called XSC. The algorithm uses incrementally mined frequent subtrees and a tree-subtree similarity measure to classify new documents in an associative manner. The proposed approach is experimentally evaluated against eight state-of-the-art stream classifiers on real and synthetic data. The results show that XSC performs significantly better than competitive algorithms in terms of accuracy and memory usage.

Full Text