Abstract

High-volume data streams with concept drift have garnered a great deal of attention in the machine learning community. Numerous researchers have proposed online learning algorithms that train incrementally on new observations and provide continuously relevant predictions. Compared to earlier offline or sliding-window approaches, these algorithms have shown better predictive performance, rapid detection of and adaptation to concept drift, and greater scalability to high-volume or high-velocity data. Online Random Forest (ORF) is one such approach to streaming classification problems. We adapted the feature importance metrics of Mean Decrease in Accuracy (MDA) and Mean Decrease in Gini Impurity (MDG), both originally designed for offline Random Forest, to Online Random Forest so that they evolve with time and concept drift. Our work is novel in that previous streaming models have not provided any measure of feature importance. We experimentally tested our Online Random Forest versions of feature importance against their offline counterparts and concluded that our approach validly tracks the underlying drifting concepts in a simulated data stream.
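
The abstract does not spell out the update rule, so the sketch below is not the authors' algorithm; it is a hypothetical illustration of the general idea of a feature importance score that evolves with the stream. It maintains an exponentially decayed, MDA-style estimate: for each arriving example, every feature is in turn replaced by a slowly updated reference value, and the resulting drop in correctness is folded into a decayed average. The class name, the decay parameter, and the generic model.predict interface are all assumptions, not part of the paper.

import numpy as np

class StreamingPermutationImportance:
    """Running, exponentially decayed MDA-style score per feature.

    For each arriving (x, y) pair, the model's correctness on x is compared
    with its correctness after feature j is replaced by a slowly updated
    reference value; the difference is folded into a decayed average so that
    older concepts are gradually forgotten.
    """

    def __init__(self, n_features, decay=0.99):
        self.decay = decay                      # forgetting factor (hypothetical knob)
        self.importance = np.zeros(n_features)  # running MDA-style scores
        self.reference = np.zeros(n_features)   # stand-in "permutation" values

    def update(self, model, x, y):
        x = np.asarray(x, dtype=float)
        correct = float(model.predict([x])[0] == y)   # assumes an sklearn-like predict()
        for j in range(x.shape[0]):
            x_perturbed = x.copy()
            x_perturbed[j] = self.reference[j]        # crude surrogate for permutation
            correct_j = float(model.predict([x_perturbed])[0] == y)
            drop = correct - correct_j                # accuracy lost without feature j
            self.importance[j] = (self.decay * self.importance[j]
                                  + (1.0 - self.decay) * drop)
        # drift the per-feature reference values toward recent data
        self.reference = self.decay * self.reference + (1.0 - self.decay) * x

A faithful online MDA would instead use out-of-bag samples per tree inside the Online Random Forest; the single decay factor here merely stands in for whatever forgetting mechanism lets the importance scores follow concept drift.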
