Abstract

The random forest classifier is widely used across different fields due to its accuracy and robustness. Since its invention, the random forest algorithm has naturally been developed for multi-dimensional vectorial data, since features can be sampled directly during the decision tree construction procedure. In the context of discrete sequence classification, an explicit feature set is not readily available, so a feature extraction algorithm must be employed before building the random forest. However, such a predefined feature subset may limit the diversity of the decision trees, because the full set of candidate features consists of all subsequences. As a result, the predictive accuracy of the constructed random forest classifier may be reduced. To address this, we propose a new algorithm that directly builds a random forest by adaptively choosing features from the set of all subsequences. To improve the running efficiency of our algorithm, the count-suffix tree is utilized to enable fast frequency counting of subsequences, thereby accelerating the generation of each randomized decision tree. Experimental results on 15 real datasets show that our method outperforms state-of-the-art classification algorithms in terms of predictive accuracy. The source code of our method can be found at: https://github.com/JiaqiWang-dlut/RSForest.
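
As a rough illustration of the idea (not the authors' implementation), the sketch below counts subsequence frequencies with a plain dictionary standing in for the count-suffix tree, and draws a random split candidate from the pool of subsequences observed in the training data, which is a simplified view of adaptive feature sampling during tree construction. The function names and the `max_len` cap are hypothetical and only for exposition.

```python
from collections import Counter
import random


def substring_counts(sequence, max_len=3):
    """Count the frequency of every substring of length <= max_len.

    A count-suffix tree would deliver the same counts far more efficiently;
    this brute-force dictionary is only a stand-in for illustration.
    """
    counts = Counter()
    n = len(sequence)
    for i in range(n):
        for j in range(i + 1, min(i + max_len, n) + 1):
            counts[sequence[i:j]] += 1
    return counts


def sample_split_candidate(sequences, rng, max_len=3):
    """Draw one candidate subsequence feature at random from the pool of
    all substrings observed in the training sequences, mimicking the idea
    of choosing features adaptively rather than from a predefined subset."""
    pool = Counter()
    for s in sequences:
        pool.update(substring_counts(s, max_len))
    return rng.choice(sorted(pool))


if __name__ == "__main__":
    rng = random.Random(0)
    data = ["ACGTAC", "GTACGT", "ACACAC"]
    print(substring_counts(data[0]))
    print("candidate feature:", sample_split_candidate(data, rng))
```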
