Abstract

In the specialized literature, researchers can find a large number of proposals for solving regression problems that come from different research areas. However, researchers tend to use only proposals from the area in which they are experts. This paper analyzes the performance of a large number of the regression algorithms available in some of the best-known and most widely used software tools. The aim is twofold: to help non-expert users from other areas solve their own regression problems properly, and to help specialized researchers develop well-founded future proposals by properly comparing and identifying algorithms, so that they can focus on significant further developments. In total, we have analyzed 164 algorithms from 14 main families (Neural Networks, Support Vector Machines, Regression Trees, Rule-Based Methods, Stacking, Random Forests, Model trees, Generalized Linear Models, Nearest Neighbor methods, Partial Least Squares and Principal Component Regression, Multivariate Adaptive Regression Splines, Bagging, Boosting, and other methods), available in 6 software tools, over 52 datasets. A new measure is also proposed to show the goodness of each algorithm with respect to the others. Finally, a statistical analysis using non-parametric tests has been carried out over all the algorithms and over the best 30 algorithms, both with and without bagging. Results show that algorithms from the Random Forest, Model Tree and Support Vector Machine families obtain the best positions in the rankings produced by the statistical tests when bagging is not considered. In addition, the use of bagging techniques significantly improves the performance of the algorithms without an excessive increase in computational time.
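
The abstract refers to rankings obtained from non-parametric tests but does not name the tests here. As a minimal sketch, assuming a Friedman-style average ranking (a common choice for comparing algorithms over many datasets, not confirmed as the paper's procedure), the idea can be illustrated as follows. The function names and the toy error values are hypothetical and are not taken from the paper.

```python
from typing import Dict, List


def rank_row(errors: List[float]) -> List[float]:
    """Rank the errors on one dataset (1 = lowest error); ties share the average rank."""
    order = sorted(range(len(errors)), key=lambda i: errors[i])
    ranks = [0.0] * len(errors)
    pos = 0
    while pos < len(order):
        end = pos
        # Extend the group while the next error equals the current one.
        while end + 1 < len(order) and errors[order[end + 1]] == errors[order[pos]]:
            end += 1
        shared = (pos + end) / 2 + 1  # mean of the 1-based positions in the tie group
        for k in range(pos, end + 1):
            ranks[order[k]] = shared
        pos = end + 1
    return ranks


def average_ranks(errors_by_algorithm: Dict[str, List[float]]) -> Dict[str, float]:
    """Average each algorithm's rank over all datasets (lower is better)."""
    names = list(errors_by_algorithm)
    n_datasets = len(errors_by_algorithm[names[0]])
    totals = [0.0] * len(names)
    for d in range(n_datasets):
        row = [errors_by_algorithm[name][d] for name in names]
        for j, r in enumerate(rank_row(row)):
            totals[j] += r
    return {name: totals[j] / n_datasets for j, name in enumerate(names)}


def friedman_statistic(avg_ranks: Dict[str, float], n_datasets: int) -> float:
    """Friedman chi-square statistic from average ranks (no tie correction)."""
    k = len(avg_ranks)
    sum_sq = sum(r * r for r in avg_ranks.values())
    return 12 * n_datasets / (k * (k + 1)) * (sum_sq - k * (k + 1) ** 2 / 4)


# Hypothetical test errors of three algorithms on four datasets (toy numbers).
toy_errors = {
    "A": [0.30, 0.25, 0.40, 0.35],
    "B": [0.20, 0.22, 0.31, 0.35],
    "C": [0.50, 0.45, 0.42, 0.60],
}
toy_ranks = average_ranks(toy_errors)
toy_stat = friedman_statistic(toy_ranks, n_datasets=4)
```

In this toy example, algorithm B has the lowest average rank and would head the ranking. A real analysis would compare the statistic against its reference distribution and follow up with post-hoc tests, which this sketch omits.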

Highlights

  • Regression is one of the most classic statistical techniques for predictive data mining [1]

  • The study aims to help non-expert users from other areas solve their own regression problems properly, and to help specialized researchers develop well-founded future proposals by properly comparing and identifying algorithms so that they can focus on significant further developments

  • We have analyzed 164 regression algorithms that come from 14 different families (Neural Networks, Support Vector Machines, Regression Trees, Rule-Based Methods, Stacking, Random Forests, Model trees, Generalized Linear Models, Nearest Neighbor methods, Partial Least Squares and Principal Component Regression, Multivariate Adaptive Regression Splines, Bagging, Boosting, and Other Methods) and that are available in the software tools Java Statistical Analysis Tool (JSAT) [9], KEEL [10], Matlab [5], R [6], Scikit-learn [11] and Weka [7], [8]

Summary

Introduction

Regression is one of the most classic statistical techniques for predictive data mining [1]. When new approaches are published in any of these areas, researchers tend to use the same category of algorithms historically applied in the area of research in which they are experts, probably due to their partial knowledge of the available algorithms. This problem is made worse because only a few researchers make the software and/or source code associated with their proposals public, and authors sometimes provide vague or even ambiguous descriptions in the specialized literature. This issue, along with the high complexity of some proposals, makes the widespread use of

