Abstract

The aim of the StatLog project is to compare the performance of statistical, machine learning, and neural network algorithms on large real-world problems. This paper describes the completed work on classification in the StatLog project. Classification is here defined as the problem of estimating, from a set of attributes describing a new example, the probability that it belongs to a pre-defined class, given a set of multivariate data with assigned classes sampled from the same source. We gathered together a representative collection of algorithms from statistics (Naive Bayes, K-nearest Neighbour, Kernel density, Linear discriminant, Quadratic discriminant, Logistic regression, Projection pursuit, Bayesian networks), machine learning (CART, C4.5, NewID, AC2, CAL5, CN2, ITrule; only propositional symbolic algorithms were considered), and neural networks (Backpropagation, Radial basis functions, Kohonen). We then applied these algorithms to eight large real-world classification problems: four from image analysis, two from medicine, and one each from engineering and finance. Our results are still provisional, but we can draw a number of tentative conclusions about the applicability of particular algorithms to particular database types. For example, we found that K-nearest Neighbour can perform well on complex image analysis problems if the attributes are properly scaled, but it is very slow; machine learning algorithms are very fast and robust to non-Normal features of databases, but may be out-performed when particular distributional assumptions hold. We also found that many classification algorithms need to be extended to deal better with cost functions (problems where the classes have an ordered relationship are a special case of this).
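The remark about K-nearest Neighbour depending on properly scaled attributes can be illustrated with a minimal sketch (not the StatLog implementation; function names and the toy data are illustrative only). Each attribute is standardized to zero mean and unit variance before Euclidean distances are computed, so that no attribute dominates the vote purely because of its units:

```python
import numpy as np

def standardize(X, mean=None, std=None):
    """Scale each attribute to zero mean and unit variance.

    Training statistics (mean, std) are returned so that new examples
    can be scaled consistently with the training set.
    """
    if mean is None:
        mean = X.mean(axis=0)
        std = X.std(axis=0)
        std = np.where(std == 0, 1.0, std)  # guard constant attributes
    return (X - mean) / std, mean, std

def knn_predict(X_train, y_train, X_new, k=3):
    """Classify each row of X_new by majority vote among its k nearest
    training examples under Euclidean distance (brute-force search,
    which is why k-NN is slow on large databases)."""
    preds = []
    for x in X_new:
        dists = np.linalg.norm(X_train - x, axis=1)
        nearest_labels = y_train[np.argsort(dists)[:k]]
        preds.append(int(np.argmax(np.bincount(nearest_labels))))
    return np.array(preds)
```

Without the standardization step, an attribute measured on a large numeric scale would swamp the distance computation and effectively decide every classification on its own, which is the failure mode the abstract's caveat about scaling refers to.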
