Abstract

Supervised learning algorithms nowadays successfully scale up to very large datasets by leveraging in-memory cluster-computing Big Data frameworks. Still, massive datasets with many large-domain categorical features remain a difficult challenge for any classifier, and most off-the-shelf solutions cannot cope with them. In this work we introduce DAC, a Distributed Associative Classifier. DAC exploits ensemble learning to distribute the training of an associative classifier among parallel workers and to improve the final quality of the model. Furthermore, it adopts several novel techniques to reach high scalability without sacrificing quality, among them a preventive pruning of classification rules in the extraction phase based on Gini impurity. We ran experiments on Apache Spark, on a real large-scale dataset with more than 4 billion records and 800 million distinct categories. The results show that DAC improves on a state-of-the-art solution in both prediction quality and execution time. Since the generated model is human-readable, it can not only classify new records, but also help users understand both the logic behind each prediction and the properties of the model, making it a useful aid for decision makers.
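As a concrete illustration of the Gini-based preventive pruning, here is a minimal Scala sketch (hypothetical names and an illustrative threshold, not the paper's actual implementation) that filters a candidate rule by the Gini impurity of the class distribution of the records it covers:

    object GiniPruning {
      // Gini impurity of a class-frequency histogram: 1 - sum over classes of p_c^2.
      def gini(classCounts: Map[String, Long]): Double = {
        val total = classCounts.values.sum.toDouble
        1.0 - classCounts.values.map { c => val p = c / total; p * p }.sum
      }

      // Keep a candidate rule only if the records matching its antecedent
      // are sufficiently pure (maxImpurity is an assumed, tunable threshold).
      def keepRule(classCounts: Map[String, Long], maxImpurity: Double): Boolean =
        gini(classCounts) <= maxImpurity
    }

A rule whose covered records all share one class has impurity 0 and is always kept; less pure candidates can be discarded already during extraction, before they inflate the model.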

Highlights

  • In recent years, Big Data have received much attention from both academia and industry, with the aim of fully leveraging the power of the information they hide

  • We tested two values for the measure m: the confidence of the matching rules, which is a common choice in associative classifiers, and 1 − support, following the intuition that the rarer a rule is, the better it is at labeling [20]. g() was chosen among min, max, and product, three functions that are associative and commutative, properties that are important for distributing the workload (see the sketch after this list)

  • In this work, we have proposed DAC, a Distributed Associative Classifier: a technique to scale associative classification to very large datasets
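The sketch below (minimal Scala, hypothetical names rather than the paper's code) illustrates the scoring scheme from the second highlight: each candidate class label is scored by combining, with g, the measure m of all matching rules that predict it. Because min, max, and product are associative and commutative, the reduce can be evaluated in any order, which is what allows the combination to be distributed:

    case class Rule(label: String, confidence: Double, support: Double)

    // Score each class label by folding the per-rule measure m with g.
    def score(matching: Seq[Rule],
              m: Rule => Double,
              g: (Double, Double) => Double): Map[String, Double] =
      matching.groupBy(_.label)
              .map { case (lbl, rs) => lbl -> rs.map(m).reduce(g) }

    // Example: m = confidence, g = product; predict the best-scoring label.
    // val prediction = score(matchingRules, _.confidence, _ * _).maxBy(_._2)._1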


Summary

Introduction

Big Data have received much attention from both academia and industry, with the aim of fully leveraging the power of the information they hide. Scalability along the domain dimension is a special concern for datasets in which most of the features are categorical. Categorical features have their values expressed in a discrete domain, with no assumed concept of ordering or ranking; discrete or discretized features are a special case of categorical features where an order among the values is defined. Each feature is identified by a feature_id, which is set to some value v for each record, or to a null value when the information is not available. In the transactional representation, missing data are either marked with a null value or not represented at all, as transactions do not have a fixed structure. The training and test sets are used together with other techniques, like cross-validation, to simulate the behavior of the algorithm on new, unlabeled data and to validate its performance
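The following minimal Scala sketch (hypothetical types, not the paper's code) shows this transactional representation: each record becomes a set of (feature_id, value) items, and missing values are simply omitted, so transactions have no fixed structure:

    type Item = (String, String) // (feature_id, value)

    // Missing values (None) produce no item at all.
    def toTransaction(record: Map[String, Option[String]]): Set[Item] =
      record.collect { case (featureId, Some(v)) => featureId -> v }.toSet

    // Example: the absent "device" feature is not represented.
    // toTransaction(Map("country" -> Some("IT"), "device" -> None,
    //                   "browser" -> Some("firefox")))
    // => Set(("country", "IT"), ("browser", "firefox"))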
