Abstract

Supervised learning algorithms nowadays successfully scale up to very large datasets by leveraging in-memory cluster-computing Big Data frameworks. Still, massive datasets with many large-domain categorical features remain a difficult challenge for any classifier, and most off-the-shelf solutions cannot cope with them. In this work we introduce DAC, a Distributed Associative Classifier. DAC exploits ensemble learning to distribute the training of an associative classifier among parallel workers and to improve the final quality of the model. Furthermore, it adopts several novel techniques to reach high scalability without sacrificing quality, among them a preventive pruning of classification rules in the extraction phase based on Gini impurity. We ran experiments on Apache Spark, on a real large-scale dataset with more than 4 billion records and 800 million distinct categories. The results show that DAC improves on a state-of-the-art solution in both prediction quality and execution time. Since the generated model is human-readable, it can not only classify new records, but also help users understand both the logic behind each prediction and the properties of the model, making it a useful aid for decision makers.
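As a concrete illustration of the Gini-based preventive pruning, here is a minimal Scala sketch (hypothetical names and an illustrative threshold, not the paper's actual implementation) that filters a candidate rule by the Gini impurity of the class distribution of the records it covers:

    object GiniPruning {
      // Gini impurity of a class-frequency histogram: 1 - sum over classes of p_c^2.
      def gini(classCounts: Map[String, Long]): Double = {
        val total = classCounts.values.sum.toDouble
        1.0 - classCounts.values.map { c => val p = c / total; p * p }.sum
      }

      // Keep a candidate rule only if the records matching its antecedent
      // are sufficiently pure (maxImpurity is an assumed, tunable threshold).
      def keepRule(classCounts: Map[String, Long], maxImpurity: Double): Boolean =
        gini(classCounts) <= maxImpurity
    }

A rule whose covered records all share one class has impurity 0 and is always kept; less pure candidates can be discarded already during extraction, before they inflate the model.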

Highlights

  • In recent years, Big Data have received much attention from both academia and industry, with the aim of fully leveraging the power of the information they hide

  • We tested two values for the measure m: the confidence of the matching rules, which is a common choice in associative classifiers, and 1 − support, following the intuition that the rarer a rule is, the better it is at labeling [20]. g() was chosen among min, max, and product, three functions that are associative and commutative, properties that are important for distributing the workload (see the sketch after this list)

  • In this work, we have proposed DAC, a Distributed Associative Classifier: a technique to scale associative classification to very large datasets
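The sketch below (minimal Scala, hypothetical names rather than the paper's code) illustrates the scoring scheme from the second highlight: each candidate class label is scored by combining, with g, the measure m of all matching rules that predict it. Because min, max, and product are associative and commutative, the reduce can be evaluated in any order, which is what allows the combination to be distributed:

    case class Rule(label: String, confidence: Double, support: Double)

    // Score each class label by folding the per-rule measure m with g.
    def score(matching: Seq[Rule],
              m: Rule => Double,
              g: (Double, Double) => Double): Map[String, Double] =
      matching.groupBy(_.label)
              .map { case (lbl, rs) => lbl -> rs.map(m).reduce(g) }

    // Example: m = confidence, g = product; predict the best-scoring label.
    // val prediction = score(matchingRules, _.confidence, _ * _).maxBy(_._2)._1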


Summary

Introduction

Big Data have received much attention from both academia and industry, with the aim of fully leveraging the power of the information they hide. Scalability along the domain dimension is a special concern for datasets in which most of the features are categorical. Categorical features have their values expressed in a discrete domain, with no assumed concept of ordering or ranking; discrete or discretized features are a special case of categorical features where an order among the values is defined. Each feature is identified by a feature_id, which is set to some value v for each record, or to a null value when the information is not available. In the transactional representation, missing data are either marked with a null value or not represented at all, as transactions do not have a fixed structure. The training and test sets are used together with other techniques, like cross-validation, to simulate the behavior of the algorithm on new, unlabeled data and to validate its performance
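The following minimal Scala sketch (hypothetical types, not the paper's code) shows this transactional representation: each record becomes a set of (feature_id, value) items, and missing values are simply omitted, so transactions have no fixed structure:

    type Item = (String, String) // (feature_id, value)

    // Missing values (None) produce no item at all.
    def toTransaction(record: Map[String, Option[String]]): Set[Item] =
      record.collect { case (featureId, Some(v)) => featureId -> v }.toSet

    // Example: the absent "device" feature is not represented.
    // toTransaction(Map("country" -> Some("IT"), "device" -> None,
    //                   "browser" -> Some("firefox")))
    // => Set(("country", "IT"), ("browser", "firefox"))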
