Abstract

One of the most common tasks in today's big data environments is classifying large amounts of data. Numerous classification models exist, each designed to perform best in particular environments and on particular datasets, and each with its own advantages and disadvantages. When dealing with big data, however, their performance degrades significantly because they were not designed for, and are often not capable of, handling very large datasets. The approach proposed here exploits the dynamics of skyline queries to identify the decision boundary efficiently and classify big data. A comparison against the popular k-nearest neighbor (k-NN), support vector machine (SVM) and naïve Bayes classification algorithms shows that the proposed method is faster than both k-NN and SVM. Its novelty lies in the fact that only a small number of computations are needed to make a prediction, and its full potential is revealed on very large datasets.
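The abstract only sketches the mechanics, so the following is a minimal illustration of how a skyline-based classifier of this kind might operate, assuming a naive skyline computation and nearest-skyline-point labeling; all names are hypothetical, and the paper's actual procedure (including its per-class origin points) may differ.

    import math

    def dominates(p, q):
        # p dominates q when p is no worse in every coordinate and
        # strictly better in at least one (minimization convention).
        return all(a <= b for a, b in zip(p, q)) and any(a < b for a, b in zip(p, q))

    def skyline(points):
        # Naive O(n^2) skyline: keep the points dominated by no other point.
        return [p for p in points if not any(dominates(q, p) for q in points if q != p)]

    def classify(query, skyline_a, skyline_b):
        # Label the query by the class whose skyline lies closest to it; only
        # the (typically few) skyline points are touched at prediction time,
        # which is where the claimed speed-up would come from.
        da = min(math.dist(query, p) for p in skyline_a)
        db = min(math.dist(query, p) for p in skyline_b)
        return "A" if da <= db else "B"

    class_a = [(1.0, 2.0), (2.0, 1.5), (3.0, 3.0)]
    class_b = [(4.0, 5.0), (5.0, 4.5), (6.0, 6.0)]
    print(classify((2.5, 2.5), skyline(class_a), skyline(class_b)))  # -> A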

Highlights

  • The increased amount of high-volume, high-velocity, high-variety and high-veracity data produced in the last decade has created the need to develop cost-effective techniques to manage them, which fall under the term big data [1]

  • Single curve with polynomial curve-fitting: throughout our experimental phase, we observed that many, and in some cases all, of the skyline points belong to the set of support vectors used by the final support vector machine (SVM) (Figure 8a); a hedged sketch of this curve-fitting idea follows this list

  • For the synthetic datasets consisting of 1 M points, the naïve Bayes approach finished in less than a second, the SVM took several minutes (Table 2, times in milliseconds), and the k-nearest neighbor (k-NN) classifier did not finish in a reasonable time
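As a concrete companion to the first highlight, the sketch below fits a single low-degree polynomial through a handful of skyline points to stand in for the decision boundary, in the spirit of Figure 8a; the points, the degree and the above/below labeling rule are all illustrative assumptions rather than the paper's exact procedure.

    import numpy as np

    # Skyline points assumed to trace the frontier between the two classes.
    skyline_pts = np.array([[1.0, 4.0], [2.0, 2.5], [3.0, 2.0], [4.0, 1.8]])

    # One low-degree polynomial fitted through the skyline points serves as
    # the single-curve decision boundary.
    boundary = np.poly1d(np.polyfit(skyline_pts[:, 0], skyline_pts[:, 1], deg=2))

    def predict(x, y):
        # The label depends on which side of the fitted curve the point falls.
        return "class_A" if y > boundary(x) else "class_B"

    print(predict(2.5, 3.0))  # -> class_A (the point lies above the curve)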


Summary

Introduction

The increased amount of high-volume, high-velocity, high-variety and high-veracity data produced in the last decade has created the need to develop cost-effective techniques to manage them, which fall under the term big data [1]. Machine learning (ML) methods have reached a point at which even a set of weak classifiers can be combined, using ensemble learning techniques [12], to produce good results. With this in mind, each time a new classifier is proposed, questions arise as to whether we really need one more [13]. Even with these techniques, it is not always feasible to perform a classification task with low processing costs in a big data environment, since traditional classification algorithms are designed primarily to achieve exceptional accuracy, with tradeoffs in space or time complexity. Computing the skyline is equivalent to the maximal vector problem [17]. To our knowledge, this is the first work that tries to harness the power of skyline queries in a classification process for big data.
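Since the introduction equates the task with the maximal vector problem [17], a minimal sketch of that formulation follows; unlike the minimization convention used in the earlier sketch, maximal vectors use max-domination, and the naive O(n^2) scan here only illustrates what scalable big data skyline algorithms compute far more efficiently.

    def max_dominates(p, q):
        # p max-dominates q when p >= q in every dimension and > in at least one.
        return all(a >= b for a, b in zip(p, q)) and any(a > b for a, b in zip(p, q))

    def maximal_vectors(points):
        # The maximal vectors are exactly the skyline under max-domination.
        return [p for p in points if not any(max_dominates(q, p) for q in points if q != p)]

    print(maximal_vectors([(1, 3), (2, 2), (3, 1), (1, 1), (2, 3)]))  # -> [(3, 1), (2, 3)]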

Background and Related Work
Skyline Query Family
Applications of Skyline Queries
Big Data
Preliminaries
Skyline Dataset
Cardinality
Define the Origin Points
Identifying Skyline Points
Decision Boundary
Classification Task
Experiments
Synthetic Dataset I
Method
Synthetic Dataset II
Method
Polynomial Curve-Fitting on the Synthetic Dataset
Real Dataset
Future Work
Findings
Conclusions
