Abstract
Efficient management and analysis of large volumes of data is a demanding task of increasing scientific and industrial importance, as the ubiquitous generation of information governs more and more aspects of human life. In this article, we introduce FML-kNN, a novel distributed processing framework for Big Data that performs probabilistic classification and regression, implemented in Apache Flink. The framework’s core consists of a k-nearest neighbor joins algorithm which, contrary to similar approaches, is executed in a single distributed session and is able to operate on very large volumes of data of variable granularity and dimensionality. We assess FML-kNN’s performance and scalability in a detailed experimental evaluation, in which it is compared to similar methods implemented on the Apache Hadoop, Spark, and Flink distributed processing engines. The results indicate an overall superiority of our framework in all the performed comparisons. Further, we apply FML-kNN to two motivating use cases in water demand management, against real-world domestic water consumption data. In particular, we focus on forecasting water consumption using 1-h smart meter data and on extracting consumer characteristics from water use data in the shower. We further discuss the obtained results, demonstrating the framework’s potential for useful knowledge extraction.
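To illustrate the kind of probabilistic classification the abstract refers to, the following is a minimal single-machine sketch of a kNN classifier that converts the labels of the k nearest neighbors into class probabilities. This is a hypothetical illustration only, not the paper's FML-kNN algorithm, which executes the kNN join as a single distributed Flink session; the class names and data are invented.

```java
import java.util.*;

/** Minimal sketch of probabilistic kNN classification (illustrative only). */
public class KnnSketch {
    /** Estimates P(class | query) from the labels of the k nearest neighbors. */
    static Map<String, Double> classify(double[][] train, String[] labels,
                                        double[] query, int k) {
        // Sort training-point indices by Euclidean distance to the query.
        Integer[] idx = new Integer[train.length];
        for (int i = 0; i < idx.length; i++) idx[i] = i;
        Arrays.sort(idx, Comparator.comparingDouble(i -> dist(train[i], query)));

        // Each of the k nearest neighbors contributes 1/k to its label's probability.
        Map<String, Double> probs = new HashMap<>();
        for (int i = 0; i < k; i++)
            probs.merge(labels[idx[i]], 1.0 / k, Double::sum);
        return probs;
    }

    static double dist(double[] a, double[] b) {
        double s = 0;
        for (int i = 0; i < a.length; i++) s += (a[i] - b[i]) * (a[i] - b[i]);
        return Math.sqrt(s);
    }

    public static void main(String[] args) {
        double[][] train = {{1, 1}, {1, 2}, {8, 8}, {9, 8}};
        String[] labels = {"low", "low", "high", "high"};
        // Prints e.g. {high=0.67, low=0.33} for k = 3.
        System.out.println(classify(train, labels, new double[]{8.5, 8}, 3));
    }
}
```

For regression, the same neighbor search applies, with the prediction taken as the (possibly distance-weighted) average of the neighbors' numeric target values instead of a label vote.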
Highlights
During the past few years, new database management and distributed computing technologies have emerged to satisfy the need for systems that can efficiently store and operate on massive volumes of data
We introduce Flink Machine Learning (FML)-k-nearest neighbors (kNN), a novel distributed processing framework for Big Data that performs probabilistic classification and regression, implemented in Apache Flink
We assess FML-kNN’s performance and scalability in a detailed experimental evaluation, in which it is compared to similar methods implemented in Apache Hadoop, Spark, and Flink distributed processing engines
Summary
During the past few years, new database management and distributed computing technologies have emerged to satisfy the need for systems that can efficiently store and operate on massive volumes of data. MapReduce [1], a distributed programming model in which map functions (i.e., mappers) transform data elements in parallel across several machines and reduce functions (i.e., reducers) aggregate the intermediate results, set the foundation for this technology trend. This laid the groundwork for the development of open source distributed processing engines such as Apache Hadoop, Spark, and Flink [2], which efficiently implement and extend MapReduce. Flink provides a mechanism for automatic procedure optimization and performs better on iterative distributed algorithms [3]. It also exhibits better overall performance, as it processes tasks in a pipelined fashion [4]. A minimal sketch of the mapper/reducer roles follows.
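As a concrete sketch of the MapReduce model on Flink, the canonical word-count program written against Flink's batch DataSet API is shown below: the flatMap step plays the mapper role, emitting (word, 1) pairs, and the groupBy/sum step plays the reducer role, aggregating counts per word. The input strings are illustrative placeholders.

```java
import org.apache.flink.api.common.functions.FlatMapFunction;
import org.apache.flink.api.common.typeinfo.Types;
import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.api.java.tuple.Tuple2;

public class WordCount {
    public static void main(String[] args) throws Exception {
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

        // Illustrative input; a real job would read from a distributed source.
        DataSet<String> text = env.fromElements("water demand data", "water use data");

        text.flatMap((FlatMapFunction<String, Tuple2<String, Integer>>)
                 (line, out) -> {
                     // Map phase: emit a (word, 1) pair for every word.
                     for (String w : line.split("\\s+"))
                         out.collect(new Tuple2<>(w, 1));
                 })
            // Type hint needed because Java erases the lambda's generic types.
            .returns(Types.TUPLE(Types.STRING, Types.INT))
            .groupBy(0)   // Reduce phase: group pairs by the word field...
            .sum(1)       // ...and sum the counts within each group.
            .print();
    }
}
```

Flink parallelizes each of these operators across the cluster and, unlike a plain Hadoop MapReduce job, can stream the intermediate pairs between operators in a pipelined fashion rather than materializing them between stages.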