Abstract
Efficient management and analysis of large volumes of data is a demanding task of increasing scientific and industrial importance, as the ubiquitous generation of information governs more and more aspects of human life. In this article, we introduce FML-kNN, a novel distributed processing framework for Big Data that performs probabilistic classification and regression, implemented in Apache Flink. The framework’s core consists of a k-nearest neighbor joins algorithm which, contrary to similar approaches, is executed in a single distributed session and is able to operate on very large volumes of data of variable granularity and dimensionality. We assess FML-kNN’s performance and scalability in a detailed experimental evaluation, in which it is compared to similar methods implemented on the Apache Hadoop, Spark, and Flink distributed processing engines. The results indicate an overall superiority of our framework in all the performed comparisons. Further, we apply FML-kNN to two motivating use cases in water demand management, against real-world domestic water consumption data. In particular, we focus on forecasting water consumption using 1-h smart meter data and on extracting consumer characteristics from water use data in the shower. We further discuss the obtained results, demonstrating the framework’s potential for useful knowledge extraction.
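To illustrate the kind of probabilistic classification the abstract refers to, the following is a minimal single-machine sketch of a kNN classifier that converts the labels of the k nearest neighbors into class probabilities. This is a hypothetical illustration only, not the paper's FML-kNN algorithm, which executes the kNN join as a single distributed Flink session; the class names and data are invented.

```java
import java.util.*;

/** Minimal sketch of probabilistic kNN classification (illustrative only). */
public class KnnSketch {
    /** Estimates P(class | query) from the labels of the k nearest neighbors. */
    static Map<String, Double> classify(double[][] train, String[] labels,
                                        double[] query, int k) {
        // Sort training-point indices by Euclidean distance to the query.
        Integer[] idx = new Integer[train.length];
        for (int i = 0; i < idx.length; i++) idx[i] = i;
        Arrays.sort(idx, Comparator.comparingDouble(i -> dist(train[i], query)));

        // Each of the k nearest neighbors contributes 1/k to its label's probability.
        Map<String, Double> probs = new HashMap<>();
        for (int i = 0; i < k; i++)
            probs.merge(labels[idx[i]], 1.0 / k, Double::sum);
        return probs;
    }

    static double dist(double[] a, double[] b) {
        double s = 0;
        for (int i = 0; i < a.length; i++) s += (a[i] - b[i]) * (a[i] - b[i]);
        return Math.sqrt(s);
    }

    public static void main(String[] args) {
        double[][] train = {{1, 1}, {1, 2}, {8, 8}, {9, 8}};
        String[] labels = {"low", "low", "high", "high"};
        // Prints e.g. {high=0.67, low=0.33} for k = 3.
        System.out.println(classify(train, labels, new double[]{8.5, 8}, 3));
    }
}
```

For regression, the same neighbor search applies, with the prediction taken as the (possibly distance-weighted) average of the neighbors' numeric target values instead of a label vote.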
Highlights
During the past few years, new database management and distributed computing technologies have emerged to satisfy the need for systems that can efficiently store and operate on massive volumes of data
We introduce Flink Machine Learning (FML)-k-nearest neighbors (kNN), a novel distributed processing framework for Big Data that performs probabilistic classification and regression, implemented in Apache Flink
We assess FML-kNN’s performance and scalability in a detailed experimental evaluation, in which it is compared to similar methods implemented in Apache Hadoop, Spark, and Flink distributed processing engines
Summary
During the past few years, new database management and distributed computing technologies have emerged to satisfy the need for systems that can efficiently store and operate on massive volumes of data. MapReduce [1], a distributed programming model in which map functions (i.e., mappers) transform data elements in parallel across several machines and reduce functions (i.e., reducers) aggregate the intermediate results, set the foundation for this technology trend. This laid the groundwork for the development of open source distributed processing engines such as Apache Hadoop, Spark, and Flink [2], which efficiently implement and extend MapReduce. Flink provides a mechanism for automatic procedure optimization and performs better on iterative distributed algorithms [3]. It also exhibits better overall performance, as it processes tasks in a pipelined fashion [4]. A minimal sketch of the mapper/reducer roles follows.
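As a concrete sketch of the MapReduce model on Flink, the canonical word-count program written against Flink's batch DataSet API is shown below: the flatMap step plays the mapper role, emitting (word, 1) pairs, and the groupBy/sum step plays the reducer role, aggregating counts per word. The input strings are illustrative placeholders.

```java
import org.apache.flink.api.common.functions.FlatMapFunction;
import org.apache.flink.api.common.typeinfo.Types;
import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.api.java.tuple.Tuple2;

public class WordCount {
    public static void main(String[] args) throws Exception {
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

        // Illustrative input; a real job would read from a distributed source.
        DataSet<String> text = env.fromElements("water demand data", "water use data");

        text.flatMap((FlatMapFunction<String, Tuple2<String, Integer>>)
                 (line, out) -> {
                     // Map phase: emit a (word, 1) pair for every word.
                     for (String w : line.split("\\s+"))
                         out.collect(new Tuple2<>(w, 1));
                 })
            // Type hint needed because Java erases the lambda's generic types.
            .returns(Types.TUPLE(Types.STRING, Types.INT))
            .groupBy(0)   // Reduce phase: group pairs by the word field...
            .sum(1)       // ...and sum the counts within each group.
            .print();
    }
}
```

Flink parallelizes each of these operators across the cluster and, unlike a plain Hadoop MapReduce job, can stream the intermediate pairs between operators in a pipelined fashion rather than materializing them between stages.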