Filtering and Storing User Preferred Data: an Apache Spark Based Approach

Bannya Chanda, Shikharesh Majumdar

doi:10.1109/dasc-picom-cbdcom-cyberscitech49142.2020.00115

Abstract

This work-in-progress paper focuses on a filtering technique based on user preferences. It uses parallel processing and machine learning to extract user-preferred data from a large raw data set. Although large volumes of data are generated, a user is often interested in only selected types (classes) of such data. The motivation behind this research is to devise an effective and efficient filtering technique for extracting user-preferred data from large data sets. Storing only the filtered data and discarding the rest can decrease the latency of searching for specific information within a data set. It can also reduce the storage required for these data. However, a filtering method that relies on data classification techniques can give rise to high processing latencies. To address this, an algorithm and a system that combine parallel processing with machine learning are presented. A proof-of-concept prototype is built on the Apache Spark parallel processing platform. Analysis of the results of preliminary experiments demonstrates the viability of the investigated technique.

Full Text