Soft and Declarative Fishing of Information in Big Data Lake

Bozena Malysiak-Mrozek,Marek Stabla,Dariusz Mrozek

doi:10.1109/tfuzz.2018.2812157

Abstract

In recent years, many fields that experience a sudden proliferation of data, which increases the volume of data that must be processed and the variety of formats the data is stored in have been identified. This causes pressure on existing compute infrastructures and data analysis methods, as more and more data are considered as a useful source of information for making critical decisions in particular fields. Among these fields exist several areas related to human life, e.g., various branches of medicine, where the uncertainty of data complicates the data analysis, and where the inclusion of fuzzy expert knowledge in data processing brings many advantages. In this paper, we show how fuzzy techniques can be incorporated in big data analytics carried out with the declarative U-SQL language over a big data lake located on the cloud. We define the concept of big data lake together with the Extract, Process, and Store process performed while schematizing and processing data from the Data Lake, and while storing results of the processing. Our solution, developed as a Fuzzy Search Library for Data Lake, introduces the possibility of massively parallel, declarative querying of big data lake with simple and complex fuzzy search criteria, using fuzzy linguistic terms in various data transformations, and fuzzy grouping. Presented ideas are exemplified by a distributed analysis of large volumes of biomedical data on Microsoft Azure cloud. Results of performed tests confirm that the presented solution is highly scalable on the Cloud and is a successful step toward soft and declarative processing of data on a large scale. The solution presented in this paper directly addresses three characteristics of big data, i.e., volume , variety , and velocity , and indirectly addresses, veracity and value .

Full Text