Enhanced Data Lake Clustering Design based on K-means Algorithm

Jabrane Kachaoui, Abdessamad Belangour

doi:10.14569/ijacsa.2020.0110472

Abstract

In recent years, Big Data requirements have evolved. Organizations are trying more than ever to focus their efforts on industrial-scale exploitation of all the data at their disposal and to move further away from the underlying technologies. After investing in the Data Lake concept, organizations must now overhaul their data architecture to face the expansion of IoT (Internet of Things) and AI (Artificial Intelligence). Efficient and effective data mapping could help in understanding the importance of the data being transformed and used to support decision-making. As current relational databases are not able to manage large amounts of data, organizations have turned to NoSQL (Not only Structured Query Language) databases. One well-known NoSQL database is MongoDB, which is highly scalable. This article puts forward a new data model able to extract, classify, and then map data in order to generate new, more structured data that meets organizational needs. This is carried out by calculating weights for various metadata attributes, which are considered important information. The article also clusters the data stored in MongoDB. This categorization is based on the data mining clustering algorithm K-Means.

Highlights

Around the world, organizations are looking for a complete data analytics solution to cut costs, accelerate development cycles, and provide valuable information to solve some of their biggest organizational problems.
The K-Means algorithm is applied to all the data collected from the Data Lake and stored in MongoDB, for data classification and clustering.
K-means clustering was executed twice: once with the standard model and once with the model developed in this study.

Summary

INTRODUCTION

Organizations are looking for a complete data analytics solution to cut costs, accelerate development cycles, and provide valuable information to solve some of their biggest organizational problems. They view their data assets as an engine driving economic activity and giving them a competitive edge. However, it becomes difficult to place confidence in the accuracy and veracity of this data, as well as to use it carefully [3] [4]. To solve this problem, organizations have implemented systems with a clustering strategy. This paper concentrates on various data sources centralized in a Data Lake and analyzes them based on a common targeted schema [8]. These data are collected and mapped into a NoSQL database, MongoDB. The K-Means algorithm is then applied to all the data collected from the Data Lake and stored in MongoDB, for data classification and clustering.
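To make the idea of clustering on weighted metadata concrete, here is a minimal sketch in plain Python. It is not the authors' model: the metadata attribute names (size_kb, field_count), the weight values, and the sample documents are hypothetical, and the documents are ordinary dictionaries standing in for records that would be read from a MongoDB collection. The clustering itself is the standard Lloyd's iteration for K-means, applied to vectors whose components have been scaled by the assumed attribute weights.

```python
import random

def weighted_vector(doc, weights):
    """Scale each numeric metadata attribute by its (assumed) weight."""
    return [doc[name] * w for name, w in weights.items()]

def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, k, iterations=100, seed=0):
    """Standard Lloyd's K-means; returns (centroids, labels)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # pick k distinct points as starting centroids
    labels = None
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments are stable, so the centroids are too
        labels = new_labels
        # Update step: each centroid moves to the mean of its members.
        for c in range(k):
            members = [p for p, l in zip(points, labels) if l == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return centroids, labels

# Hypothetical metadata documents, shaped like records fetched from MongoDB.
docs = [
    {"size_kb": 10, "field_count": 3},
    {"size_kb": 12, "field_count": 4},
    {"size_kb": 11, "field_count": 3},
    {"size_kb": 200, "field_count": 40},
    {"size_kb": 210, "field_count": 42},
    {"size_kb": 205, "field_count": 41},
]
# Assumed attribute weights; setting every weight to 1.0 gives unweighted K-means.
weights = {"size_kb": 0.3, "field_count": 0.7}

points = [weighted_vector(d, weights) for d in docs]
centroids, labels = kmeans(points, k=2)
print(labels)
```

Changing the weights changes how much each attribute contributes to the distance, which is the lever a weighted-metadata model has over the resulting clusters. With all weights set to 1.0, the same function runs unweighted K-means on these hypothetical attributes.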

Objectives and Contribution

RELATED WORKS

Metadata Analysis

K-means Algorithm

SPECIFIC OBJECTIVES OF PROPOSAL SYSTEM

Data Flow Diagram

Servers Availability Process

IMPLEMENTATION AND EVALUATION

Running MongoDB

Running K-means Algorithm based on Metadata

CONCLUSION AND FUTURE WORK


Journal: International Journal of Advanced Computer Science and Applications
Publication Date: Jan 1, 2020
Citations: 7
License type: cc-by

