A Cloud-Based Framework for Large-Scale Log Mining through Apache Spark and Elasticsearch

Yun Li, Greguska Frank, Thomas Huang, Mingyue Lu, David Moroni, Juan Gu, Chaowei Yang, Lewis Mcgibbney, Yongyao Jiang, Edward Armstrong, Manzhu Yu

doi:10.3390/app9061114

Open Access


Abstract

The volume, variety, and velocity of data, e.g., simulation data, observation data, and social media data, are growing ever faster, posing grand challenges for data discovery. An increasing trend in data discovery is to mine hidden relationships among users and metadata from web usage logs to support the data discovery process. Web usage log mining is the process of reconstructing sessions from raw logs and finding interesting patterns or implicit linkages. The mining results play an important role in improving the quality of search-related components, e.g., ranking, query suggestion, and recommendation. Although research has been done in the data discovery domain, collecting and analyzing logs efficiently remains a challenge because (1) the volume of web usage logs continues to grow as long as users access the data; (2) the dynamic volume of logs requires on-demand computing resources for mining tasks; and (3) the mining process is compute-intensive and time-intensive. To speed up the mining process, we propose a cloud-based log-mining framework using Apache Spark and Elasticsearch. In addition, a data partition paradigm, logPartitioner, is designed to solve the data imbalance problem in data parallelism. As a proof of concept, oceanographic data search and access logs are chosen to validate the performance of the proposed parallel log-mining framework.

Highlights

Spatial data portals (SDPs) serve the Earth science community with massive geospatial data [1].
As the volume and variety of the data are increasing faster than ever, they pose a great challenge for SDPs to provide reliable and quality service [2].
The MUDROD engine manages logs and converts them into a series of sessions and domain knowledge such as oceanographic vocabulary linkages based on Elasticsearch [4,5].

Summary

Introduction

Spatial data portals (SDPs) serve the Earth science community with massive geospatial data [1]. As the volume and variety of the data are increasing faster than ever, they pose a great challenge for SDPs to provide reliable and quality service [2]. An emerging trend in data discovery is mining user behaviors from logs for the latent linkages between users and data [3]. Taking the Physical Oceanography Distributed Active Archive Center (PO.DAAC) as an example, a solution called Mining and Utilizing Dataset Relevancy from Oceanographic Dataset (MUDROD) was proposed to improve oceanography data discovery and access by mining user behavior data [2]. The MUDROD engine manages logs and converts them into a series of sessions and domain knowledge such as oceanographic vocabulary linkages based on Elasticsearch [4,5]. Elasticsearch is a component of the ELK stack and, together with Logstash and Kibana, provides a solution to automatically index, search, analyze, and visualize logs.
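
To make the idea of converting raw logs into sessions concrete, the following is a minimal sketch in plain Python rather than Spark or Elasticsearch. It is not the MUDROD implementation: the record layout, the use of the client IP as the user identifier, the 30-minute inactivity timeout, and the names `RAW_LOGS` and `reconstruct_sessions` are all assumptions made for illustration.

```python
from collections import defaultdict
from datetime import datetime, timedelta

# Hypothetical log records: (client IP, ISO timestamp, requested URL).
RAW_LOGS = [
    ("10.0.0.1", "2019-01-01T10:00:00", "/search?q=sea+surface+temperature"),
    ("10.0.0.2", "2019-01-01T10:01:00", "/search?q=ocean+wind"),
    ("10.0.0.1", "2019-01-01T10:05:00", "/dataset/A"),
    ("10.0.0.1", "2019-01-01T11:30:00", "/search?q=salinity"),
    ("10.0.0.2", "2019-01-01T10:03:00", "/dataset/B"),
]


def reconstruct_sessions(logs, timeout=timedelta(minutes=30)):
    """Group requests by IP, then split each group wherever the gap
    between consecutive requests exceeds the (assumed) timeout."""
    by_user = defaultdict(list)
    for ip, ts, url in logs:
        by_user[ip].append((datetime.fromisoformat(ts), url))

    sessions = []
    for ip, events in by_user.items():
        events.sort(key=lambda e: e[0])  # order each user's requests by time
        current = [events[0]]
        for prev, event in zip(events, events[1:]):
            if event[0] - prev[0] > timeout:
                # Inactivity gap: close the current session, start a new one.
                sessions.append((ip, [url for _, url in current]))
                current = []
            current.append(event)
        sessions.append((ip, [url for _, url in current]))
    return sessions


SESSIONS = reconstruct_sessions(RAW_LOGS)
for ip, urls in SESSIONS:
    print(ip, urls)
```

Because each user's requests are grouped independently, this kind of grouping is a natural unit for data parallelism; users with very different request counts are one way the data imbalance mentioned in the abstract could arise.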

Methods

Results

Conclusion