A Hive-Based Retrieval Optimization Scheme for Long-Term Storage of Massive Call Detail Records

Xi Peng, Liang Liu, Lei Zhang

doi:10.1109/access.2019.2961692


Open Access

https://doi.org/10.1109/access.2019.2961692


Journal: IEEE Access
Publication Date: Jan 1, 2020
Citations: 2
License type: CC BY 4.0

Affiliation: China Telecom (China), Sichuan University

Abstract

With the dramatic rise in mobile internet users and the administrative requirements for long-term data retention, telecom providers face increasingly challenging storage and retrieval issues for call detail records (CDRs). Existing storage systems can only meet the requirements of online query and offline analysis of CDRs. However, to the best of our knowledge, few studies have focused on CDR retrieval optimization under long-term storage. To improve retrieval speed while ensuring a high compression ratio, in this paper we propose a novel hash storage scheme, termed dual-column bucketing (DCB), built on the Hive platform by making use of its bucketing feature. Compared to the conventional scheme, the proposed DCB scheme improves performance for both CDR compression and query. In addition, similar storage scenarios, such as the storage of SMS, email, and extended detail records (XDRs), fall within the optimization scope of DCB. Experiments on real-world CDRs show that, in contrast to the conventional scheme, the proposed DCB scheme saves approximately 40% of the storage space, reduces the amount of data read from disk to 2%, and improves the retrieval speed of known-phone-number queries by up to seven times.
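
The bucketing idea the abstract builds on can be illustrated with a minimal sketch. It shows generic single-column hash bucketing only: each record is placed in bucket `hash(key) mod N`, so a lookup for a known phone number reads one bucket instead of scanning everything. This is not the DCB scheme itself, whose details are not given on this page. The record fields (`caller`, `callee`, `seconds`), the bucket count, and the use of `zlib.crc32` as the hash function are all assumptions made for illustration.

```python
import zlib
from collections import defaultdict

NUM_BUCKETS = 8  # assumed bucket count, chosen only for illustration


def bucket_of(phone_number: str, num_buckets: int = NUM_BUCKETS) -> int:
    """Map a phone number to a bucket index with a stable hash."""
    # zlib.crc32 is a stand-in hash (an assumption), not Hive's own function.
    return zlib.crc32(phone_number.encode("utf-8")) % num_buckets


def build_buckets(records, num_buckets: int = NUM_BUCKETS):
    """Group records by the bucket of their (hypothetical) 'caller' field."""
    buckets = defaultdict(list)
    for rec in records:
        buckets[bucket_of(rec["caller"], num_buckets)].append(rec)
    return buckets


def query_by_caller(buckets, phone_number: str, num_buckets: int = NUM_BUCKETS):
    """Look up a known number by reading only the one bucket it hashes to."""
    candidates = buckets.get(bucket_of(phone_number, num_buckets), [])
    return [rec for rec in candidates if rec["caller"] == phone_number]


# Made-up sample records; real CDRs carry many more fields.
records = [
    {"caller": "13800000001", "callee": "13900000009", "seconds": 42},
    {"caller": "13800000002", "callee": "13800000001", "seconds": 7},
    {"caller": "13800000001", "callee": "13700000003", "seconds": 120},
    {"caller": "13600000004", "callee": "13800000002", "seconds": 15},
]

buckets = build_buckets(records)
hits = query_by_caller(buckets, "13800000001")
```

Note that this sketch buckets on the caller column alone, so a query for a number that appears only as the callee would still need a full scan.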

Highlights

The mobile communication network has become an indispensable part of people’s daily life.
We propose multiple-key-column sorting under the dual-column bucketing (DCB) scheme.
PRELIMINARIES: The relevant Hive techniques involved in this paper provide essential support for the implementation of our DCB scheme.

Summary

INTRODUCTION

The mobile communication network has become an indispensable part of people’s daily life. In [10], the authors proposed PageFile, a hybrid page-based storage structure on the MapReduce framework. It offers faster query processing and better disk space utilization than Hive’s RCFile [11] on the TPC-H data set. The researchers mentioned above contribute greatly through file-level optimizations, including compression algorithms, file format structure, and record analysis, to improve storage efficiency and retrieval performance on popular data sets, but they do not focus on scheme optimization or target CDRs, which are a special data set. The performance of linear search over massive volumes of CDRs is poor no matter how the table is partitioned or which of the file-level or record-level contributions mentioned in Section I.A are adopted. We need to design a scheme fulfilling the following two key requirements to tackle the query problem.

HIVE BUCKET

DATA SET DESCRIPTION

RELATED WORK

Findings

CONCLUSION AND FUTURE WORK

