Abstract

With the dramatic rise in mobile internet users and the administrative requirements of long-term data retention, telecom providers face increasingly challenging storage and retrieval problems for call detail records (CDRs). Existing storage systems can only meet the requirements of online query and offline analysis of CDRs. However, to the best of our knowledge, few studies have focused on optimizing CDR retrieval under long-term storage. To improve retrieval speed while ensuring a high compression ratio, in this paper we propose a novel hash storage scheme, termed dual-column bucketing (DCB), built on the Hive platform and making use of its bucketing feature. Compared to the conventional scheme, the proposed DCB scheme improves performance for both CDR compression and query. In addition, similar storage scenarios, such as the storage of SMS, email, and extended detail records (XDRs), fall within the optimization scope of DCB. Experiments on real-world CDRs show that, compared to the conventional scheme, the proposed DCB scheme saves approximately 40% of storage space, reduces the amount of disk reads to 2%, and improves the retrieval speed of known-phone-number queries by up to seven times.

Highlights

  • Nowadays, the mobile communication network has become an indispensable part of people’s daily life

  • We propose multiple key columns sorting under the dual-column bucketing (DCB) scheme

  • The relevant Hive techniques involved in this paper provide essential support for the implementation of our DCB scheme


Summary

INTRODUCTION

The mobile communication network has become an indispensable part of people’s daily life. In [10], the authors proposed PageFile, a hybrid page-based storage structure on the MapReduce framework. It achieves faster query processing and better disk space utilization than Hive’s RCFile [11] on the TPC-H data set. Although the researchers mentioned above contribute greatly through file-level optimizations, including compression algorithms, file format structure, and record analysis, to improve storage efficiency and retrieval performance on popular data sets, they do not focus on scheme optimization or target CDRs, which are a special data set. The performance of linear search over massive volumes of CDRs is poor no matter how the table is partitioned or which of the file-level or record-level contributions mentioned in Section I.A are adopted. We therefore need to design a scheme that fulfills the following two key requirements to tackle the query problem.
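To make the cost of linear search concrete, the following is a minimal sketch of generic hash bucketing on a single phone-number column. It is not the DCB scheme itself, whose details are not given in this summary. Hive assigns rows to buckets by hashing the bucketing column modulo the number of buckets; the CRC32 hash, the bucket count, the record fields, and all function names here are illustrative assumptions.

```python
import zlib
from collections import defaultdict

NUM_BUCKETS = 8  # hypothetical bucket count, chosen only for illustration


def bucket_of(phone_number: str, num_buckets: int = NUM_BUCKETS) -> int:
    """Map a phone number to a bucket index with a stable hash.

    CRC32 is an assumption standing in for whatever hash the platform uses.
    """
    return zlib.crc32(phone_number.encode("utf-8")) % num_buckets


def build_buckets(records, key="caller"):
    """Group records into buckets by the hash of one key column."""
    buckets = defaultdict(list)
    for rec in records:
        buckets[bucket_of(rec[key])].append(rec)
    return buckets


def linear_query(records, phone_number, key="caller"):
    """Scan every record; return the matches and the number of records read."""
    hits = [r for r in records if r[key] == phone_number]
    return hits, len(records)


def bucketed_query(buckets, phone_number, key="caller"):
    """Scan only the bucket the phone number hashes to."""
    candidates = buckets.get(bucket_of(phone_number), [])
    hits = [r for r in candidates if r[key] == phone_number]
    return hits, len(candidates)


# Synthetic records with made-up numbers; not real CDR data.
records = [
    {
        "caller": f"1380000{i:04d}",
        "callee": f"1390000{(i * 7) % 1000:04d}",
        "duration_s": i % 300,
    }
    for i in range(1000)
]

if __name__ == "__main__":
    buckets = build_buckets(records)
    target = "13800000042"
    _, scanned_linear = linear_query(records, target)
    _, scanned_bucketed = bucketed_query(buckets, target)
    print(f"linear scan read {scanned_linear} records")
    print(f"bucketed lookup read {scanned_bucketed} records")
```

Both queries return the same matches, but the bucketed lookup reads only the records that share the target's bucket, which is the general reason bucketing helps known-phone-number queries.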

HIVE BUCKET
DATA SET DESCRIPTION
RELATED WORK
Findings
CONCLUSION AND FUTURE WORK
