Abstract
With the dramatic rise in mobile internet users and the administrative requirements for long-term data retention, telecom providers face increasingly challenging problems in storing and retrieving call detail records (CDRs). Existing storage systems only meet the requirements of online query and offline analysis of CDRs. To the best of our knowledge, few studies have focused on optimizing CDR retrieval under long-term storage. To improve retrieval speed while ensuring a high compression ratio, in this paper we propose a novel hash storage scheme, termed dual-column bucketing (DCB), built on the Hive platform and making use of its bucketing feature. Compared with the conventional scheme, the proposed DCB scheme improves performance for both CDR compression and query. In addition, similar storage scenarios, such as the storage of SMS, email, and extended detail records (XDRs), fall within the optimization scope of DCB. Experiments on real-world CDRs show that, compared with the conventional scheme, the proposed DCB scheme saves approximately 40% of storage space, reduces the amount of disk reads to 2%, and improves the retrieval speed of known-phone-number queries by up to seven times.
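The general idea behind hash bucketing, on which the abstract says DCB builds, can be sketched as follows. This is a minimal illustration and not the DCB scheme itself: it buckets on a single column only, and the field names (`caller`, `start_time`), the bucket count, and the use of CRC32 as the hash function are all assumptions made for the example. They are not taken from the paper, and CRC32 is not claimed to be Hive's bucketing hash.

```python
import zlib
from collections import defaultdict

NUM_BUCKETS = 8  # assumed bucket count, chosen only for illustration


def bucket_of(phone_number: str, num_buckets: int = NUM_BUCKETS) -> int:
    """Map a phone number to a bucket index with a deterministic hash.

    CRC32 is an assumption for this sketch, not Hive's actual hash function.
    """
    return zlib.crc32(phone_number.encode("utf-8")) % num_buckets


def build_buckets(records, num_buckets: int = NUM_BUCKETS):
    """Group records by the bucket of their (hypothetical) 'caller' field.

    Rows inside each bucket are sorted by (caller, start_time), so records
    for the same number sit next to each other.
    """
    buckets = defaultdict(list)
    for rec in records:
        buckets[bucket_of(rec["caller"], num_buckets)].append(rec)
    for rows in buckets.values():
        rows.sort(key=lambda r: (r["caller"], r["start_time"]))
    return buckets


def query_by_caller(buckets, phone_number: str, num_buckets: int = NUM_BUCKETS):
    """Answer a known-number query by scanning only one bucket."""
    rows = buckets.get(bucket_of(phone_number, num_buckets), [])
    return [r for r in rows if r["caller"] == phone_number]
```

Because the bucket index is computed from the queried number, a lookup touches only one of the buckets instead of scanning every record, which is the kind of read reduction that bucketing aims for. Keeping rows with the same key adjacent may also help compression, although this sketch does not measure it.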
Highlights
Nowadays, the mobile communication network has become an indispensable part of people’s daily life
We propose sorting on multiple key columns under the dual-column bucketing (DCB) scheme
Preliminaries: the relevant Hive techniques involved in this paper provide essential support for the implementation of our DCB scheme
Summary
The mobile communication network has become an indispensable part of people’s daily life. In [10], the authors proposed PageFile, a hybrid page-based storage structure on the MapReduce framework. It offers faster query processing and better disk space utilization than Hive’s RCFile [11] on the TPC-H data set. The researchers mentioned above contribute greatly through file-level optimizations, including compression algorithms, file format structure, and record analysis, to improve storage efficiency and retrieval performance on popular data sets; however, they do not focus on scheme optimization, nor do they target CDRs, which are a special data set. Linear search over massive volumes of CDRs performs poorly no matter how the table is partitioned or which of the file-level or record-level contributions mentioned in Section I.A are adopted. We need to design a scheme that fulfils two key requirements to tackle the query problem.