SharkDB: an in-memory storage system for large scale trajectory data management

Haozhou Wang

doi:10.14264/uql.2016.101

Abstract

The rapid development of location-based technologies including animal tracking sensors, GPS devices embedded in taxis and buses, and smart phones carried by people has quickly led to the capability of collecting spatio-temporal information about almost any kind of moving object, resulting in huge volumes of spatio-temporal data in the form of trajectories. Business intelligence now has more interest in analysing large amounts of trajectory data rather than the data on hard disk, since querying on such trajectory data can reveal useful information. Due to high access latency and low I/O operations of hard disks, the disk-based storage systems (with traditional data structures) have been challenged by modern applications (e.g. location-based services), which require real time responses when querying large scale trajectory datasets. Therefore, novel data structures and query algorithms need to be designed to meet this requirement. In this thesis, a series of concrete and challenging problems about storing, managing and analysing large scale trajectory data are studied. A complete in-memory column-oriented storage system called SharkDB is implemented to address these problems and support real time computing for trajectory queries. Below is a brief description of contributions. First of all, a preliminary study has been conducted to identify the trajectory synchronisation problem on large scale trajectory dataset. Based on this observation, a novel data structure, which is called a frame based data structure, is proposed to synchronise trajectories based on their temporal information. Meanwhile, to improve performance of trajectory queries, the frame based data structure is optimised by implementing this data structure into main memory with compression and CPU cache-optimisation techniques. After implementing a frame based data structure, challenges with regard to trajectory query processing are investigated. To address these challenges, the trajectory queries are divided into three categories, i.e. basic operations, advanced operations and analytic operations. For each category, a naive algorithm is proposed first. To improve the query performance, for the category of basic operations, a parallel computing technique is used to speed up the running time of the query. For the category of advanced operations, a hierarchical I/P frame structure based approach is proposed. For the category of analytic operations, a MBR+KMP algorithm is presented. To evaluate SharkDB, a comprehensive experimental study including operation level evaluation and system level evaluation is conducted. In the operation level evaluation, query processing using the proposed algorithms are compared to a traditional trajectory data structure. The extensive experiments demonstrate that the newly designed algorithms can guarantee real-time trajectory query processing on large scale trajectory dataset. A set of workload models that reflect real world workloads is proposed in the system level evaluation. The experiments on such workload models also verify the superiority of SharkDB over traditional data structures. Finally, as in-memory database management systems are receiving more attention today, some commercial in-memory database management systems have been released. Hence, in the collaboration with SAP, SharkDB is migrated into SAP HANA. To achieve this, the data structures of SharkDB are re-designed to suit the architecture of SAP HANA. A set of experiments are conducted to show that SharkDB can beat other popular traditional data structures in regard to trajectory query processing.

Full Text