A distributed in-memory database system for large-scale spatial-temporal trajectory data

Douglas Alves Peixoto

doi:10.14264/uql.2019.725

Abstract

Spatial-temporal trajectory data contains rich information about moving objects and phenomena, and hence has been widely used in a great number of real-world applications. However, the ubiquity and complexity of spatial-temporal trajectory data have made it challenging to efficiently store, process, and query such data. Furthermore, the increasing number of users also challenges the ability of trajectory-based services and analytics to handle the query workload and respond to multiple requests in a satisfactory time.

Over the last few years, a new class of systems, referred to as distributed in-memory database systems, has emerged to handle large amounts of data efficiently. These systems were designed to overcome the difficulties of scaling traditional structured and unstructured data loads that some applications must handle. Spark has become the framework of choice for large-scale low-latency data processing using distributed in-memory computation. However, Spark-based systems still lack the ability to handle several trajectory database tasks in a memory-wise manner. Some desirable features of trajectory database systems include data preparation and pre-processing, large-scale data storage and retrieval, and multi-user concurrent query processing. Providing a full-fledged system architecture that supports these features is challenging and remains an open issue.

Therefore, driven by the increasing interest in scalable and efficient systems for trajectory-based analytics, we propose a distributed in-memory database system for memory-wise storage and scalable processing of spatial-temporal trajectory data, with low query latency and high throughput. We build our system on top of the Spark MapReduce framework, which provides an in-memory and fault-tolerant environment for distributed parallel processing of large-scale data.
Existing works on spatial data in MapReduce, however, either lack support for spatial-temporal trajectory data or only provide disk-based storage with costly I/O, which negatively affects query performance. Furthermore, none of the state-of-the-art applications address the problem of memory-wise utilization, which is the main drawback of in-memory frameworks such as Spark. In this thesis we propose new features for the Spark framework in order to provide native support for spatial-temporal trajectory data, with low latency, high throughput, and memory-wise storage.

Our architecture follows a complete framework for trajectory data storage and processing, with trajectory data preparation, data pre-processing, data storage, and concurrent query processing. Firstly, we provide a novel model for trajectory data representation, and a system for loading, parsing, integration, and compression of trajectory data. Secondly, we introduce a novel framework for trajectory pre-processing using map-matching on top of Spark, in order to achieve data quality by means of data cleaning and simplification. Finally, we introduce two novel approaches for data storage and multi-user trajectory query processing on top of Spark. In the first approach, we propose novel partitioning and storage methods focused on distance-based queries; in addition, given the large number of techniques available, we provide a system for evaluating trajectory distance measures.
In the second approach, we propose a novel memory-wise and workload-aware system for trajectory data storage, focused on data retrieval and spatial-temporal queries over large-scale trajectory data; a key feature of our system is the ability to identify query hotspots and exchange data between main memory and disk based on the query workload, while still leveraging the scalability, fault-tolerance, efficiency, and concurrency control features of Spark.

Although current techniques provide a good starting point for trajectory data management on top of Spark, they are unable to provide all the features of our work. The superiority of our architecture comes from the research and development of both novel and state-of-the-art techniques for trajectory data management, using a well-established framework for large-scale data applications. We believe our system will support scientists and professionals working with large-scale trajectory-based applications.
