Abstract
Declustering techniques are widely used in distributed environments to reduce query response time through parallel I/O by splitting large files into several small blocks and then distributing those blocks among multiple storage nodes. However, many small geospatial image data files cannot be split further for distributed storage. In this paper, we propose a complete theoretical system for the distributed storage of small geospatial image data files, based on mining the access patterns of geospatial image data from their historical access log information. First, an algorithm is developed to construct an access correlation matrix from the log information, which reveals the patterns of access to the geospatial image data. Then, a practical heuristic algorithm is developed to determine a reasonable storage solution based on the access correlation matrix. Finally, a number of comparative experiments are presented, demonstrating that our algorithm achieves a total parallel access probability approximately 10–15% higher than those of other algorithms and that performance can be further improved by more than 20% by simultaneously applying a copy storage strategy. These experiments show that the algorithm can be applied in distributed environments to help realize parallel I/O and thereby improve system performance.
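The two steps described above (building an access correlation matrix from the log, then placing files heuristically) can be illustrated with a minimal sketch. It is not the paper's algorithm. It assumes a log of hypothetical `(timestamp, file_id)` records, treats two files as correlated when they are requested within `window` seconds of each other, and uses a simple greedy rule that puts each file on the node holding the files it is least often co-accessed with.

```python
from collections import defaultdict


def co_access_matrix(log, window):
    """Count how often two distinct files are requested within `window` seconds.

    `log` is a list of (timestamp, file_id) records (an assumed format).
    The result maps frozenset({file_a, file_b}) to a co-access count.
    """
    counts = defaultdict(int)
    records = sorted(log)
    for i, (t_i, f_i) in enumerate(records):
        for t_j, f_j in records[i + 1:]:
            if t_j - t_i > window:
                break  # records are sorted, so later ones are further away
            if f_i != f_j:
                counts[frozenset((f_i, f_j))] += 1
    return counts


def greedy_placement(files, counts, num_nodes):
    """Assign each file to the node where it adds the least co-access weight.

    Ties are broken by the number of files already on the node, which keeps
    the nodes roughly balanced. This rule is an assumption of the sketch.
    """
    nodes = [[] for _ in range(num_nodes)]
    placement = {}
    for f in files:
        def cost(n):
            weight = sum(counts.get(frozenset((f, g)), 0) for g in nodes[n])
            return (weight, len(nodes[n]))
        best = min(range(num_nodes), key=cost)
        nodes[best].append(f)
        placement[f] = best
    return placement
```

Files that are frequently requested together end up on different nodes, so a burst of related requests can be served in parallel rather than queueing on one node.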
Highlights
Declustering is one of the most effective methods in the field of parallel I/O and can be widely used to improve system performance by splitting and distributing large files among multiple storage nodes to speed up access to data
To evaluate the performance of the algorithm, several tasks were experimentally investigated: 1) selecting the geospatial image dataset to be stored in distributed storage nodes; 2) finding the optimal T based on the historical access log information recorded by the Digital Earth server [20] using the heuristic algorithm proposed in Section 4; 3) requesting the same dataset simultaneously based on other historical access log information; and 4) computing the total parallel access probability (TPAP) performance and comparing it with those of the Location-based distributed Storage Algorithm (LSA) and the Random distributed Storage Algorithm (RSA)
We define the TPAP performance as follows: $\mathrm{TPAP} = \sum_{i=1}^{L} \sum_{j=1}^{m} x_{ij}$, where $L \times m$ denotes the total number of requests for small files over a long period and $x_{ij}$ denotes whether the $j$th storage node is accessed during a short period
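The TPAP definition above can be made concrete with a short sketch. It assumes that the index i runs over L short periods and j over m storage nodes, and that x_ij is 1 when node j serves at least one request in period i. The `periods` and `placement` structures are hypothetical inputs chosen for illustration. The function returns the plain double sum; any normalization is left out because none is stated here.

```python
def access_indicator(periods, placement, num_nodes):
    """Build the L x m indicator matrix x.

    `periods` is a list of sets of file ids requested in each short period,
    and `placement` maps each file id to its storage node index (both are
    assumed formats). x[i][j] is 1 if node j is accessed in period i.
    """
    x = []
    for requested in periods:
        row = [0] * num_nodes
        for file_id in requested:
            row[placement[file_id]] = 1
        x.append(row)
    return x


def tpap_sum(x):
    """Sum x[i][j] over all periods i and nodes j."""
    return sum(sum(row) for row in x)
```

A placement that spreads the files requested in the same period over more nodes sets more entries of x to 1 and therefore yields a larger sum, which is what makes the measure useful for comparing storage strategies on the same request trace.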
Summary
Declustering is one of the most effective methods in the field of parallel I/O and can be widely used to improve system performance by splitting and distributing large files among multiple storage nodes to speed up access to data. The Google file system (GFS) is a well-known distributed file system in which each large file is divided into several blocks of fixed size. Each block (approximately 64 megabytes (MB)) is stored in multiple different storage nodes to enhance concurrency and system performance [1]. A number of other similar systems, such as RAID (Redundant Array of Independent Disks) systems [2] and geospatial information systems (GISs) [3], have been developed, all of which use declustering technologies for the distributed storage of large files.
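As a rough illustration of declustering for large files, the sketch below splits a byte string into fixed-size blocks and spreads them over storage nodes in round-robin order. The round-robin rule and the function names are assumptions made for the example; they do not describe how GFS or any RAID level actually places blocks or replicas.

```python
def split_into_blocks(data: bytes, block_size: int):
    """Cut `data` into consecutive blocks of at most `block_size` bytes."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def round_robin(blocks, num_nodes):
    """Assign block k to node k % num_nodes, keeping each block's index."""
    nodes = [[] for _ in range(num_nodes)]
    for index, block in enumerate(blocks):
        nodes[index % num_nodes].append((index, block))
    return nodes


def reassemble(nodes):
    """Collect the indexed blocks from all nodes and restore the original bytes."""
    indexed = sorted(pair for node in nodes for pair in node)
    return b"".join(block for _, block in indexed)
```

Because consecutive blocks land on different nodes, a sequential read of the file can fetch several blocks at once. A small file that fits in a single block gains nothing from this scheme, which is the gap the paper addresses.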