Abstract

The Internet of Things (IoT) has shown great promise in recent years. Cloud computing can provide the infrastructure for storing and handling the potentially enormous volume of data generated by IoT devices. Consequently, the availability and reliability of cloud data will largely affect the success of IoT. Hadoop is a widely adopted platform in the cloud computing community, and the Hadoop Distributed File System (HDFS) is its default file system. HDFS keeps multiple copies of data files within a Hadoop cluster to prevent data loss. However, this approach still cannot guarantee the availability and reliability of data when a disaster, such as a fire or an earthquake, destroys the entire Hadoop cluster. As a result, maintaining data backups across different Hadoop clusters is essential for achieving high availability and reliability of cloud data. Currently, distcp is the only tool HDFS provides to duplicate data files among Hadoop clusters deployed at different locations. Unfortunately, users must execute distcp manually, which cannot guarantee timely synchronization of the duplicated data files among Hadoop clusters. Moreover, distcp always transfers the entire contents of data files between Hadoop clusters, no matter how little new data has actually been written, which can waste considerable time and network bandwidth in practice. We designed and implemented an efficient scheme in HDFS, named syncopy (synchronous copy), that automatically performs real-time synchronization of data files duplicated among different Hadoop clusters. Our experimental results show that syncopy can reduce the required time by up to 99.20% compared with distcp.
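For context, the baseline approach criticized above amounts to a manual, whole-file copy between clusters (e.g., running "hadoop distcp hdfs://cluster-a:8020/data hdfs://cluster-b:8020/backup" by hand). The following minimal Java sketch, using the standard Hadoop FileSystem API, illustrates that whole-file transfer; the cluster addresses and paths are hypothetical, and this is not the paper's syncopy implementation.

    import java.net.URI;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.FileUtil;
    import org.apache.hadoop.fs.Path;

    public class WholeFileBackup {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Hypothetical NameNode addresses for the source and backup clusters.
            FileSystem srcFs = FileSystem.get(URI.create("hdfs://cluster-a:8020"), conf);
            FileSystem dstFs = FileSystem.get(URI.create("hdfs://cluster-b:8020"), conf);
            Path src = new Path("/data/events.log");
            Path dst = new Path("/backup/events.log");
            // The entire file is transferred even if only a few bytes changed,
            // which is the inefficiency the abstract attributes to distcp.
            FileUtil.copy(srcFs, src, dstFs, dst, false /* deleteSource */, conf);
        }
    }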
