Abstract

In today's data centers, clusters of servers are arranged to perform various tasks in a massively distributed manner: handling web requests, processing scientific data, and running simulations of real-world problems. These clusters are very complex and require significant planning and administration to perform at their full potential. Planning and configuration can be a long and complicated process, and once it is complete, it is hard to re-architect an existing cluster. In addition to planning the physical hardware, the software must also be properly configured to run on a cluster. Information such as which server sits in which rack and the total network bandwidth between rows of racks constrains the placement of jobs scheduled to run on a cluster. Some software can use user-provided hints about where to schedule jobs, while other systems simply place them randomly and hope for the best. Every cluster has at least one bottleneck that constrains overall performance to less than the theoretical optimum. One common bottleneck is network speed: communication between servers within a rack may not saturate their network links, but traffic flowing between racks or rows in a data center can easily overwhelm the interconnect switches. Various network topologies have been proposed to mitigate this problem by providing multiple paths between points in the network, but they all suffer from the same fundamental limitation: it is cost-prohibitive to build a network that provides concurrent full bandwidth between all servers. Researchers have been developing new network protocols that make more efficient use of existing network hardware by blurring the line between the network layer and applications. One of the most well-known examples is Named Data Networking (NDN), a data-centric network architecture that has been in development for several years. While NDN has received significant attention for the wide-area Internet, a detailed understanding of its benefits and challenges in the data center environment has been lacking. The NDN architecture retrieves content by name rather than by connecting to specific hosts, and it provides benefits such as highly efficient and resilient content distribution, which fit well with data-intensive distributed computing. This paper presents and discusses our experience modifying Apache Hadoop, a popular MapReduce framework, to operate on an NDN network. Through this first-of-its-kind implementation, we demonstrate the feasibility of running an existing, large, and complex piece of distributed software commonly seen in data centers over NDN. We show advantages such as simplified network code and reduced network traffic, both of which are beneficial in a data center environment. NDN also faces challenges, which the community is addressing, and these can be magnified under data center traffic. Through detailed evaluation, we show a 16% reduction in overall data transmission between Hadoop nodes when writing data with default replication settings. Preliminary results also show promise for in-network caching of repeated reads in distributed applications. While overall performance is currently slower under NDN, we identify both challenges and opportunities for further NDN improvements.
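
To make the name-based retrieval model concrete, the following minimal sketch (not taken from the paper's implementation) shows how a Hadoop-style consumer could request one segment of a stored block over NDN using the jNDN client library. The name /hadoop/dfs/... is a purely illustrative naming scheme, and the sketch assumes a local NFD forwarder is running: the consumer expresses an Interest for the data by name, and any replica, or any in-network cache along the path, holding that name can satisfy it.

    import net.named_data.jndn.Data;
    import net.named_data.jndn.Face;
    import net.named_data.jndn.Interest;
    import net.named_data.jndn.Name;
    import net.named_data.jndn.OnData;
    import net.named_data.jndn.OnTimeout;

    public class NdnBlockFetch {
        public static void main(String[] args) throws Exception {
            // Connect to the local NDN forwarder (NFD) on this host.
            Face face = new Face();

            // Hypothetical name for one segment of a stored block: the request
            // names the data itself, not the host (IP/port) that stores it.
            Name blockName = new Name("/hadoop/dfs/blockpool-1/blk-1073741825/seg-0");

            Interest interest = new Interest(blockName);
            interest.setInterestLifetimeMilliseconds(4000);

            final boolean[] done = { false };
            face.expressInterest(interest,
                new OnData() {
                    public void onData(Interest i, Data d) {
                        // Any replica, or an in-network cache along the path,
                        // that holds this name can answer the Interest.
                        System.out.println("Received " + d.getName().toUri()
                            + " (" + d.getContent().size() + " bytes)");
                        done[0] = true;
                    }
                },
                new OnTimeout() {
                    public void onTimeout(Interest i) {
                        System.out.println("Timed out: " + i.getName().toUri());
                        done[0] = true;
                    }
                });

            // Drive the client-side event loop until the exchange finishes.
            while (!done[0]) {
                face.processEvents();
                Thread.sleep(10);
            }
            face.shutdown();
        }
    }

Under this model, a repeated read of the same name could be served from a router's Content Store rather than from the original datanode, which is the in-network caching behavior for repeated reads mentioned in the abstract.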
