Performance Optimization Strategies for Big Data Applications in Distributed Framework

Mir Wajahat Hussain, Diptendu Sinha Roy

doi:10.1007/978-981-99-1482-1_10

Abstract

Advances in Information and Communication Technologies (ICT) have enabled large-scale distributed computing, with a wide range of applications serving a massive number of users. These applications generate large volumes of data, which severely burden both the processing capacity of computers and inflexible traditional networks. State-of-the-art datacenter-level performance fixes still fall short of handling the processing, storage, and network movement of this voluminous data with proprietary protocols. This chapter focuses on improving backend server performance through effective reducer placement, an intelligent compression policy, and the handling of slower tasks, and on boosting in-network performance through effective traffic engineering, traffic classification, topology discovery, energy minimization, and load balancing in datacenter-oriented applications. Hadoop, the de facto standard for distributed big data storage and processing, is designed to store large datasets reliably with its Hadoop Distributed File System (HDFS) and to process them with its MapReduce engine. However, the processing performance of Hadoop depends critically on the time taken to transfer data during the shuffle phase of MapReduce. In addition, during concurrent execution of tasks, slower tasks need to be properly identified and efficiently handled to improve job completion time.
To overcome these limitations, three contributions are made: (i) compressing the generated map outputs at a suitable time, before all the map tasks have completed, to shift load from the network onto the CPU; (ii) placing the reducer on the nodes where the most computation has been done, based on a pair of counters, one maintained at the rack level and another at the node level, to minimize run-time data copying; and (iii) placing the slower map tasks on the nodes where the most computation has been done, with network traffic handled through prioritization.
Software defined networking (SDN) has been a boon for next-generation networking owing to its separation of the control plane from the data plane. It can address network requirements in a timely manner by setting up flows for data movement in both directions and by gathering extensive network statistics at the controller to make informed decisions about the network. A core issue for the controller is traffic classification, which can substantially assist SDN controllers in making efficient routing and traffic engineering decisions. This chapter presents a traffic classification scheme that uses three classifiers, namely a Feed-forward Neural Network (FFNN), Logistic Regression (LR), and Naïve Bayes, and employs Particle Swarm Optimization (PSO) for improved traffic classification with less overhead and without overlooking key Quality of Service (QoS) criteria. Minimizing energy consumption and link utilization is also important for lowering the operating cost of the network and for using the network effectively. The chapter addresses this issue by formulating a multi-objective problem that also accounts for the QoS constraints. Since no polynomial-time solution is known, an evolutionary metaheuristic based on clonal selection, namely Clonal Selection Based Energy Minimization (CSEM), has been devised.
The obtained results show the efficacy of the proposed traffic classification scheme and the CSEM-based solution compared with state-of-the-art techniques.
SDN is a promising newer network paradigm, but security issues and the high capital cost of procuring SDN equipment limit its full deployment; moving to a hybrid SDN (h-SDN) deployment is therefore the logical way forward. The use of both centralized and decentralized paradigms in h-SDN, with its intrinsic interoperability issues, complicates two key tasks: topology gathering by the controller for proper allocation of network resources, and traffic engineering for optimum network performance. State-of-the-art protocols for topology gathering, such as the Link Layer Discovery Protocol (LLDP) and the Broadcast Domain Discovery Protocol (BDDP), require a huge number of messages and gather link information only for SDN devices, leaving out the links of legacy switches (LS), which results in sub-optimal performance. This chapter provides novel topology discovery schemes that require fewer messages and gather link information for all devices in both single-controller and multi-controller environments (the latter may be used when scalability is an issue in h-SDN). Traffic engineering problems in h-SDN are addressed by proper placement of SDN nodes, based on an analysis of the key criteria of traffic details and node degree, while lowering link utilization in real-time topologies. The results of the proposed schemes for topology discovery and SDN node placement demonstrate their merits compared with state-of-the-art protocols.
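As an illustration of contribution (ii), the following is a minimal Python sketch of counter-based reducer placement. It is not the chapter's implementation: the function name `pick_reducer_node`, the use of map-output bytes as the measure of "computation done", and the rule of choosing the busiest rack first and then the busiest node within it are all assumptions made for illustration.

```python
from collections import defaultdict


def pick_reducer_node(map_output_bytes, rack_of):
    """Pick a reducer node from two counters (hypothetical helper).

    map_output_bytes: dict node -> bytes of map output produced on that node
                      (assumed stand-in for the node-level counter).
    rack_of:          dict node -> rack identifier.
    """
    if not map_output_bytes:
        raise ValueError("no map output recorded")

    # Rack-level counter: total map output produced inside each rack.
    rack_counter = defaultdict(int)
    for node, nbytes in map_output_bytes.items():
        rack_counter[rack_of[node]] += nbytes

    # Choose the rack holding the most map output (ties broken by rack name).
    best_rack = max(sorted(rack_counter), key=rack_counter.get)

    # Within that rack, choose the node holding the most map output
    # (ties broken by node name).
    candidates = sorted(n for n in map_output_bytes if rack_of[n] == best_rack)
    return max(candidates, key=map_output_bytes.get)


# Example with made-up numbers: n3 has the largest single-node output,
# but rack r1 holds more data in total, so the reducer goes to n1.
output = {"n1": 500, "n2": 300, "n3": 700, "n4": 50}
racks = {"n1": "r1", "n2": "r1", "n3": "r2", "n4": "r2"}
chosen = pick_reducer_node(output, racks)
```

Checking the rack-level counter first keeps most of the shuffle traffic inside a single rack, and the node-level counter then avoids copying the largest partition at all. Other ways of combining the two counters are equally plausible.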

Full Text