Dynamic Load Balancing for Direct Server Return Networks Using eBPF for In-Band Metric Feedback

Abstract

Direct Server Return (DSR) enables backend servers to send responses directly to clients, bypassing the load balancer on the return path. Removing that extra hop trims end-to-end latency and prevents the balancer from becoming a bottleneck at high request rates. This paper introduces a backward-compatible DSR variant that encodes each server’s load metric inside an Internet Protocol (IP) option, so the metric travels with ordinary data packets and no polling traffic is needed. A Linux extended Berkeley Packet Filter (eBPF) prototype adds only a small patch to the data path, yet yields up to 47% more requests per second than an explicit-polling baseline while requiring no changes to either the client or server. The proposed solution does not modify application logic and supports dynamic load balancing under heterogeneous and variable workloads, such as microservices, batch processing, or machine learning inference. It is fully deployable on commodity servers, runs entirely in kernel space, and eliminates separate metric-collection traffic. Performance evaluation demonstrates significant throughput and latency improvements, enabling the large-scale, low-overhead load balancing that real deployments require.
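The in-band feedback idea can be illustrated with a short sketch of encoding a load metric in an IPv4 option and recovering it when parsing the options field. The option kind value, field widths, and function names below are illustrative assumptions, not the paper's actual layout (the prototype itself is an eBPF program running in kernel space):

```python
import struct
from typing import Optional

# Assumed option layout: kind (1 byte), length (1 byte), 16-bit load
# metric. The 4-byte total conveniently keeps the IPv4 header aligned
# to a 32-bit boundary, as the options field requires.
LOAD_OPT_KIND = 0xDE  # hypothetical experimental option kind


def encode_load_option(load_metric: int) -> bytes:
    """Pack a server load metric (0..65535) into an IP option blob."""
    if not 0 <= load_metric <= 0xFFFF:
        raise ValueError("load metric must fit in 16 bits")
    return struct.pack("!BBH", LOAD_OPT_KIND, 4, load_metric)


def decode_load_option(options: bytes) -> Optional[int]:
    """Scan an IPv4 options field and return the load metric, if present."""
    i = 0
    while i < len(options):
        kind = options[i]
        if kind == 0:            # End of Option List
            return None
        if kind == 1:            # No-Operation (single byte, no length)
            i += 1
            continue
        if i + 1 >= len(options):
            return None          # truncated option
        length = options[i + 1]
        if length < 2 or i + length > len(options):
            return None          # malformed option
        if kind == LOAD_OPT_KIND and length == 4:
            return struct.unpack("!H", options[i + 2:i + 4])[0]
        i += length              # skip unrelated options
    return None
```

Because the metric rides inside the header of packets the server is sending anyway, the balancer learns each backend's load at line rate without any dedicated polling traffic.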

Similar Papers
  • Research Article
  • Citations: 11
  • 10.1109/tc.2023.3257513
SAMBA: Sparsity Aware In-Memory Computing Based Machine Learning Accelerator
  • Sep 1, 2023
  • IEEE Transactions on Computers
  • Dong Eun Kim + 3 more

Machine Learning (ML) inference is typically dominated by highly data-intensive Matrix Vector Multiplication (MVM) computations that may be constrained by a memory bottleneck due to massive data movement between processor and memory. Although analog in-memory computing (IMC) ML accelerators have been proposed to execute MVM with high efficiency, the latency and energy of such computing systems can be dominated by the large latency and energy costs of analog-to-digital converters (ADCs). Leveraging sparsity in ML workloads, reconfigurable ADCs can save MVM energy and latency by reducing the required ADC bit precision. However, such improvement in latency can be hindered by the non-uniform sparsity of the weight matrices mapped into hardware. Moreover, data movement between MVM processing cores may become another factor that delays overall system-level performance. To address these issues, we propose SAMBA, a Sparsity Aware IMC Based Machine Learning Accelerator. First, we propose load balancing during the mapping of weight matrices into physical crossbars to eliminate non-uniformity in the sparsity of the mapped matrices. Second, we propose optimizations in arranging and scheduling the tiled MVM hardware to minimize the overhead of data movement across multiple processing cores. Our evaluations show that the proposed load balancing technique achieves performance improvements. The proposed optimizations further improve both performance and energy efficiency regardless of sparsity conditions. With the combination of load balancing and data movement optimization in conjunction with reconfigurable ADCs, our proposed approach provides up to 2.38x speed-up and 1.54x energy efficiency over state-of-the-art analog IMC based ML accelerators on the ResNet-50 architecture for the ImageNet dataset.

  • Research Article
  • 10.12694/scpe.v10i2.606
Parallel and Distributed Computing Techniques, Selection of papers from ISPDC 2008
  • Jan 1, 2009
  • Scalable Computing Practice and Experience
  • Marek Tudruj

Dear SCPE Reader, We present a selection of papers which are extensions of papers presented at the 7th International Symposium on Parallel and Distributed Computing, 1–5 July 2008, in Krakow, Poland. The motivation for publishing the selection in the SCPE Journal was, on the one hand, to present the flavour of the research reported at the conference and, on the other hand, to present some of the most relevant topics currently in focus in research on parallel and distributed computing in general. The selection contains only 6 papers out of about 60 presented at the conference, and thus is far from covering all relevant topics represented at ISPDC 2008. This is because not all of the invited authors were patient enough to accept a fairly long paper publishing process. Nevertheless, we hope that the presented papers will bring you closer to the research covered by the ISPDC conferences and will encourage you to participate in future ISPDC editions. The first paper, ``The Impact of Workload Variability on Load Balancing Algorithms,'' is by Marta Beltran and Antonio Guzman from King Juan Carlos University in Spain. It concerns an important topic of load balancing in cluster systems, namely the adaptivity of load balancing algorithms to changes of the workload in the system. Adequate accounting for additional load in the hosting system is of great relevance for correct optimization effects. The paper presents a thorough formal analysis of workload variability metrics and their influence on the quality of load balancing algorithms. Four basic activities appearing in load balancing algorithms are identified, and based on them some algorithmic solutions are proposed to correctly deal with workload variability in system load balancing. The problem of the robustness of dynamic load balancing algorithms is also discussed.
Two robustness metrics, sensitive to the applied type of optimization (local task-oriented or global), enable selecting remote task execution or migration as the load balancing operation. The proposed approach is illustrated with experiments. The second paper, ``Model-Driven Engineering and Formal Validation of High-Performance Embedded Systems,'' is by Abdoulaye Gamatie, Eric Rutten, Huafeng Yu, Pierre Boulet, and Jean-Luc Dekeyser, from the University of Lille and INRIA in France. The paper is concerned with a very advanced methodology for designing correct parallel embedded systems for intensive data-parallel computing. In their previous research, the authors of the paper designed the GASPARD embedded system design framework. It is based on the hardware/software co-design approach through model-driven engineering. The framework is based on a UML-like model specification language in which hardware and software elements are modelled using a component approach with special mechanisms for repetitive structures. This paper tries to combine the modelling framework of GASPARD with the mechanisms of synchronous languages to achieve the design verifiability provided for such languages. The paper shows how GASPARD models can be translated into synchronous models based on data flow equations in order to formally check their correctness. The proposed approach is illustrated with an example of a video processing system. The third paper, ``Relations Between Several Parallel Computational Models,'' is by Stephan Bruda and Yuanqiao Zhang from Bishop’s University in Canada. The paper is concerned with theoretical aspects of shared memory systems described by the parallel random access machine (PRAM) model and aims at studying the performance properties of different types of PRAM systems.
The attention is focused on analysing the computational power of two more sophisticated PRAM models (Combining CRCW and Broadcast with Selective Reduction), which include data reduction in the case of concurrent writes. The paper shows that these two models have equivalent computational power, which is a new result compared with the existing literature. The performance of both models applied to reconfigurable multiple bus machines was studied as a possible architectural solution for current VLSI processor implementations. It was shown that in such systems, under reasonable assumptions, concurrent write does not enhance performance compared with the exclusive-write model. Another result important for VLSI technology is that the Combining CRCW PRAM model (in which data of concurrent writes are arithmetically or logically combined before the write) and exclusive write on directed reconfigurable buses perform equivalently under strong real-time requirements. The fourth paper, ``Experiences with Mesh-Like Computations Using Prediction Binary Trees,'' is by Gennaro Cordasco, Biagio Cosenza, Rosario de Chiara, Ugo Erra and Vittorio Scarano from the University ``degli Studi'' of Salerno and the University ``degli Studi della Basilicata'' of Potenza in Italy. The paper concerns optimization methods for mesh-like computations in clusters of processors. The computations are performed assuming a phase-like program execution control using a tiling approach which reduces inter-processor communication. Temporal coherence is also assumed, which means that task sizes provide similar execution times in consecutive phases. Temporally coherent computations are structured in a Prediction Binary Tree, in which leaves represent computing tiles to be mapped to processors. A phase-by-phase semi-static load balancing is introduced into the scheduling algorithm.
The scheduling algorithm is equipped with a predictor, which estimates the computation time of the next phase's tiles based on previous execution times and modifies the tiles to achieve balanced execution in phases. For this, two heuristics are used to leverage data locality in processors. The proposed approach is illustrated by the example of interactive rendering with a Parallel Ray Tracing algorithm. The fifth paper, ``The Influence of the IBM pSeries Servers Virtualization Mechanism on Dynamic Resource Allocation in AIX 5L,'' is by Maciej Mlynski from ASpartner Limited in Poland. The paper concerns a very up-to-date problem of system virtualization and presents the results of research carried out on IBM pSeries servers. IBM is strongly developing the virtualization technique, especially on IBM pSeries servers, enabling improved and flexible sharing of system resources between applications. The paper investigates novel facilities for dynamic resource management such as micro-partitioning and the partition load manager. They enable dynamic creation of workload logical partitions of system resources and their dynamic management. This includes run-time resource re-allocation between logical partitions, including the setting of sharing specifications, as well as run-time adding/removing/setting of resource parameters in the system. It remains an open question how to properly tune the parameters of the operating system using the provided virtualization facilities to obtain the best efficiency for a given application program. The paper presents the results of experiments which study the effects of tuning the disk subsystem parameters under the IBM AIX 5L operating system, with the use of the provided virtualization facilities, on the resulting application execution performance. The results show that even a small deterioration in the resource pool status requires an immediate adaptation of the operating system parameters to maintain the required performance.
The sixth paper, ``HeteroPBLAS: A Set of Parallel Basic Linear Algebra Subprograms Optimized for Heterogeneous Computational Clusters,'' is by Ravi Reddy, Alexey Lastovetsky and Pedro Alonso from University College Dublin in Ireland and the Polytechnic University of Valencia in Spain. The paper concerns the methodology for the parallelization of linear algebra computations for execution in heterogeneous cluster environments. The design of the HeteroPBLAS library (Parallel Basic Linear Algebra Subprograms) for heterogeneous computational clusters is presented. The main contribution of the paper is the automation of the parallelization and optimization of the PBLAS, which is done by means of a special user interface and an underlying set of functions. An important element here is a performance model, based on program code instrumentation, which determines the parameters of the application and of the executing heterogeneous platform relevant to the execution performance of the parallel code. The parameter values specified for, or returned by, execution of the performance model functions are then used for the generation and optimal mapping of the parallel code of the library subroutines. The proposed approach is illustrated by experimental results from the execution of optimized HeteroPBLAS programs on homogeneous and heterogeneous computing clusters. Marek Tudruj

  • Conference Article
  • Citations: 3
  • 10.1109/clustr.2009.5289142
Fast-Response Dynamic Routing Balancing for high-speed interconnection networks
  • Jan 1, 2009
  • D Lugones + 2 more

Communication requirements in High Performance Computing systems demand the use of high-speed interconnection networks to connect processing nodes. However, when the communication load is unevenly distributed across the network resources, message congestion appears. Congestion spreading increases latency and reduces network throughput, causing significant performance degradation. Fast-Response Dynamic Routing Balancing (FR-DRB) is a method developed to perform uniform balancing of the communication load over the interconnection network. FR-DRB distributes the message traffic based on a gradual and load-controlled path expansion. The method monitors network message latency and makes decisions about the number of alternative paths to be used between each source-destination pair for message delivery. FR-DRB performance has been compared with other routing policies under a representative set of traffic patterns commonly created by parallel scientific applications. Experimental results show a significant improvement in latency and throughput.

  • Conference Article
  • Citations: 106
  • 10.1145/2619239.2626317
Duet
  • Aug 17, 2014
  • Rohan Gandhi + 6 more

Load balancing is a foundational function of datacenter infrastructures and is critical to the performance of online services hosted in datacenters. As the demand for cloud services grows, expensive and hard-to-scale dedicated hardware load balancers are being replaced with software load balancers that scale using a distributed data plane that runs on commodity servers. Software load balancers offer low cost, high availability and high flexibility, but suffer high latency and low capacity per load balancer, making them less than ideal for applications that demand high throughput, low latency, or both. In this paper, we present Duet, which offers all the benefits of a software load balancer, along with low latency and high availability -- at next to no cost. We do this by exploiting a hitherto overlooked resource in the data center networks -- the switches themselves. We show how to embed the load balancing functionality into existing hardware switches, thereby achieving organic scalability at no extra cost. For flexibility and high availability, Duet seamlessly integrates the switch-based load balancer with a small deployment of software load balancers. We enumerate and solve several architectural and algorithmic challenges involved in building such a hybrid load balancer. We evaluate Duet using a prototype implementation, as well as extensive simulations driven by traces from our production data centers. Our evaluation shows that Duet provides 10x more capacity than a software load balancer, at a fraction of the cost, while reducing latency by a factor of 10 or more, and is able to quickly adapt to network dynamics, including failures.

  • Research Article
  • Citations: 113
  • 10.1145/2740070.2626317
Duet
  • Aug 17, 2014
  • ACM SIGCOMM Computer Communication Review
  • Rohan Gandhi + 6 more

Load balancing is a foundational function of datacenter infrastructures and is critical to the performance of online services hosted in datacenters. As the demand for cloud services grows, expensive and hard-to-scale dedicated hardware load balancers are being replaced with software load balancers that scale using a distributed data plane that runs on commodity servers. Software load balancers offer low cost, high availability and high flexibility, but suffer high latency and low capacity per load balancer, making them less than ideal for applications that demand high throughput, low latency, or both. In this paper, we present Duet, which offers all the benefits of a software load balancer, along with low latency and high availability -- at next to no cost. We do this by exploiting a hitherto overlooked resource in the data center networks -- the switches themselves. We show how to embed the load balancing functionality into existing hardware switches, thereby achieving organic scalability at no extra cost. For flexibility and high availability, Duet seamlessly integrates the switch-based load balancer with a small deployment of software load balancers. We enumerate and solve several architectural and algorithmic challenges involved in building such a hybrid load balancer. We evaluate Duet using a prototype implementation, as well as extensive simulations driven by traces from our production data centers. Our evaluation shows that Duet provides 10x more capacity than a software load balancer, at a fraction of the cost, while reducing latency by a factor of 10 or more, and is able to quickly adapt to network dynamics, including failures.

  • Conference Article
  • Citations: 4
  • 10.1145/3018009.3018014
Software-defined load balancer in cloud data centers
  • Nov 26, 2016
  • Renuga Kanagavelu + 1 more

Today's data centers deploy load balancers to balance the traffic load across multiple servers. Commercial load balancers are highly specialized machines located at the front end of a data center. When a client request arrives at the data center, the load balancer determines the server that will service this client's request. It routes the request to an appropriate server based on native policies such as round-robin, random, or others, without considering the traffic state. It is not possible to implement arbitrary policies on such load balancers, as they are vendor-specific. Apart from that, this piece of hardware is expensive and becomes a single point of failure. In this paper, we develop a software-defined network (SDN) based load balancing architecture with a load-aware policy, using an OpenFlow switch connected to an SDN controller and commodity servers. It is less expensive compared with commercial load balancers and has programming flexibility in terms of applying arbitrary policies by writing modules in the SDN controller. With the facility of supporting multiple controller connections in commercially available OpenFlow switches, the system is robust to single points of controller failure. We develop a prototype implementation of the proposed SDN-based load balancer and carry out a performance study to demonstrate its effectiveness.
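The contrast between the native policies and a load-aware policy is easy to sketch. The function names and the representation of load as a per-server number are illustrative assumptions, not the paper's API:

```python
import itertools


def make_round_robin(num_servers):
    """Native policy: cycle through servers, ignoring their current load."""
    cycle = itertools.cycle(range(num_servers))
    return lambda loads: next(cycle)


def load_aware_pick(loads):
    """Load-aware policy: route the request to the least-loaded server."""
    return min(range(len(loads)), key=lambda i: loads[i])
```

A round-robin picker will happily keep sending requests to an overloaded server; the load-aware picker reacts to the traffic state, which is the kind of policy an SDN controller module can implement but a fixed-function appliance cannot.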

  • Conference Article
  • Citations: 5
  • 10.1109/secon.1994.324331
Throughput improvement through dynamic load balance
  • Apr 10, 1994
  • H.B More + 1 more

Dynamic load balancing improves the performance of a multiprocessor system by reallocating tasks such that all the processors are evenly loaded. Problems in several areas qualify for dynamic load balancing. The paper studies the performance of load balancing while solving a branch and bound problem using hypercubes. If the hypercube has link faults, special measures need to be taken to balance the load. An algorithm for load balancing in the presence of link faults is described. The performance improvement obtained with the help of load balancing using the link-fault-tolerant algorithm is observed through simulation.

  • Conference Article
  • Citations: 4
  • 10.1109/i-span.2009.136
Distributed Adaptive Load Balancing for P2P Grid Systems
  • Jan 1, 2009
  • Po-Jung Huang + 3 more

Due to the demand for mass distributed computing and efficient data transmission, grid systems have started to integrate with P2P technology to support high-performance distributed computing. However, the workload on P2P grid computing systems can be highly variable, and its unstable behavior can strongly affect system performance. In general, the high variability of the workload leads to wrong load balancing decisions made from out-of-date resource status, and these wrong decisions are difficult to correct during execution. This study proposes a dynamic adaptive load balancing strategy to dynamically balance the workload across grid sites. This load balancing strategy can not only deal with workload variability but also improve resource utilization in P2P grid systems. The prototype is implemented on the sites of the Taiwan UniGrid. The experimental results show that the proposed algorithm performs well and can efficiently distribute the workload for execution.

  • Research Article
  • 10.3390/ijgi14030109
Dynamic Load Balancing Based on Hypergraph Partitioning for Parallel Geospatial Cellular Automata Models
  • Mar 1, 2025
  • ISPRS International Journal of Geo-Information
  • Wei Xia + 5 more

Parallel computing techniques have been adopted in geospatial cellular automata (CA) models to improve computational efficiency, enabling large-scale complex simulations of land use and land cover (LULC) changes at fine scales. However, the spatial distribution of computational intensity often changes along with the spatiotemporal dynamics of LULC during the simulation, leading to an increase in load imbalance among computing units and degradation of the computational performance of a parallel CA. This paper presents a dynamic load balancing method based on hypergraph partitioning for multi-process parallel geospatial CA models. During the simulation, the sub-domains are dynamically reassigned to computing processes through hypergraph partitioning according to the spatial variation in computational workloads to restore load balance. In addition, a novel mechanism called Migrated-SubCellspaces-First (MSCF) is proposed to reduce the cost of workload migration by employing a non-blocking communication technique to further improve computational performance. To demonstrate and evaluate the effectiveness of our method, a parallel geospatial CA model with hypergraph-based dynamic load balancing is developed. Experiments using a dataset from California showed that the proposed dynamic load balancing method achieved a computational performance enhancement of 62.59% by using 16 processes compared with a parallel CA with static load balancing.

  • Conference Article
  • Citations: 25
  • 10.1109/ride.1992.227420
Chained declustering: load balancing and robustness to skew and failures
  • Feb 2, 1992
  • L Golubchik + 2 more

There has been considerable research concerning the use of arrays of disks in solving I/O bottleneck problems, where high availability of data is achieved through some form of data redundancy, e.g. mirroring. This paper investigates the degree to which a dynamic load balancing disk scheduling algorithm, in conjunction with chained declustering, an alternative to the classical mirroring scheme, can respond robustly to variations in workload and to disk failures. Specifically, it defines and investigates the behavior of two dynamic scheduling algorithms under various workload distributions and disk failures. It demonstrates that using a simple dynamic scheduling algorithm can greatly improve the average response time compared with static load balancing.
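Chained declustering's placement rule is simple enough to sketch. The function names below are illustrative, and the dynamic policy shown (send each read to the less-loaded of the two replicas) is just one possible instance of the dynamic scheduling the paper studies:

```python
def chained_decluster(num_blocks, num_disks):
    """Chained declustering: block i's primary copy lives on disk
    i mod N, and its backup on the next disk in the chain,
    (i + 1) mod N, so the two copies are never on the same disk
    (for N >= 2) and any single disk failure loses no data."""
    return {b: (b % num_disks, (b + 1) % num_disks)
            for b in range(num_blocks)}


def route_read(placement, loads, block, failed=None):
    """Dynamic scheduling sketch: read from the less-loaded replica,
    falling back to the surviving copy if one disk has failed."""
    primary, backup = placement[block]
    candidates = [d for d in (primary, backup) if d != failed]
    return min(candidates, key=lambda d: loads[d])
```

Unlike classical mirroring, where a failed disk's mirror absorbs all of its load, the chained layout lets the extra load spread along the chain, which is what makes the dynamic scheduling robust to skew and failures.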

  • Research Article
  • Citations: 413
  • 10.1145/263326.263344
Exploiting process lifetime distributions for dynamic load balancing
  • Aug 1, 1997
  • ACM Transactions on Computer Systems
  • Mor Harchol-Balter + 1 more

We consider policies for CPU load balancing in networks of workstations. We address the question of whether preemptive migration (migrating active processes) is necessary, or whether remote execution (migrating processes only at the time of birth) is sufficient for load balancing. We show that resolving this issue is strongly tied to understanding the process lifetime distribution. Our measurements indicate that the distribution of lifetimes for a UNIX process is Pareto (heavy-tailed), with a consistent functional form over a variety of workloads. We show how to apply this distribution to derive a preemptive migration policy that requires no hand-tuned parameters. We used a trace-driven simulation to show that our preemptive migration strategy is far more effective than remote execution, even when the memory transfer cost is high.
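The heavy-tailed insight admits a compact sketch: under a Pareto lifetime distribution with shape near 1, a process's expected remaining lifetime scales with its current age, so only sufficiently old processes are worth migrating. The threshold rule and names below are an illustrative simplification of the paper's criterion, not its exact policy:

```python
def expected_remaining_lifetime(age):
    """For a Pareto lifetime distribution with shape alpha ~ 1
    (P[lifetime > t] ~ c/t), a process that has already run for
    `age` seconds has a median remaining lifetime of about `age`
    more seconds: old processes tend to keep running."""
    return age


def should_migrate(age, migration_cost):
    """Migrate only when the expected remaining work exceeds the
    one-time cost of moving the process to another host."""
    return expected_remaining_lifetime(age) > migration_cost
```

This is why the distribution matters: under an exponential (memoryless) lifetime model, age would tell us nothing and remote execution at birth would suffice, but under the measured Pareto distribution the oldest processes are precisely the ones whose migration pays for itself.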

  • Research Article
  • 10.1504/ijnvo.2019.10017050
Priority-based bandwidth allocation and load balancing for multipath IP networks
  • Jan 1, 2019
  • International Journal of Networking and Virtual Organisations
  • V Rekha + 1 more

Frequent route failures are common in a multipath Internet Protocol (IP) network. Backup configuration is one of the techniques used to re-establish an alternate path in case of route failure. The existing multi-route configurations (MRC) for fast IP networks do not address quality of service (QoS) issues such as bandwidth optimisation and load balancing with traffic shaping. There is a need to focus on priority-based routing and load balancing during congestion in multipath networks. In this paper, we propose a priority-based bandwidth allocation and load balancing (PBALB) approach for multipath routing. In order to reduce packet drops and to enhance fairness, throughput, and lower-delay transmissions, traffic shaping based on different types of traffic flows in the differentiated services domain is proposed. Experimental results show that the proposed PBALB technique improves throughput compared with MRC.

  • Research Article
  • Citations: 5
  • 10.14569/ijacsa.2022.0130414
Software Defined Network based Load Balancing for Network Performance Evaluation
  • Jan 1, 2022
  • International Journal of Advanced Computer Science and Applications
  • Omran M A Alssaheli + 3 more

Load balancing distributes incoming network traffic across multiple controllers, which improves the availability of the internet for users. Load balancing is responsible for maintaining internet availability to users 24 hours a day, 7 days a week. However, the internet can become unavailable when the load balancer is inflexible, costly, and non-programmable in its settings adjustment, especially in managing network traffic congestion. With increasing numbers of users on mobile devices and cloud facilities, current load balancers have limitations, which motivates the deployment of a Software-Defined Network (SDN). SDN decouples network control, applications, network services, and forwarding roles, hence making the network more flexible, affordable, and programmable. Furthermore, it has been found that SDN load balancing performs intelligent actions efficiently and maintains better QoS (Quality of Service) performance. This study proposes the application of SDN-based load balancing, since it provides pre-defined servers in the server farm that receive the arriving Internet Protocol (IP) data packets from various clients with an equal share of the load and processing order for each server. Experiments have been conducted using Mininet™ and based on several scenarios (Scenario A, Scenario B, and Scenario C) of network topologies. The parameters used to evaluate load balancing in SDN are throughput, delay, and jitter. Findings indicate that Scenario A gives high throughput, Scenarios B and C produce low jitter values, and Scenario C produces the lowest delay. The impact of SDN brings a multi-path adaptive direction in finding the best route for better network performance.

  • Book Chapter
  • 10.1016/b978-0-12-809927-8.00014-2
Chapter 8 - Demultiplexing
  • Jan 1, 2022
  • Network Algorithmics
  • George Varghese + 1 more


  • Book Chapter
  • Citations: 7
  • 10.1007/978-981-15-1097-7_2
A Dynamic ACO-Based Elastic Load Balancer for Cloud Computing (D-ACOELB)
  • Jan 1, 2020
  • K Jairam Naik

Cloud computing is the delivery of computational services such as servers, software, databases, storage, networks, intelligence, analytics, and more via the Internet (“the cloud”), offering faster innovation, economies of scale, and flexible resources. The workload in cloud computing is defined as the amount of computational work the computer has been given to do at a given time. This computational workload comprises some number of users connected to and interacting with the computer’s applications, in addition to the application programs running in the system. This workload may change dynamically based on the available resources and the end users. Hence, a key challenge in cloud computing is balancing the workload among the virtual machines (VMs) of the systems. So, there is a need for an elastic dynamic load balancer for distributing workloads efficiently in the cloud. A load balancer can distribute loads among the VMs of multiple servers or compute resources. The use of such an elastic, dynamic load balancer can enhance the fault-tolerance capability of applications and the availability of cloud resources. As needs change, adding or removing computing resources without disrupting the overall flow of requests is made possible by this elastic, dynamic load balancer. Generally, elastic load balancing offers three types of balancers: Application Load Balancers, Classic Load Balancers, and Network Load Balancers. The Application Load Balancer can be chosen based on the needs of the application. This chapter proposes a dynamic and elastic approach (D-ACOELB) for workload balancing in cloud data centers based on Ant Colony Optimization (ACO).
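The core ACO step, probabilistic VM selection biased by pheromone trails followed by evaporation and deposit, can be sketched as follows. This is a generic ACO selection rule with assumed parameter names, not the chapter's D-ACOELB algorithm itself:

```python
import random


def pick_vm(pheromone, heuristic, alpha=1.0, beta=2.0, rng=None):
    """Classic ACO rule: choose VM j with probability proportional to
    pheromone[j]**alpha * heuristic[j]**beta. Here `heuristic` might
    be, e.g., the inverse of each VM's current load (an assumption)."""
    rng = rng or random.Random(0)  # seeded for reproducibility
    weights = [(p ** alpha) * (h ** beta)
               for p, h in zip(pheromone, heuristic)]
    r = rng.random() * sum(weights)
    acc = 0.0
    for j, w in enumerate(weights):
        acc += w
        if r <= acc:
            return j
    return len(weights) - 1


def update_pheromone(pheromone, chosen, rho=0.1, deposit=1.0):
    """Evaporate every trail by factor (1 - rho), then reinforce the
    trail of the VM that was just assigned work."""
    return [(1 - rho) * p + (deposit if j == chosen else 0.0)
            for j, p in enumerate(pheromone)]
```

Evaporation is what gives the balancer its elasticity: trails toward VMs that stop being chosen (because they became overloaded, or were removed) decay away, so the assignment distribution tracks the changing pool of resources.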
