Job replication has recently been studied as a mechanism to improve the performance and availability of systems with $n$ parallel servers, each with its own queue. A dispatcher, following some policy, sends $d$ $(1 \leq d \leq n)$ copies of a job to $d$ of the servers. As soon as the first copy completes at any of the $d$ servers, the remaining copies are eliminated from the system. This article introduces a data-driven method to derive closed-form expressions for the average response time and other job metrics as a function of the replication degree $d$. The method consists of developing a simulator for the system to generate a very large number of datasets over a wide range of input parameters; statistical and visual analysis of these data then yields the analytical models. It is important to distinguish between using simulation to obtain the value of a metric (e.g., average response time) of a computer system for given input parameter values and using our data-driven method to obtain closed-form expressions that relate output metrics to input parameters. The latter is the focus of our approach. The analysis covers homogeneous and heterogeneous servers with exponentially distributed service times, and homogeneous servers with hypo-exponentially and hyper-exponentially distributed service times. This article also presents a closed-form equation for the optimal replication degree for homogeneous servers with hypo-exponentially distributed service times.
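The replication mechanism described above can be illustrated with a small simulation. The abstract does not specify the authors' simulator, so the following is only a minimal sketch under stated assumptions: Poisson arrivals at rate `lam`, a dispatcher that picks `d` servers uniformly at random, independent exponentially distributed copy service times at rate `mu` on homogeneous FCFS servers, and instantaneous cancellation of the remaining copies. The function name `simulate_replication` and all parameter names are hypothetical.

```python
import random


def simulate_replication(n, d, lam, mu, num_jobs, seed=0):
    """Estimate the mean response time with d-way replication.

    Assumptions (not taken from the article): Poisson arrivals (rate lam),
    uniform random choice of d servers, i.i.d. exponential copy service
    times (rate mu), FCFS queues, instantaneous cancellation.
    """
    rng = random.Random(seed)
    queues = [[] for _ in range(n)]  # FCFS queue of job ids per server
    arrival_time = {}                # job id -> arrival time
    placement = {}                   # job id -> servers holding a copy
    now, next_id, done, total = 0.0, 0, 0, 0.0

    while done < num_jobs:
        # Servers with a non-empty queue are serving their head-of-line copy.
        busy = [s for s in range(n) if queues[s]]
        rate = lam + mu * len(busy)
        # Exponential service is memoryless, so the time to the next event
        # is exponential with the total rate of all competing events.
        now += rng.expovariate(rate)

        if rng.random() < lam / rate:
            # Arrival: place one copy of the job on each of d distinct servers.
            servers = rng.sample(range(n), d)
            for s in servers:
                queues[s].append(next_id)
            arrival_time[next_id] = now
            placement[next_id] = servers
            next_id += 1
        else:
            # Completion: the head-of-line copy on a random busy server finishes.
            s = rng.choice(busy)
            job = queues[s][0]
            # Cancel every copy of this job, including the one that finished.
            for t in placement.pop(job):
                queues[t].remove(job)
            total += now - arrival_time.pop(job)
            done += 1

    return total / done
```

Sweeping `d` from 1 to `n` for fixed `n`, `lam`, and `mu`, and repeating over many parameter combinations, would produce the kind of dataset to which a closed-form expression could then be fitted. Two sanity checks are available in this exponential setting: with `n = d = 1` the system is an M/M/1 queue with mean response time $1/(\mu - \lambda)$, and with `d = n` all queues are identical, so the head-of-line job is served at aggregate rate $n\mu$.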