Abstract

In aggressively scaled technologies, reliability concerns such as oxide breakdown have become a key issue. Dynamic reliability management (DRM) has been proposed as a mechanism to dynamically explore the trade-off between system performance and reliability margin. However, existing DRM methods are hampered by the fact that they do not accurately model spatial and temporal variations in process and temperature parameters which have a strong impact on chip reliability. In addition, they make the simplifying assumption that the future workloads are identical to the currently observed one. This makes them sensitive to sudden workload variations and outliers. In this paper, we present a novel workload-aware dynamic reliability management framework that accounts for local variations in both the process and temperature. The reliability estimation, along with the predicted remaining workload is fed to a dynamic voltage/frequency scaling module to manage the system reliability and optimize processor performance. Using a fast on-line analytical/table-look-up method we demonstrate an average error of 1% with up to 5 orders of magnitude speedup compared to Monte Carlo simulation. Experiments on an Alpha-like processor show our DRM framework fully utilizes the available margin and achieves 28.7% performance improvement on average.

Full Text
Published version (Free)

Talk to us

Join us for a 30 min session where you can share your feedback and ask us any queries you have

Schedule a call