Abstract

We propose a novel two-layer error control code that efficiently combines the error detection capability of rectangular codes with the error correction capability of Hamming product codes, in order to increase cache error resilience in many-core systems while keeping power, area and latency overhead low. Because rectangular codes have low latency and overhead and Hamming product codes have high error control capability, the two-layer code uses a simple rectangular code on each cache line to detect errors and loads the extra Hamming product code check bits only when an error is detected, thus enabling reliable large-scale cache operation. Analysis and experiments evaluate the cache fault-tolerance capability of various existing solutions and of the proposed approach. Compared to complex four-way 4EC5ED, the proposed approach increases Mean-Error-To-Failure (METF) and Mean-Time-To-Failure (MTTF) by up to 2.8×, reduces storage overhead by over 57%, and increases instructions per cycle (IPC) by up to 7%. Compared to simple eight-way single-error correcting double-error detecting (SECDED), it increases METF and MTTF by up to 133×, reduces storage overhead by over 11%, and achieves a similar IPC. The cost of the proposed approach is no more than 4% external memory access overhead.
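The detect-first, correct-on-demand flow described in the abstract can be illustrated with a minimal Python sketch. It assumes the textbook form of a rectangular code (one parity bit per row and per column of a bit grid); the grid dimensions, the function names, and the `load_second_layer` callback that stands in for fetching the Hamming product code check bits are hypothetical and not taken from the paper.

```python
from typing import Callable, List, Tuple

Bits = List[int]


def rect_encode(data: Bits, rows: int, cols: int) -> Tuple[Bits, Bits]:
    """Row and column parity bits for data laid out row-major in a rows x cols grid."""
    assert len(data) == rows * cols
    row_par = [sum(data[r * cols:(r + 1) * cols]) % 2 for r in range(rows)]
    col_par = [sum(data[c::cols]) % 2 for c in range(cols)]
    return row_par, col_par


def rect_detect(data: Bits, rows: int, cols: int,
                row_par: Bits, col_par: Bits) -> bool:
    """True if the recomputed parity disagrees with the stored parity."""
    return rect_encode(data, rows, cols) != (row_par, col_par)


def read_line(data: Bits, rows: int, cols: int,
              row_par: Bits, col_par: Bits,
              load_second_layer: Callable[[Bits], Bits]) -> Bits:
    """First layer: cheap parity check on every read.
    Second layer: invoked only when the check fails.
    load_second_layer is a hypothetical placeholder for fetching the
    extra check bits and running the stronger decoder."""
    if not rect_detect(data, rows, cols, row_par, col_par):
        return data                      # common case: no error detected
    return load_second_layer(data)       # rare case: take the slow path


if __name__ == "__main__":
    line = [1, 0, 1, 1, 0, 0, 1, 0]      # toy 2 x 4 "cache line"
    rp, cp = rect_encode(line, 2, 4)
    corrupted = line[:]
    corrupted[5] ^= 1                    # inject a single-bit error
    print(rect_detect(line, 2, 4, rp, cp), rect_detect(corrupted, 2, 4, rp, cp))
```

The sketch shows only why the common, error-free read stays cheap. It does not implement Hamming product code decoding, and a plain row/column parity check misses some patterns, such as four flipped bits at the corners of a rectangle in the grid.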

Highlights

  • Reliability is a main concern in future multi-core and many-core designs

  • Exploiting the low latency and power overhead of rectangular codes and the high error correction capability of Hamming product codes, we propose a novel two-layer error control code (ECC) that efficiently combines the error detection capability of rectangular codes with the error correction capability of Hamming product codes, improving system reliability while maintaining low area, power, and latency overhead [4]

  • 16-way single-error correcting double-error detecting (SECDED): 16 interleaved SECDED codes are applied for every cache line

  • 8-way SECDED: 8 interleaved SECDED codes are applied for every cache line

  • 8-way double-error correcting triple-error detecting (DECTED): 8 interleaved DECTED codes are applied for every cache line

  • 4-way DECTED: 4 interleaved DECTED codes are applied for every cache line

  • 4-way 4EC5ED: 4 interleaved 4EC5ED codes are applied for every cache line

  • 2-way 4EC5ED: 2 interleaved 4EC5ED codes are applied for every cache line
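To see why interleaving matters for burst errors, the following Python sketch counts how many bits of a contiguous burst land in each interleaved codeword. It assumes simple round-robin bit interleaving (bit i of the cache line belongs to codeword i mod ways); the paper's actual bit mapping is not given here, and the function names are hypothetical.

```python
from collections import Counter
from typing import List


def errors_per_codeword(burst_start: int, burst_len: int, ways: int) -> List[int]:
    """Number of flipped bits each codeword sees for one contiguous burst,
    assuming round-robin bit interleaving (an assumption of this sketch)."""
    counts = Counter((burst_start + k) % ways for k in range(burst_len))
    return [counts.get(w, 0) for w in range(ways)]


def burst_correctable(burst_start: int, burst_len: int, ways: int, t: int) -> bool:
    """True if no codeword receives more than t errors, where t is the
    per-codeword correction capability (1 for SECDED, 2 for DECTED, 4 for 4EC5ED)."""
    return max(errors_per_codeword(burst_start, burst_len, ways)) <= t


if __name__ == "__main__":
    for name, ways, t in [("16-way SECDED", 16, 1), ("8-way SECDED", 8, 1),
                          ("8-way DECTED", 8, 2), ("4-way DECTED", 4, 2),
                          ("4-way 4EC5ED", 4, 4), ("2-way 4EC5ED", 2, 4)]:
        # Longest single burst that is still correctable under this model.
        longest = max(b for b in range(1, 65) if burst_correctable(0, b, ways, t))
        print(name, longest)
```

Under this model a single contiguous burst stays correctable up to ways × t bits. The model covers only one burst per cache line and says nothing about accumulated random errors, which the METF and MTTF evaluations address.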

Summary

Introduction

Reliability is a main concern in future multi-core and many-core designs. On the one hand, technology scaling makes it possible to integrate ever-increasing numbers of transistors on a single chip, enabling many-core system designs; on the other hand, scaling has brought about an increase in various error sources, such as process, voltage and temperature (PVT) variation, electromagnetic radiation and device aging. Caches therefore need fault tolerance to improve system reliability in future many-core systems. An ever-increasing cache error rate, especially for multi-bit upsets (burst errors), calls for highly fault-tolerant cache mechanisms. At the same time, the latency, area and power overhead introduced by these mechanisms must be minimized to meet the strict power and latency budgets of future many-core systems.

Related Works
Soft Error Management Techniques
Hard Error Management Techniques
Reliable Circuits and Device Design for Cache
Two-Layer Error Control Codes Combining Rectangular and Hamming Product Codes
Rectangular Codes
Hamming Product Codes
Two-Layer ECC Combining Rectangular and Hamming Product Codes
Analysis and Experimental Results
Experimental Setup
Burst Error Control Capability
Mean-Error-to-Failure Evaluation
Mean-Time-to-Failure Evaluation
Overhead Evaluation
Performance Degradation
Findings
Conclusions