Self-repair of uncore components in robust system-on-chips: An OpenSPARC T2 case study

Yanjing Li,Subhasish Mitra,Samy Makar,Eric Cheng

doi:10.1109/test.2013.6651907

Abstract

Self-repair replaces/bypasses faulty components in a system-on-chip (SoC) to keep the system functioning correctly even in the presence of permanent faults. Such faults may result from early-life failures, circuit aging, and manufacturing defects and variations. Unlike on-chip memories, processor cores, and networks-on-chip, little attention has been paid to self-repair of uncore components (e.g., cache controllers, memory controllers, and I/O controllers) that occupy significant portions of multi-core SoCs. In this paper, we present new techniques that utilize architectural features to achieve self-repair of uncore components while incurring low area, power, and performance costs. We demonstrate the effectiveness and practicality of our techniques, using the industrial OpenSPARC T2 SoC with 8 processor cores that support 64 hardware threads. Our key results are: 1. Our techniques enable effective self-repair of any single faulty uncore component with 7.5% post-layout chip-level area impact and 3% power impact. In contrast, existing redundancy techniques impose high (e.g., 16%) area costs. Our techniques do not incur any performance impact in fault-free systems. In the presence of a single faulty uncore component, there can be a 5% application performance impact. 2. Our techniques are capable of self-repairing multiple faulty uncore components without any additional area impact, but with graceful degradation of application performance. 3. Our techniques achieve high self-repair coverage of 97.5% in the presence of a single fault. Our self-repair techniques also enable flexible tradeoffs between self-repair coverage and area costs. For example, 75% self-repair coverage can be achieved with 3.2% post-layout chip-level area impact.

Full Text