Networks-on-chip (NoCs) have been increasingly adopted in recent years due to the extensive integration of many components in modern multicore processors and system-on-chip designs. At the same time, transistor reliability is becoming a major concern due to the continuous scaling of silicon. As the sole medium of on-chip communication, it is critical for a NoC to be able to tolerate many permanent transistor failures. In this paper, we propose uDIREC, a unified framework for permanent fault diagnosis and subsequent reconfiguration in NoCs, which provides graceful performance degradation with an increasing number of faults. Upon in-field transistor failures, uDIREC leverages a fine-resolution diagnosis mechanism to disable faulty components very sparingly. At its core, uDIREC employs MOUNT, a novel routing algorithm to find reliable and deadlock-free routes that utilize all the still-functional links in the NoC. We implement uDIREC's reconfiguration as a truly-distributed hardware solution, still keeping the area overhead at a minimum. We also propose a software-implemented reconfiguration that provides greater integration with our software-based diagnosis scheme, at the cost of distributed nature of implementation. Regardless of the adopted implementation scheme, uDIREC places no restriction on topology, router architecture and number and location of faults. Experimental results show that uDIREC, implemented in a 64-node NoC, drops 3 <inline-formula> <tex-math notation="LaTeX">$\times$</tex-math></inline-formula> fewer nodes and provides greater than 25 percent throughput improvement (beyond 15 faults) when compared to other state-of-the-art fault-tolerance solutions. uDIREC's improvement over prior-art grows further with more faults, making it a effective NoC reliability solution for a wide range of fault rates.
Read full abstract