Fault-Tolerant Network-On-Chip

Xiaowei Li, Guihai Yan, Cheng Liu

doi:10.1007/978-981-19-8551-5_4

Abstract

Manycore systems are emerging for tera-scale computation and typically use a Network-on-Chip (NoC) as the communication fabric between cores. Because a single routing node failure can break the connectivity of the entire manycore system, NoC reliability is essential. To improve it, we investigate fault-tolerant design from three angles: NoC architecture, routing, and circuits. At the architecture level, we propose a topology reconfiguration technique that redefines a regular virtual topology on top of the original NoC with randomly distributed faulty nodes. Two new metrics, Distance Factor (DF) and Congestion Factor (CF), allow different virtual topologies to be evaluated efficiently. We also propose a Row Rippling Column Stealing-guided Simulated Annealing algorithm that determines an optimized virtual topology without affecting high-level parallel applications on the manycore system. At the routing level, we propose ZoneDefense routing, which identifies faulty blocks in advance and routes around the faulty routers. Prior fault-tolerant routing algorithms generally disable a set of routers directly or indirectly affected by hardware faults in order to satisfy deadlock-avoidance routing rules; ZoneDefense significantly reduces the number of fault-free routers sacrificed in this way. At the circuit level, we develop a novel salvaging scheme named RevivePath, which allows faulty NoC data paths to remain functional. The basic idea is to insert serial-to-parallel and parallel-to-serial circuits between NoC data path components such as the crossbar, links, and on-chip buffers, so that hardware faults do not easily corrupt these data paths or the routing algorithms. Hence, the salvaging circuits ensure a highly resilient NoC architecture and graceful performance degradation as hardware faults increase.
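
The serialization idea behind RevivePath can be illustrated with a small software sketch. This is not the authors' circuit: the function names `serialize` and `deserialize`, the wire indexing, and the notion of a known list of healthy wires are all assumptions made for illustration. The sketch only shows how a flit can still cross a link with some broken wires by being sent over the surviving wires in several cycles, at the cost of extra latency.

```python
from typing import Dict, List


def serialize(flit_bits: List[int], healthy_wires: List[int]) -> List[Dict[int, int]]:
    """Split a flit into per-cycle transfers that use only healthy wires.

    Each returned dict maps a (hypothetical) wire index to the bit it carries
    in that cycle. Fewer healthy wires means more cycles per flit.
    """
    if not healthy_wires:
        raise ValueError("no healthy wires left on this link")
    width = len(healthy_wires)
    cycles = []
    for start in range(0, len(flit_bits), width):
        chunk = flit_bits[start:start + width]
        cycles.append(dict(zip(healthy_wires, chunk)))
    return cycles


def deserialize(cycles: List[Dict[int, int]], healthy_wires: List[int]) -> List[int]:
    """Reassemble the flit on the receiving side, reading wires in the same order."""
    bits = []
    for cycle in cycles:
        for wire in healthy_wires:
            if wire in cycle:  # the last cycle may use only some of the wires
                bits.append(cycle[wire])
    return bits


flit = [1, 0, 1, 1, 0, 0, 1, 0]          # an 8-bit flit
all_wires = list(range(8))
degraded = [0, 1, 3, 4, 6, 7]            # assume wires 2 and 5 are faulty

fault_free_cycles = serialize(flit, all_wires)
degraded_cycles = serialize(flit, degraded)
recovered = deserialize(degraded_cycles, degraded)
```

In this sketch a fault-free 8-wire link moves the flit in one cycle, while the link with two faulty wires needs two cycles and still delivers the same bits, which mirrors the graceful performance degradation described in the abstract.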

Full Text