A Design of Autonomous Error-Tolerant Architectures for Massively Parallel Computing

Lizheng Liu, Yi Liu, Lirong Zheng, Yuxiang Huan, Zhuo Zou, Yi Jin, Ning Ma

doi:10.1109/tvlsi.2018.2846298

Abstract

Massively parallel computing systems, composed of many processors connected on chip, are becoming increasingly complex and unreliable. This paper presents an error-tolerant design based on the autonomous error-tolerant (AET) architecture, which aims to provide a self-repairing capability. A nearby error-sensing mechanism is designed to discover faults, and an active evolution scheme is studied to handle unrecoverable errors. A circuit backup switching mechanism is proposed to bypass failed nodes. A board-level prototype is implemented using dual-core embedded processors. The analysis shows that the error-tolerant capability of the proposed architecture is better than that of a conventional multimodular redundant system when the failure rate of a single core is less than 0.7. The error-tolerant capability is verified in an AET test system consisting of 16 processors. The results show that the relative variation of the overall performance of the AET system does not change with the high reliability requirements of the system. Experimental comparison shows that, when the AET architecture and the triple modular redundancy (TMR) method achieve essentially the same reliability, AET has lower power consumption for both logical-level and physical-level error tolerance.
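For readers unfamiliar with the multimodular redundancy baseline that the abstract compares against, the following is a minimal sketch of the textbook majority-voting reliability model. It assumes independent module failures and a perfect voter; the function name `nmr_reliability` is hypothetical. It does not reproduce the paper's AET analysis, and it is not meant to explain the 0.7 threshold reported above.

```python
from math import comb


def nmr_reliability(r: float, n: int = 3) -> float:
    """Reliability of an n-modular redundant system with majority voting.

    Assumptions (not taken from the paper): modules fail independently,
    each works with probability r, and the voter never fails.
    """
    if n % 2 == 0:
        raise ValueError("n must be odd for majority voting")
    majority = n // 2 + 1
    # Sum the probabilities that at least a majority of modules work.
    return sum(
        comb(n, k) * r**k * (1 - r) ** (n - k)
        for k in range(majority, n + 1)
    )


if __name__ == "__main__":
    for failure_rate in (0.1, 0.3, 0.5, 0.7):
        r = 1.0 - failure_rate
        print(
            f"failure rate {failure_rate:.1f}: "
            f"single module {r:.3f}, TMR {nmr_reliability(r):.3f}"
        )
```

For n = 3 the sum reduces to 3r^2 - 2r^3. Under this simple model, triple redundancy improves on a single module only while the module reliability r stays above 0.5, which is one reason high per-core failure rates are a demanding regime for voting-based schemes.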

Journal: IEEE Transactions on Very Large Scale Integration (VLSI) Systems
Publication Date: Jan 1, 2018
Citations: 25

