Thread Relocation: A Runtime Architecture for Tolerating Hard Errors in Chip Multiprocessors

Omer Khan, Sandip Kundu

doi:10.1109/tc.2009.76

Abstract

As the semiconductor industry continues its relentless push for nano-CMOS technologies, device reliability and occurrence of hard errors have emerged as a dominant concern in multicores. Although regular memory structures are protected against hard errors using error correcting codes or spare rows and columns, many of the structures within the cores are left unprotected. Even if the location of hard errors is known a priori, disabling faulty cores results in a substantial performance loss. Several proposed techniques use microarchitectural redundancy to allow defective cores to continue operation. These techniques are attractive, but limited due to either added cost of additional redundancy that offers no benefits to an error-free core, or limited coverage, due to the natural redundancy offered by the microarchitecture. We propose to exploit the intercore redundancy in chip multiprocessors for hard-error tolerance. Our scheme combines hardware reconfiguration to ensure reduced functionality of cores, and a runtime layer of software (microvisor) to manage mapping of threads to cores. Microvisor observes the changing phase behavior of threads and initiates thread relocation to match the computational demands of threads to the capabilities of cores. Our results show that in the presence of degraded cores, microvisor mitigates performance losses by an average of two percent.
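
The matching of thread demands to core capabilities described in the abstract can be illustrated with a toy model. This is only a sketch under stated assumptions, not the paper's algorithm: the abstract does not specify how the microvisor scores or selects a mapping, so the unit names (`int`, `fp`, `mem`), the `penalty` function, and the exhaustive search in `best_mapping` are all hypothetical choices made for illustration.

```python
from itertools import permutations

# Hypothetical model (not from the paper): each core lists the unit types
# that still work, and each thread's current phase is described by the
# share of its work that needs each unit type.

def penalty(demand, working_units):
    """Share of the phase's work that needs a unit the core has lost."""
    return sum(share for unit, share in demand.items()
               if unit not in working_units)

def best_mapping(thread_demands, core_units):
    """Try every thread-to-core assignment and keep the lowest total penalty."""
    threads = list(thread_demands)
    best = None
    for perm in permutations(list(core_units), len(threads)):
        cost = sum(penalty(thread_demands[t], core_units[c])
                   for t, c in zip(threads, perm))
        if best is None or cost < best[0]:
            best = (cost, dict(zip(threads, perm)))
    return best

cores = {
    "core0": {"int", "fp", "mem"},  # fully functional core
    "core1": {"int", "mem"},        # degraded core: FP unit disabled
}
phase = {
    "A": {"int": 0.2, "fp": 0.7, "mem": 0.1},  # FP-heavy phase
    "B": {"int": 0.8, "fp": 0.0, "mem": 0.2},  # integer-heavy phase
}

cost, mapping = best_mapping(phase, cores)
print(mapping)
```

In this toy setup the FP-heavy thread lands on the fully functional core. If the two threads later swap phase behavior, rerunning `best_mapping` yields the opposite assignment, which corresponds to a thread relocation. Exhaustive search is only practical for a handful of cores; a real runtime layer would need a cheaper policy.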

Journal: IEEE Transactions on Computers
Publication Date: May 1, 2010
Citations: 48

Similar Papers

Hardware/Software Codesign Architecture for Online Testing in Chip Multiprocessors
O. Khan ... S. Kundu
IEEE Transactions on Dependable and Secure Computing | VOL. 8
01 Sep 2011

Architectural core salvaging in a multi-core processor for hard-error tolerance
Michael D Powell ... Shantanu Gupta
20 Jun 2009

Architectural core salvaging in a multi-core processor for hard-error tolerance
Michael D Powell ... Shubhendu S Mukherjee
ACM SIGARCH Computer Architecture News | VOL. 37
15 Jun 2009

Tolerating more hard errors in MLC PCMs using compression
Majid Jalili ... Hamid Sarbazi-Azad
01 Oct 2016
