In addition to wear-out failures and external particle strikes, the voltage scaling used to reduce power consumption in multiprocessors makes caches more vulnerable to cell failures. Because voltage reduction is indispensable for prolonging the battery life of handheld devices, fault tolerance techniques are essential to ensure fault-free execution at near-threshold voltage. Several fault tolerance techniques have been proposed, and remapping-based techniques have been found effective in single-core systems. This work proposes an analytical model for remapping-based fault tolerance techniques to evaluate their effectiveness in multicore systems. Two metrics, Expected Miss Ratio in Multicore (EMRMC) and Expected Latency Ratio in Multicore (ELRMC), are introduced to characterize the behavior of remapping-based techniques. EMRMC and ELRMC are defined as functions of the probability of cell failure (Pfail), block size, number of cores, and number of threads. The system is simulated in Multi2Sim 5.0, a multicore CPU-GPU simulator. The values of the metrics are analysed for different configuration parameters (probability of cell failure, number of cores, number of blocks, block size, and number of threads) in order to frame system configuration guidelines that deliver better performance under remapping-based fault tolerance. It is observed that EMRMC is proportional to Pfail and block size, inversely proportional to the number of cores and threads, and unaffected by the number of blocks. In contrast, ELRMC is inversely proportional to Pfail and block size and proportional to the number of cores and threads. It is also observed that the ELRMC is independent of the number of cores and blocks. EMRMC is best minimized for Pfail ≤ 1e-4, block size ≤ 64 bytes, number of cores ≥ 4, and number of threads ≥ 2. On the other hand, ELRMC is best observed for Pfail ≤ 1e-4, block size ≥ 64 bytes, number of cores ≥ 4, and number of threads 2.