Tag Replication and Status Bits Encoding for Enhancing Cache Metadata Reliability
On-chip caches occupy a significant portion of modern processors, making them increasingly vulnerable to soft errors as technology scales. While data arrays often receive robust protection via error-correcting codes (ECCs), metadata elements such as tag fields and status bits remain inadequately protected despite their critical role in ensuring memory integrity. This paper proposes two lightweight techniques to enhance cache metadata reliability: (1) a robust three-bit encoding scheme for status bits that tolerates single-bit flips without data corruption and (2) a selective tag replication scheme for dirty cache blocks, enabling reliable recovery from single-bit errors in tag arrays. Simulation results on SPEC 2006 benchmarks show that our approach recovers 97.3% of injected soft errors in cache metadata, outperforming conventional SECDED protection (93.8%) with significantly lower overhead. The proposed design incurs only 0.50% area and 1.67% dynamic power overhead on an L1 data cache. Moreover, the proposed techniques can be extended to support common cache coherence protocols in multicore systems with minimal modification.
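The single-bit-flip-tolerant status encoding the summary describes can be illustrated with a triple-modular-redundancy sketch: each one-bit status flag is stored in triplicate and read back through a majority vote. This is a minimal illustration of the idea, not the paper's exact encoding.

```python
def encode_status(bit: int) -> int:
    """Replicate a one-bit status flag into three stored bits (TMR-style)."""
    return 0b111 if bit else 0b000

def decode_status(word: int) -> int:
    """Majority-vote the three stored copies; tolerates any single-bit flip."""
    ones = bin(word & 0b111).count("1")
    return 1 if ones >= 2 else 0

# A single upset in any bit position leaves the decoded value intact.
stored = encode_status(1)        # 0b111
corrupted = stored ^ 0b010       # flip the middle copy
assert decode_status(corrupted) == 1
```

Because the decode is a pure majority function, any one flipped copy is outvoted by the two intact ones; two simultaneous flips in the same flag would still corrupt the value.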
- Research Article
26
- 10.1109/tvlsi.2011.2111469
- Apr 1, 2012
- IEEE Transactions on Very Large Scale Integration (VLSI) Systems
Protecting on-chip cache memories against soft errors has become an increasing challenge in designing new-generation reliable microprocessors. Previous efforts have mainly focused on improving the reliability of the cache data arrays. Due to its crucial importance to the correctness of cache accesses, the tag array also demands high reliability against soft errors. Exploiting the address locality of memory accesses, we propose to duplicate the most recently accessed tag entries in a small tag replication buffer (TRB) to protect the information integrity of the tag array in the data cache. Experimental results show that our proposed TRB scheme achieves a high 90% access-with-replica (AWR) rate with low performance (~0%), energy (16.3%), and area (19.9%) overheads. We also conduct a detailed design-space exploration for the TRB design and propose a selective TRB scheme that achieves a higher AWR rate (97.4%) for the dirty cache lines with negligible overheads. To provide a comprehensive evaluation of tag-array reliability, we further conduct an architectural vulnerability factor (AVF) analysis for the tag array in the data cache and propose a refined metric, detected-without-replica AVF (DOR-AVF), which combines the AVF and AWR analyses. Based on our DOR-AVF analysis, a selective TRB scheme with early write-back (S-TRB-EWB) is proposed, which achieves a zero DOR-AVF and a 100% AWR rate at negligible performance overhead. Results from statistical fault/error injection experiments also confirm the effectiveness of our TRB schemes and the achieved reliability of the cache tag array, which recovers 100% of detected errors.
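A rough behavioral model of the tag replication buffer (TRB) idea: a small LRU structure holds replicas of recently accessed tag entries, and a tag read that hits in the buffer can be repaired from its replica. The buffer size, keying, and replacement policy below are illustrative assumptions, not the paper's design.

```python
from collections import OrderedDict

class TagReplicationBuffer:
    """Keeps replicas of recently accessed tag entries; a tag read that
    hits in the buffer can be cross-checked against its replica."""
    def __init__(self, capacity: int = 8):
        self.capacity = capacity
        self.replicas = OrderedDict()   # (set_index, way) -> tag replica

    def record_access(self, set_index: int, way: int, tag: int) -> None:
        key = (set_index, way)
        self.replicas[key] = tag
        self.replicas.move_to_end(key)          # LRU update
        if len(self.replicas) > self.capacity:
            self.replicas.popitem(last=False)   # evict the oldest replica

    def check(self, set_index: int, way: int, stored_tag: int):
        """Return (has_replica, tag_to_use). With a replica present, a
        bit flip in the stored tag can be repaired from the copy."""
        replica = self.replicas.get((set_index, way))
        if replica is None:
            return False, stored_tag    # access-without-replica
        return True, replica            # trust the replicated copy

trb = TagReplicationBuffer(capacity=2)
trb.record_access(0, 0, 0x3A)
hit, tag = trb.check(0, 0, 0x3A ^ 0x01)   # stored tag has a flipped bit
assert hit and tag == 0x3A
```

The AWR rate the abstract reports corresponds to the fraction of accesses for which `check` finds a replica; address locality is what keeps that fraction high even with a small buffer.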
- Conference Article
5
- 10.1109/isvlsi.2010.25
- Jul 1, 2010
Protecting on-chip cache memories against soft errors has become an increasing challenge in designing new-generation reliable microprocessors. Previous efforts have mainly focused on improving the reliability of the cache data arrays. Due to its crucial importance to the correctness of cache accesses, the tag array demands high reliability against soft errors even when the data array is fully protected. Exploiting the address locality of memory accesses, we propose to duplicate the most recently accessed tag entries in a small Tag Replication Buffer (TRB) to protect the information integrity of the tag array in the data cache with low performance, energy, and area overheads. A Selective-TRB scheme is further proposed to protect only the tag entries of dirty cache lines. The experimental results show that the Selective-TRB scheme achieves a higher access-with-replica (AWR) rate of 97.4% for dirty-cache-line tags. To provide a comprehensive evaluation of tag-array reliability, we also conduct an architectural vulnerability factor (AVF) analysis for the tag array and propose a refined metric, detected-without-replica AVF (DOR-AVF), which combines the AVF and AWR analyses. Based on our DOR-AVF analysis, a TRB scheme with early write-back (EWB) is proposed, which achieves a zero DOR-AVF at negligible performance overhead.
- Conference Article
6
- 10.23919/date.2018.8467758
- Mar 1, 2018
Soft errors in on-chip caches are a major cause of processor failures. When the cache is partitioned into data and tag arrays, recent reports show that the vulnerability of the latter is as high as or even higher than that of the former. Although error-correcting codes (ECCs) are widely used to protect the data array, their overheads are not affordable in the tag array, whose protection is conventionally limited to a parity code. In this paper, we propose the Similarity-Managed Robust Tag (SMARTag) technique to provide error correction capability in parity-protected tags. SMARTag exploits the inherent similarity between the upper parts of the tags in a cache set to share these parts between addresses and ECCs. Using SMARTag, the cache access time is intact since the ECC part is bypassed in normal cache operation, and no extra memory is required since the ECCs are stored in the available tag space. The simulation results show that SMARTag is capable of correcting more than 98% of errors in the tag array, on average, and its energy consumption, area, and performance overheads are each less than 0.2%.
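The similarity SMARTag exploits can be made concrete with a small sketch that counts how many leading tag bits are identical across a set; those shared bits are the storage a scheme of this kind could reuse for ECC bits. This is an illustration of the observation only, not the paper's packing format.

```python
def shareable_upper_bits(tags: list[int], tag_width: int) -> int:
    """Count how many leading tag bits are identical across a cache set
    (the space a SMARTag-style scheme could reuse for ECC bits)."""
    shared = 0
    for bitpos in range(tag_width - 1, -1, -1):   # walk from the MSB down
        bits = {(t >> bitpos) & 1 for t in tags}
        if len(bits) == 1:                         # bit identical in every tag
            shared += 1
        else:
            break
    return shared

# Tags drawn from nearby addresses share most of their upper bits.
tags = [0b1101_1001, 0b1101_0010, 0b1101_0111]
assert shareable_upper_bits(tags, 8) == 4
```

Spatial locality makes the upper address bits of lines co-resident in a set tend to agree, which is why the freed space is usually large enough to hold the extra check bits.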
- Research Article
- 10.1016/s0141-9331(01)00137-5
- Jan 1, 2002
- Microprocessors and Microsystems
Decoupling of data and tag arrays for on-chip caches
- Research Article
13
- 10.1109/jssc.2004.837992
- Jan 1, 2005
- IEEE Journal of Solid-State Circuits
This paper presents the architecture and circuit techniques for a reconfigurable SRAM building block. The memory block can emulate many memory structures including a cache tag or data array, a FIFO, and a simple scratchpad memory. We choose the block size based on the optimal partition size for large SRAM structures, use self-resetting and replica timing circuit techniques, and add flexible status bits and a few hardwired functional blocks to support reconfigurability. A 16-kb prototype design fabricated in a 0.18 µm technology cycles at 1.1 GHz at the nominal 1.8 V supply and room temperature. The additional logic used for reconfigurability consumes 32% of the area and 23% of the power of the memory block. We project that these overhead percentages would fall below 15% and 10%, respectively, for a 64-kb memory.
- Conference Article
9
- 10.1109/mcsoc2018.2018.00035
- Sep 1, 2018
Soft errors are expected to increase with the shrinking of feature sizes due to low operating voltages and high circuit density. However, the soft error rate per bit is expected to decrease with technology scaling. Given tight requirements on area and energy consumption, a low-complexity, high-rate error correction code (ECC) is necessary to handle soft errors in on-chip communication. In this work, we use Parity Product Code (PPC) and propose several supporting mechanisms to detect and correct soft errors. First, PPC works as a parity check to detect a single event upset (SEU) inside each flit. Then, to reduce the number of retransmissions needed, a Razor flip-flop with parity check (RFF-w-P) is proposed to work with PPC. Since PPC can act as forward error correction (FEC), we also present selective retransmission by bit index using a transposable FIFO. The proposed mechanism therefore not only guarantees single error detection/correction but also provides correction of two or more errors as FEC. The proposed design also reduces the area cost of the FIFO compared to traditional coding methods and adapts to multiple error rates.
- Conference Article
- 10.1109/ats.2011.71
- Nov 1, 2011
Memories constitute increasing proportions of most digital systems, and memory-intensive chips lead the migration to new nanometer fabrication processes. With each process generation, process variations and defect rates are increasing; at the same time, cells are becoming more susceptible to soft errors as technology shrinks. SRAMs will thus require increasing numbers of spares and stronger error correcting codes (ECCs), incurring higher area overheads and access-time penalties. Our overall objective is to develop new systematic approaches for designing defect-tolerant 6T-SRAMs optimized in terms of yield-per-area under high defect rates and high soft error rates, for given soft-error resilience and access-time requirements. In this paper, we analyze the key tradeoffs associated with using different numbers of spares and ECCs with different strengths. In addition to considering the usual role of each -- i.e., spares to combat defects and ECC to combat soft errors -- we also consider the ability of ECC to combat those defects that cannot be masked using available spares. We develop a new model that captures the benefits -- yield and resilience to soft errors -- of spares and ECC in an integrated manner. We also characterize the area and access-time overheads of the spares and the ECC scheme. We then integrate the above into a framework to design 6T-SRAMs that optimizes yield-per-area. We demonstrate that the proposed approach provides dramatic improvements in yield and yield-per-area without compromising resilience to soft errors.
- Book Chapter
3
- 10.1007/978-981-15-0829-5_32
- Dec 17, 2019
The demand for higher-capacity, smaller, and more reliable memory is increasing with the continuous scaling of semiconductor technology. However, memory reliability is greatly influenced by soft errors caused by radiation effects. These soft errors corrupt data stored in one or multiple memory cells. Error Correction Codes (ECCs) are frequently employed to mitigate the effects of soft errors in memories. Single Error Correction-Double Error Detection-Double Adjacent Error Correction (SEC-DED-DAEC) codes are among the well-known ECC schemes employed when Multiple Bit Upsets (MBUs) occur in memory. In this paper, a new SEC-DED-DAEC code is proposed for memory applications. The proposed codecs have been designed and synthesized on an FPGA platform for some common word lengths frequently used in memory applications. The performance of the proposed codecs has been compared with other related work. The proposed codecs require less area than existing codecs.
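The SEC component underlying codes of this family can be illustrated with the classic Hamming(7,4) code; this is a textbook sketch, not the paper's SEC-DED-DAEC construction.

```python
def hamming74_encode(d: list[int]) -> list[int]:
    """Encode 4 data bits as [p1, p2, d1, p3, d2, d3, d4] (even parity)."""
    d1, d2, d3, d4 = d
    p1 = d1 ^ d2 ^ d4    # covers positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4    # covers positions 2, 3, 6, 7
    p3 = d2 ^ d3 ^ d4    # covers positions 4, 5, 6, 7
    return [p1, p2, d1, p3, d2, d3, d4]

def hamming74_correct(c: list[int]) -> list[int]:
    """Recompute the parity checks; the syndrome gives the 1-based
    position of a single flipped bit (0 means no error)."""
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s3 = c[3] ^ c[4] ^ c[5] ^ c[6]
    syndrome = s1 + 2 * s2 + 4 * s3
    if syndrome:
        c = c[:]
        c[syndrome - 1] ^= 1             # flip the faulty bit back
    return [c[2], c[4], c[5], c[6]]      # extract the data bits

data = [1, 0, 1, 1]
code = hamming74_encode(data)
code[5] ^= 1                             # inject a single-bit error
assert hamming74_correct(code) == data
```

DAEC codes extend this idea so that the syndromes of double-adjacent error patterns are also unique and hence correctable, at the cost of a more carefully chosen parity-check matrix.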
- Conference Article
3
- 10.1109/edcc.2015.30
- Sep 1, 2015
Error correction codes (ECCs) are commonly used in computer systems to protect information from errors. For example, single error correction (SEC) codes are frequently used for memory protection. Due to continuous technology scaling, soft errors on registers have become a major concern, and ECCs are required to protect them as well. Nevertheless, using an ECC increases delay, area, and power consumption. ECCs have therefore traditionally been designed to minimize the number of redundant bits added. This matters in memories, as these bits are added to every word in the whole memory. However, it is less important in registers, where minimizing the encoding and decoding delay can be more valuable. This paper proposes a method to develop codes with 1-gate-delay encoders and 4-gate-delay decoders, independently of the word length. These codes have been designed to correct single errors only in the data bits, to reduce the overhead.
- Research Article
5
- 10.1007/s11390-011-1150-7
- May 1, 2011
- Journal of Computer Science and Technology
With continuous technology scaling, on-chip structures are becoming more and more susceptible to soft errors. Architectural vulnerability factor (AVF) has been introduced to quantify the architectural vulnerability of on-chip structures to soft errors. Recent studies have found that designing soft error protection techniques with the awareness of AVF is greatly helpful to achieve a tradeoff between performance and reliability for several structures (i.e., issue queue, reorder buffer). Cache is one of the most susceptible components to soft errors and is commonly protected with error correcting codes (ECC). However, protecting caches closer to the processor (i.e., L1 data cache (L1D)) using ECC could result in high overhead. Protecting caches without accurate knowledge of the vulnerability characteristics may lead to over-protection. Therefore, designing AVF-aware ECC is attractive for designers to balance among performance, power and reliability for cache, especially at early design stage. In this paper, we improve the methodology of cache AVF computation and develop a new AVF estimation framework, soft error reliability analysis based on SimpleScalar. Then we characterize dynamic vulnerability behavior of L1D and detect the correlations between L1D AVF and various performance metrics. We propose to employ Bayesian additive regression trees to accurately model the variation of L1D AVF and to quantitatively explain the important effects of several key performance metrics on L1D AVF. Then, we employ bump hunting technique to reduce the complexity of L1D AVF prediction and extract some simple selecting rules based on several key performance metrics, thus enabling a simplified and fast estimation of L1D AVF. Based on the simplified and fast estimation of L1D AVF, intervals of high L1D AVF can be identified online, enabling us to develop the AVF-aware ECC technique to reduce the overhead of ECC. 
Experimental results show that, compared with a traditional ECC technique that provides complete ECC protection throughout the entire lifetime of a program, the AVF-aware ECC technique reduces L1D access latency by 35% and saves 14% in power consumption on average across SPEC2K benchmarks.
- Research Article
50
- 10.1109/tcad.2012.2226585
- Mar 1, 2013
- IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems
Error correction codes (ECCs) have been used for decades to protect memories from soft errors. Single error correction (SEC) codes that can correct a 1-bit error per word are a common option for memory protection. In some cases, SEC codes are extended to also provide double error detection and are known as SEC-DED codes. As technology scales, soft errors on registers have also become a concern, and therefore SEC codes are used to protect registers. The use of an ECC impacts the circuit design in terms of both delay and area. Traditional SEC or SEC-DED codes developed for memories have focused on minimizing the number of redundant bits added by the code. This is important in a memory, as those bits are added to each word in the memory. However, for registers used in circuits, minimizing the delay or area introduced by the ECC can be more important. In this paper, a method to construct low-delay SEC or SEC-DED codes that correct errors only on the data bits is proposed. The method is evaluated for several data block sizes, showing that the new codes offer significant delay reductions when compared with traditional SEC or SEC-DED codes. The results for the area of the encoder and decoder also show substantial savings compared to existing codes.
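The SEC-to-SEC-DED extension mentioned in the abstract is conventionally obtained by appending one overall parity bit to a SEC code; the decoder then classifies errors from the syndrome and the overall parity check. A sketch of that standard decision table:

```python
def classify(syndrome: int, overall_parity_fails: bool) -> str:
    """Standard extended-Hamming decision table for SEC-DED decoding.
    An odd number of flips trips the overall parity bit; an even,
    nonzero number of flips leaves it clean but the syndrome nonzero."""
    if syndrome == 0 and not overall_parity_fails:
        return "no error"
    if overall_parity_fails:
        return "single error (correctable)"
    return "double error (detected, uncorrectable)"

assert classify(0, False) == "no error"
assert classify(5, True) == "single error (correctable)"
assert classify(5, False) == "double error (detected, uncorrectable)"
```

Note that a zero syndrome with a failing overall parity check still means a correctable single error: the flip landed on the overall parity bit itself.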
- Research Article
16
- 10.1109/tdmr.2014.2332616
- Sep 1, 2014
- IEEE Transactions on Device and Materials Reliability
Cache memories are very relevant components in modern processors, and therefore their protection against soft errors is important to ensure reliability. One important element in caches is the tag fields, which are critical to keep data integrity and achieve a high hit ratio. To protect them against soft errors, a parity bit or a single error correction (SEC) code is commonly used. In that case, on each cache access, the parity bit is checked or the SEC code decoded on each cache way to detect and correct errors. In this paper, FastTag, a novel approach to protect cache tags, is presented and evaluated. The proposed scheme computes the parity or SEC bits on the incoming address and compares the result with the tag and parity bits stored in each cache way. This avoids parity recomputation or SEC decoding, thus reducing circuit complexity. The evaluation results corroborate this, showing that FastTag requires much lower area, delay, and power overheads than the conventional techniques currently in use.
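The core FastTag observation, that parity computed on the incoming address can be compared directly against the stored tag plus its parity bit with no decode on the read path, can be sketched as follows (a behavioral illustration with a single parity bit, not the paper's circuit):

```python
def parity(x: int) -> int:
    """Even parity of an integer's bits."""
    return bin(x).count("1") & 1

def fasttag_hit(addr_tag: int, stored_tag: int, stored_parity: int) -> bool:
    """Compare the tag-with-parity derived from the address against the
    stored tag and its parity bit in one equality check: a match implies
    both a tag hit and a clean parity check, with no decoding step."""
    return (addr_tag, parity(addr_tag)) == (stored_tag, stored_parity)

tag = 0x1F3
assert fasttag_hit(tag, tag, parity(tag))               # clean hit
assert not fasttag_hit(tag, tag ^ 0x004, parity(tag))   # flipped tag bit
```

A flipped bit in the stored tag (or its parity bit) simply makes the comparison fail, so the access is handled as a miss or flagged error rather than a false hit, which is the failure mode tag protection exists to prevent.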
- Discussion
7
- 10.1016/j.microrel.2017.12.017
- Jan 4, 2018
- Microelectronics Reliability
Fault tolerant encoders for Single Error Correction and Double Adjacent Error Correction codes
- Research Article
9
- 10.1109/tetc.2019.2953139
- Nov 22, 2019
- IEEE Transactions on Emerging Topics in Computing
As technology scales, radiation-induced soft errors create more complex error patterns in memories, with a single particle corrupting several bits. This poses a challenge to the Error Correction Codes (ECCs) traditionally used to protect memories, which can correct only single-bit errors. During the last decade, a number of codes have been developed to correct the emerging error patterns, focusing initially on double adjacent errors and later on three-bit burst errors. However, as memory cells get smaller and smaller, the error patterns created by radiation will continue to change, and thus new codes will be needed. In addition, the memory layout and the technology used may also make some patterns more likely than others. For example, in some memories, there may be elements that separate blocks of bits in a word, making errors that affect two blocks less likely. Finally, for a given memory, depending on the data stored, some error patterns may be more critical than others. For example, if numbers are stored in the memory, in most cases, errors on the more significant bits have a larger impact. Therefore, for a given memory and application, to achieve optimal protection, we would like to have a code that corrects a given set of patterns. This is not possible today, as there is a limited number of code choices available in terms of correctable error patterns and word lengths. However, most of the codes used to protect memories are linear block codes that have a regular structure and whose design can be automated. In this paper, we propose the automation of error correction code design for memory protection. To that end, we introduce a software tool that, given a word length and the error patterns that need to be corrected, produces a linear block code described by its parity-check matrix along with the bit placement. The benefits of this automated design approach are illustrated with several case studies.
Finally, the tool is made available so that designers can easily produce custom error correction codes for their specific needs.
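The linear block codes such a tool emits are decoded by multiplying the received word by the parity-check matrix H and matching the resulting syndrome to a correctable error pattern. A generic sketch of that syndrome decoding loop (the H used here is just the standard Hamming(7,4) check matrix, chosen as a stand-in, not the tool's output):

```python
# Columns of H are the binary representations of positions 1..7, so the
# syndrome of a single-bit error equals the 1-based error position.
H = [[(col >> row) & 1 for col in range(1, 8)] for row in range(3)]

def syndrome(word: list[int]) -> int:
    """Compute H * word over GF(2), packed into an integer."""
    s = 0
    for row in range(3):
        bit = 0
        for col in range(7):
            bit ^= H[row][col] & word[col]
        s |= bit << row
    return s

def correct(word: list[int]) -> list[int]:
    """Match the syndrome against the single-bit patterns (H's columns)."""
    s = syndrome(word)
    if s:
        word = word[:]
        word[s - 1] ^= 1    # flip the located bit back
    return word

received = [0, 0, 0, 0, 1, 0, 0]   # all-zero codeword with an upset at position 5
assert correct(received) == [0, 0, 0, 0, 0, 0, 0]
```

For a custom set of correctable patterns, the tool's job amounts to choosing H's columns so that every target pattern maps to a distinct nonzero syndrome; the decoding loop itself stays the same.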
- Research Article
2
- 10.25073/2588-1086/vnucsce.218
- Jun 2, 2019
- VNU Journal of Science: Computer Science and Communication Engineering
The soft error rate per bit due to alpha particles in sub-micron technology is expected to decrease as feature sizes shrink. On the other hand, the complexity and density of integrated systems are increasing, which demands efficient soft error protection mechanisms, especially for on-chip communication. A soft error protection method has to satisfy tight requirements on area and energy consumption, so a low-complexity, low-redundancy coding method is necessary. In this work, we propose a method to enhance Parity Product Code (PPC) and provide adaptation methods for this code. First, PPC is extended to forward error correction using transposable retransmissions. Then, to adapt to different error rates, an augmented algorithm for configuring PPC is introduced. The evaluation results show that the proposed mechanism has coding rates similar to a plain parity check's and outperforms the original PPC.
Keywords: Error Correction Code, Fault-Tolerance, Network-on-Chip.
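A parity product code arranges the data bits in a grid with one parity bit per row and one per column; a single flipped bit is then located at the intersection of the failing row check and the failing column check. A minimal sketch, assuming a small square data block (the flit format and FEC extensions in the paper are not modeled):

```python
def encode_ppc(block):
    """Compute even-parity bits per row and per column of a bit matrix."""
    row_par = [sum(r) % 2 for r in block]
    col_par = [sum(c) % 2 for c in zip(*block)]
    return row_par, col_par

def correct_ppc(block, row_par, col_par):
    """Locate a single-bit error at the crossing of the failing checks
    and flip it back; leave the block untouched otherwise."""
    bad_rows = [i for i, r in enumerate(block) if sum(r) % 2 != row_par[i]]
    bad_cols = [j for j, c in enumerate(zip(*block)) if sum(c) % 2 != col_par[j]]
    if len(bad_rows) == 1 and len(bad_cols) == 1:
        block[bad_rows[0]][bad_cols[0]] ^= 1
    return block

data = [[1, 0, 1, 0], [0, 1, 1, 0], [1, 1, 0, 1], [0, 0, 1, 1]]
rp, cp = encode_ppc(data)
data[2][1] ^= 1                     # inject a single-bit upset
assert correct_ppc(data, rp, cp) == [[1, 0, 1, 0], [0, 1, 1, 0],
                                     [1, 1, 0, 1], [0, 0, 1, 1]]
```

With more than one flipped bit, multiple rows or columns fail and the intersection is ambiguous, which is where retransmission-based mechanisms like those in the abstract take over.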