Abstract

Computer architecture is a wide, active research field that spans all aspects of computer systems design. Dependability, the trustworthiness of a computing system that allows reliance to be justifiably placed on the service it delivers, has always been a key aspect of computer architecture and has been extensively investigated since the first days of computing. Traditionally, dependable computer architectures have been used in high-end computing systems or in computing systems for critical applications, where continuous, uninterrupted operation is among the most important requirements and systems are characterized by high reliability, availability, and maintainability (well-known measures for quantifying dependability). The time has now come for the wealth of concepts and methodologies in dependable (also known as fault-tolerant) computer architecture, proposed over the last several decades, to be adopted and extended in mainstream, general-purpose computing systems. Dependable operation of computing systems is a mandatory requirement in virtually all application fields (at lower or higher cost), due to the increasing reliance of everyday human activities on computers and microprocessor-based systems in general. Unfortunately, this ubiquitous computing revolution comes hand in hand with hard-to-solve technological issues that are closely related to the dependable operation of a computing system. Integrated circuits are implemented today in miniaturized and inherently unreliable technologies that render circuits more vulnerable both to temporary disturbances leading to transient (or soft) errors and to permanent (or hard) errors. Soft errors in silicon-based circuits are caused by alpha particles emitted by radioactive decay in integrated circuit packaging or by cosmic rays that create high-energy neutrons and protons.
Hard errors, on the other hand, appear either because of manufacturing defects that escape high-volume manufacturing testing or because of material aging and wearout mechanisms during the system’s life cycle that are exacerbated by the high clock frequencies of modern circuits. Computing systems are becoming steadily more complex (in particular with the recent turn toward multicore processors and high-performance memory systems), while at the same time strict time-to-market constraints demand extremely short design, verification, and validation intervals. The net outcome of all these factors is that dependable operation of computing systems in the field should be a first-level design consideration in all application domains, obviously at different cost points. This special section of IEEE Transactions on Computers focuses on architectural techniques that enhance the dependability of different components of a computing system. A total of 27 high-quality manuscripts were submitted to the special section by academic and industrial research groups worldwide, and more than 140 reviews from a strong group of expert reviewers were required to make the final decisions. The special section published in the current issue of IEEE Transactions on Computers includes seven of these papers, roughly one fourth of those submitted. The seven papers of the special section on Dependable Computer Architecture cover a wide spectrum of subsystems of mainstream architectures: processors (in particular CMPs), cache memories, and flash and disk storage; a paper on probabilistic (stochastic) architectures is also included. Methodologies are proposed for the effective handling of both hard and soft errors, and the papers comprehensively discuss all major tradeoffs between dependability and other design aspects such as performance, cost, yield, and energy/power.
The first paper, entitled “StageNet: A Reconfigurable Fabric for Constructing Dependable CMPs”, by Shantanu Gupta, Shuguang Feng, Amin Ansari, and Scott Mahlke from the University of Michigan, Ann Arbor, opens the special section by addressing the tolerance of progressively higher defect densities due to aging in massively parallel chip multiprocessors (CMPs). The authors present StageNet, a reconfigurable CMP architecture designed primarily to provide fault tolerance and to extend the lifetime of the system, which degrades gracefully as the number of permanent faults increases. The subsequent three papers focus on different aspects of cache memories with respect to dependability. The second paper of the special section, entitled “Reliability-Driven ECC Allocation for Multiple Bit Error Resilience in Processor Cache”, by Somnath Paul, Fang Cai, Xinmiao Zhang, and Swarup Bhunia from Case Western Reserve University, discusses a non-uniform, variable ECC allocation scheme that effectively tolerates multiple bit errors in cache memories. The method uses post-fabrication characterization information to allocate ECC differently depending on the relative vulnerability of cache memory blocks.

IEEE TRANSACTIONS ON COMPUTERS, VOL. 60, NO. 1, JANUARY 2011 3
