Resilient Optically Connected Memory Systems Using Dynamic Bit-Steering [Invited]

Daniel Brunina, Keren Bergman, Dawei Liu, Ajay S Garg, Caroline P Lai

doi:10.1364/jocn.4.00b151

https://doi.org/10.1364/jocn.4.00b151

Abstract

Resilience is becoming an increasingly critical performance requirement for future large-scale computing systems. In data center and high-performance computing systems with many thousands of nodes, errors in main memory can be a significant source of failures. As a result, large-scale memory systems must employ advanced error detection and correction techniques to mitigate failures. Memory devices are primarily designed for density, optimizing memory capacity and throughput, rather than resilience. A strict focus on memory performance instead of resilience risks undermining the overall stability of next-generation computers. In this work, we leverage an optically connected memory system to optimize both memory performance and resilience. A multicast-capable optical interconnection network replaces the traditional electronic bus between a processor and its main memory, allowing for a novel error-correction technique based on dynamic bit-steering. As compared to an electronically connected approach, we demonstrate significantly higher memory bandwidths and reduced latencies, in addition to a 700× improvement in resilience.

Full Text