Flash storage systems offer significant benefits over magnetic hard drives, such as low input/output (I/O) latency and high throughput. However, NAND flash-based solid-state drives (SSDs) are inherently prone to soft errors from various sources, e.g., wear-out, program and read disturbance, and hot-electron injection. To address this issue, flash devices employ error-correction codes (ECC) to detect and correct soft errors. ECC incurs non-trivial overhead in terms of flash area, performance, and energy consumption. In this work, we evaluate the feasibility of reducing the need for strong ECC while maintaining correct application execution. Specifically, we explore data-level error tolerance in various data-centric applications, and study the system implications for designing a low-cost yet high-performance flash storage system, SoftFlash. We explore three key aspects of enabling SoftFlash. First, we design an error modeling framework that can be used at runtime to monitor and estimate the error rates of real-world flash devices. Our experiments show that the error rate of SSDs can be modeled with reasonable accuracy (13%) using parameters accessible from the operating system. Second, we carry out extensive fault-injection experiments on a wide range of applications, including multimedia, scientific computation, and cloud computing, to understand the requirements and characteristics of data-level error tolerance. We find that the data from these applications show high error resiliency and can produce acceptable results even at high error rates. Third, we conduct a case study to show the benefits of leveraging data-level error tolerance in flash devices. Our results show that, for many data-centric applications, the proposed SoftFlash system can achieve acceptable results (or better in certain cases), with more than a 40% performance improvement and one third of the energy consumption.
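As a rough illustration of what a data-level fault-injection experiment can look like, the following minimal Python sketch flips each bit of an application's data independently with a given raw bit error rate. It is not the framework used in this work: the function names `inject_bit_errors` and `count_bit_differences`, the independent uniform bit-flip model, and the fixed seed are all assumptions made for the example.

```python
import random
from typing import Tuple


def inject_bit_errors(data: bytes, bit_error_rate: float, seed: int = 0) -> Tuple[bytes, int]:
    """Flip each bit of `data` independently with probability `bit_error_rate`.

    Hypothetical helper: assumes a uniform, independent error model, which is
    a simplification of how real flash devices fail.
    Returns the corrupted data and the number of bits flipped.
    """
    rng = random.Random(seed)  # seeded so that an experiment can be repeated
    corrupted = bytearray(data)
    flipped = 0
    for byte_index in range(len(corrupted)):
        for bit in range(8):
            if rng.random() < bit_error_rate:
                corrupted[byte_index] ^= 1 << bit  # toggle one bit in place
                flipped += 1
    return bytes(corrupted), flipped


def count_bit_differences(a: bytes, b: bytes) -> int:
    """Count the bits that differ between two equal-length byte strings."""
    return sum(bin(x ^ y).count("1") for x, y in zip(a, b))


if __name__ == "__main__":
    original = bytes(range(256)) * 16  # stand-in for application data
    corrupted, flipped = inject_bit_errors(original, bit_error_rate=1e-3, seed=42)
    print("bits flipped:", flipped)
    print("bits differing:", count_bit_differences(original, corrupted))
```

In an experiment of this kind, the corrupted data would be passed to the application in place of the original, and the output quality compared against an error-free run at several error rates.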