Nonvolatile memory express (NVMe) is a high-performance, scalable PCI express (PCIe)-based interface through which host software communicates with nonvolatile memories (NVMs), including NAND Flash and storage class memories (SCMs). NVMe solid-state drives (SSDs) have been deployed in cloud platforms and data centers for a variety of I/O-intensive applications because they outperform SATA/SAS SSDs. For design flexibility, Flash-based NVMe SSDs typically use firmware-based NVMe controllers, but these may consume a significant share of processor resources and power to achieve high performance. Moreover, the firmware component can become a critical performance bottleneck for SCMs, which are an order of magnitude faster than Flash. To address these challenges, hardware-accelerated NVMe controllers have emerged in both industry and academia. The commercial hardware controllers are confidential, whereas current academic studies still leave considerable room for architectural innovation. In this article, we propose an open-source, ultralow-latency, high-throughput NVMe controller with a highly parallel, pipelined, and scalable architecture that accommodates one admin controller and multiple fully hardware-automated I/O controllers. We perform extensive empirical performance evaluations with respect to NVMe I/O size, queue depth, queue number, read-to-write ratio, and access pattern. The maximum read/write bandwidth reaches 7.0 GB/s, accounting for 89% of the PCIe bandwidth. The 4-KB read/write throughput reaches 1.7 million I/O operations per second (MIOPS), while the average latency is only 2.4 <inline-formula> <tex-math notation="LaTeX">$\mu \text{s}$ </tex-math></inline-formula>/3.2 <inline-formula> <tex-math notation="LaTeX">$\mu \text{s}$ </tex-math></inline-formula>.
Compared with state-of-the-art academic NVMe controllers, the 4-KB read/write bandwidth of our controller is <inline-formula> <tex-math notation="LaTeX">$2.2 \times /2.3\times $ </tex-math></inline-formula> as high, and its latency is <inline-formula> <tex-math notation="LaTeX">$5.1 \times /4.9\times $ </tex-math></inline-formula> lower.
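
The reported figures can be cross-checked with simple arithmetic. The sketch below is illustrative only and is not taken from the article: the function names are hypothetical, and it assumes that 4 KB means 4096 bytes and that 1 GB means 10^9 bytes. It converts the 4-KB I/O rate into a bandwidth and derives the link bandwidth implied by the statement that 7.0 GB/s is 89% of the PCIe bandwidth.

```python
# Illustrative cross-check of the reported numbers.
# Assumptions (not stated in the abstract): 4 KB = 4096 bytes, 1 GB = 1e9 bytes.

KIB = 1024


def iops_to_bandwidth_gbps(iops: float, io_size_bytes: int) -> float:
    """Convert an I/O rate (operations per second) to bandwidth in GB/s."""
    return iops * io_size_bytes / 1e9


def implied_link_bandwidth_gbps(achieved_gbps: float, utilization: float) -> float:
    """Link bandwidth implied by an achieved bandwidth and a utilization fraction."""
    return achieved_gbps / utilization


# 1.7 MIOPS at 4 KB per operation.
bw_4k = iops_to_bandwidth_gbps(1.7e6, 4 * KIB)

# 7.0 GB/s reported as 89% of the PCIe bandwidth.
link = implied_link_bandwidth_gbps(7.0, 0.89)

print(f"4-KB bandwidth at 1.7 MIOPS: {bw_4k:.2f} GB/s")
print(f"Implied PCIe link bandwidth: {link:.2f} GB/s")
```

Under these assumptions, 1.7 MIOPS at 4 KB corresponds to about 6.96 GB/s, which is consistent with the 7.0 GB/s peak, and the implied link bandwidth is about 7.87 GB/s. That value is close to the roughly 7.88 GB/s raw rate of a PCIe Gen3 x8 link, although the abstract does not state the link configuration.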