SmartFVM: A Fast, Flexible, and Scalable Hardware-based Virtualization for Commodity Storage Devices

Abstract

A computational storage device, which incorporates a computation unit inside or near its storage unit, is a highly promising technology for maximizing a storage server's performance. However, to apply such computational storage devices and realize their full potential in virtualized environments, server architects must resolve a fundamental challenge: cost-effective virtualization. This challenge breaks down into three questions: (1) how to virtualize two different hardware units (i.e., computation and storage), (2) how to integrate them to construct virtual computational storage devices, and (3) how to provide them to users. Existing methods for computational storage virtualization suffer from low performance and high costs due to the lack of hardware-assisted virtualization support. In this work, we propose SmartFVM-Engine, an FPGA card designed to maximize the performance and cost-effectiveness of computational storage virtualization. SmartFVM-Engine introduces three key ideas to achieve these design goals. First, it achieves high virtualization performance by applying hardware-assisted virtualization to both the computation and storage units. Second, it further improves performance through hardware-assisted resource orchestration for the virtualized units. Third, it achieves high cost-effectiveness by dynamically constructing and scheduling virtual computational storage devices. To the best of our knowledge, this is the first work to implement a hardware-assisted virtualization mechanism for modern computational storage devices.

Similar Papers
  • Research Article
  • Cited by 4
  • 10.1016/j.eswa.2024.123570
Design and performance analysis of modern computational storage devices: A systematic review
  • Feb 28, 2024
  • Expert Systems With Applications
  • Sushama Annaso Shirke + 2 more


  • Research Article
  • Cited by 32
  • 10.1186/s40537-019-0265-5
Computational storage: an efficient and scalable platform for big data and HPC applications
  • Nov 15, 2019
  • Journal of Big Data
  • Mahdi Torabzadehkashi + 5 more

In the era of big data applications, the demand for more sophisticated data centers and high-performance data processing mechanisms is increasing drastically. Data are originally stored in storage systems. To process data, application servers need to fetch them from storage devices, which imposes the cost of moving data through the system. This cost is directly related to the distance between the processing engines and the data. This is the key motivation for the emergence of distributed processing platforms such as Hadoop, which move processing closer to the data. Computational storage devices (CSDs) push the “move process to data” paradigm to its ultimate boundaries by deploying embedded processing engines inside storage devices to process data. In this paper, we introduce Catalina, an efficient and flexible computational storage platform that provides a seamless environment to process data in place. Catalina is the first CSD equipped with a dedicated application processor running a full-fledged operating system that provides filesystem-level data access for applications. Thus, a vast spectrum of applications can be ported to run on Catalina CSDs. Due to these unique features, to the best of our knowledge, Catalina is the only in-storage processing platform that can be seamlessly deployed in clusters to run distributed applications such as Hadoop MapReduce and HPC applications in place without any modifications to the underlying distributed processing framework. As a proof of concept, we build a fully functional Catalina prototype and a CSD-equipped platform using 16 Catalina CSDs to run Intel HiBench Hadoop and HPC benchmarks and investigate the benefits of deploying Catalina CSDs in distributed processing environments. The experimental results show up to 2.2× improvement in performance and 4.3× reduction in energy consumption for running Hadoop MapReduce benchmarks.
Additionally, thanks to the Neon SIMD engines, the performance and energy efficiency of DFT algorithms are improved up to 5.4× and 8.9×, respectively.

  • Research Article
  • Cited by 1
  • 10.1145/3697352
A Portable Linux-based Firmware for NVMe Computational Storage Devices
  • Feb 8, 2025
  • ACM Transactions on Storage
  • Rick Wertenbroek + 2 more

Over the years, interest in computational storage devices has been growing steadily. This is largely due to the rise of data-intensive applications, such as machine learning, online video distribution, astrophysics, and genomics. Moving compute operations closer to the data provides benefits in terms of scaling possibilities and energy efficiency. The development of computational storage devices has been limited by the need for specialized and complex hardware. In this work, we propose a portable Linux-based firmware framework for the development of NVMe computational storage devices. Our firmware runs on a variety of hardware platforms ranging from expensive FPGA solutions to inexpensive off-the-shelf single board computers. The firmware leverages the vast Linux software ecosystem to facilitate the development and prototyping of novel computational storage devices. We benchmark our firmware on multiple hardware platforms and demonstrate its versatility through several computational examples including a content-aware disk image search engine based on natural language processing and AI-driven image recognition.

  • Research Article
  • Cited by 19
  • 10.1109/jiot.2023.3247640
Health Monitoring and Diagnosis for Geo-Distributed Edge Ecosystem in Smart City
  • Nov 1, 2023
  • IEEE Internet of Things Journal
  • Wu Wen + 6 more

With the increasing number of Internet of Things (IoT) devices being deployed and used in daily life, the load on computational devices has grown exponentially. This situation is especially prevalent in smart cities, where such devices are used for autonomous control and monitoring. Smart cities have different kinds of applications that are aided by IoT devices, which collect data, send it to computational processing and storage devices, and receive decisions or actuate actions based on the input data. There has been a stringent requirement to reduce the end-to-end delay in this process owing to the remote deployment of cloud data centres. This eventually led to the rise of edge computing, wherein nano- and micro-processing devices can be deployed closer to the premises of the smart application and process the data generated with a lower turnaround time. However, due to limited computational power and storage, controlling the workload diverted to edge devices has been challenging. Workload scheduling policies and task allocation schemes often fail to consider the runtime health of edge devices due to a lack of proper monitoring infrastructure. Thus, in this paper, we propose a health monitoring and diagnosis framework for geo-distributed edge clusters processing big data generated by smart city applications. The framework is built over the MapReduce approach for distributed processing of big data on edge clusters deployed across the smart city. Within this framework, a monitoring agent called SmartMonit collects the health statistics of edge devices and predicts potential failures using an artificial neural network-based self-organising maps approach. The proposed framework is deployed over different clusters to evaluate its efficacy in failure detection.

  • Research Article
  • Cited by 22
  • 10.1016/j.memori.2023.100051
A review on computational storage devices and near memory computing for high performance applications
  • Apr 28, 2023
  • Memories - Materials, Devices, Circuits and Systems
  • Dina Fakhry + 3 more


  • Conference Article
  • Cited by 12
  • 10.1145/3400302.3415699
HyperTune
  • Nov 2, 2020
  • Ali Heydarigorji + 5 more

Distributed training is a novel approach to accelerating the training of Deep Neural Networks (DNNs), but common training libraries fall short of addressing the distributed nature of heterogeneous processors or interruption by other workloads on shared processing nodes. This paper describes distributed training of DNNs on computational storage devices (CSDs), which are NAND flash-based, high-capacity data storage devices with internal processing engines. A CSD-based distributed architecture incorporates the advantages of federated learning in terms of performance scalability, resiliency, and data privacy by eliminating unnecessary data movement between the storage device and the host processor. The paper also describes Stannis, a DNN training framework that improves on the shortcomings of existing distributed training frameworks by dynamically tuning the training hyperparameters in heterogeneous systems to maintain the maximum overall processing speed, in terms of processed images per second, and energy efficiency. Experimental results on image classification training benchmarks show up to 3.1× improvement in performance and 2.45× reduction in energy consumption when using Stannis plus CSDs compared to generic systems.
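The core idea of tuning per-node work to each node's speed can be illustrated with a toy sketch: assign per-node batch sizes proportional to measured throughput so that heterogeneous nodes finish each training step at roughly the same time. The function name and numbers below are illustrative, not Stannis's actual algorithm.

```python
# Toy sketch: split a global batch across heterogeneous nodes in
# proportion to their measured throughput (images/sec), so no node
# stalls the synchronous training step. Illustrative only.

def proportional_batches(throughputs, global_batch):
    """Return a per-node batch size proportional to node throughput."""
    total = sum(throughputs.values())
    return {node: round(global_batch * rate / total)
            for node, rate in throughputs.items()}

# Hypothetical measurements: a fast host CPU and two slower CSD engines.
speeds = {"host": 300, "csd0": 60, "csd1": 40}
print(proportional_batches(speeds, global_batch=400))
# {'host': 300, 'csd0': 60, 'csd1': 40}
```

With these numbers, each node processes its share in about one second, so the synchronous step is not bottlenecked by the slowest engine.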

  • Conference Article
  • Cited by 2
  • 10.1109/qsic.2004.15
Conceptual modeling: a key to quality information systems
  • Sep 8, 2004
  • Arne Sølvberg

Summary form only given. Computational devices, communication systems, and storage devices are becoming commodities. Moore's law is still valid, and price/performance for the equipment decreases by the month. Computers are so deeply engrained in the fabric of our societies that they seem to have disappeared as distinct devices. Software, humans, and all kinds of intelligent artifacts are interwoven in information systems of interacting, autonomous subsystems. One of the great challenges ahead is to manage technical complexity. Another is to be able to easily change, in our human societies, what we do and how we do it. Low ability to master technical complexity, together with low ability to change our ways, spells disaster. We need systems that can evolve as the needs and desires of individuals and organizations evolve. We need to build our societies such that they can change as new technology makes new developments possible. It is difficult to forecast the future. We nevertheless speculate about current trends and tendencies and about possible consequences. The quality issue is central. As humans and computers increasingly act together, semantic and pragmatic quality issues become more important. One consequence seems to be that information systems of the future, to a larger degree than contemporary systems, must carry with them an explicit model of the world that they operate in, a model of what the data that they carry stand for. The modeling aspect is discussed with respect to information repositories, information dissemination, and information processing in knowledge-intensive work.

  • Conference Article
  • Cited by 3
  • 10.1109/ics51289.2020.00048
A Hybrid Computational Storage Architecture to Accelerate CNN Training
  • Dec 1, 2020
  • Chun-Zhang Zheng + 1 more

With the rapid development of storage devices, the computational storage device (CSD) can use NAND flash memory to store data and is equipped with a powerful CPU or hardware accelerators to efficiently perform calculations and operations on internal data. In this paper, we utilize the CSD to accelerate CNN training, which mainly comprises convolution, pooling, and training work. We not only reduce the amount of data transmission between the host and the storage by offloading the convolution and pooling work to the CSD, but also bring the training work to the powerful CSD in a hybrid computational storage architecture. According to the experimental results, the amount of data transmission and the total execution time can be reduced significantly.

  • Research Article
  • Cited by 3
  • 10.1145/366786.366788
A preplanned approach to a storage allocating compiler
  • Oct 1, 1961
  • Communications of the ACM
  • Robert W O'Neill

The preplanned approach to the storage allocation problem uses a fixed method of analysis of a problem to produce an efficient computer program incorporating all necessary transfers of information within the multiple levels of storage of the computer throughout the running of the object program. The initial description of the problem may be in any suitable source language (FORTRAN, ALGOL, etc.) but should not require the programmer to account for the limitations imposed by the number, size, and speeds of the computer's storage devices (core, tape, disc, number of data channels, etc.). The object program produced should contain all necessary implementing instructions to utilize all of the computer's storage devices in such a manner as to minimize the cost of the program (i.e., maximize the speed of problem solving).

  • Conference Article
  • Cited by 4
  • 10.1117/12.421067
Toward ubiquitous mining of distributed data
  • Mar 27, 2001
  • Proceedings of SPIE, the International Society for Optical Engineering/Proceedings of SPIE
  • Rajeev Ayyagari + 3 more

The demand for understanding and exploring large quantities of data is growing fast in many domains, scientific research among them. While the role of high-performance computers in scientific data analysis is important, networks of workstations and so-called “thin” computing devices such as laptops, palmtops, and wearable computers are playing increasingly important roles in this domain. This chapter presents an overview of a collection of techniques designed for analyzing heterogeneous data distributed over a network of different computing and storage devices. The collective data mining approach presented here pays careful attention to the overhead of data communication in a heterogeneous network and offers the capability of ubiquitous mining from distributed data.

  • Book Chapter
  • 10.5772/8868
Administration and Monitoring Service for Storage Virtualization in Grid Environments
  • Apr 1, 2010
  • Salam Traboulsi

In this paper, we describe a non-intrusive, scalable administration and monitoring system for high-performance grid environments. It is a fundamental building block for achieving and analyzing the performance of the storage virtualization system in a huge and heterogeneous grid storage environment. It offers a very flexible and simple model that collects node state information and the requirements needed by the other services of our storage virtualization system, improving distributed data storage performance. It is based on a multi-tiered hierarchical architecture for the start-up of the monitoring and administration system.

  • Conference Article
  • Cited by 2
  • 10.1145/3349621.3355734
Poster
  • Oct 4, 2019
  • Paridhika Kayal + 1 more

Fog computing recently emerged as a novel distributed virtualized computing paradigm, where cloud services are extended to the edge of the network, thereby increasing network capacity and reducing latencies for distributed IoT applications. A fog network consists of communication between resource-constrained fog nodes, which are computational, networking, storage, and acceleration devices. By adopting the microservice architecture, applications are designed as collections of independent and loosely coupled modular services, called microservices, installed in application containers. The placement problem is to efficiently allocate limited fog resources to applications with diverse resource requirements. This determines the overall system performance in terms of energy consumption, communication cost, load balancing, and other metrics. Placement of microservices can be done in two ways, which presents a tradeoff between two placement objectives. The first strategy is placing maximally communicating microservices on each fog node. This keeps the chaining costs between the microservices low, but at the same time it leads to high utilization at some fog nodes. This strategy may also be infeasible due to limited resources at fog nodes. The second strategy is to split communicating microservices over a network of fog nodes. This leads to data exchange between the fog nodes, which is referred to as communication cost. This strategy results in a load-balanced system but, at the same time, increases communication costs. We want low energy consumption at fog nodes and low communication costs for the applications. Placement strategies for cloud computing are generally centralized and not well suited for decentralized fog systems. Therefore, distributed solutions with self-organization and management capabilities are required for efficient allocation of fog resources to applications with diverse resource requirements.
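The tradeoff between the two placement strategies can be sketched with a toy cost model. All function names, loads, and costs below are hypothetical illustrations, not the paper's model: co-locating two microservices eliminates communication cost but concentrates load on one node, while splitting balances load at the price of network traffic.

```python
# Toy model: two microservices exchanging `traffic` units of data.
# Strategy 1 (co-locate): both on one fog node, traffic stays local.
# Strategy 2 (split): one per node, traffic crosses the network.

def colocated_cost(cpu_a, cpu_b, comm_unit_cost, traffic):
    # All computation lands on a single node; no inter-node traffic.
    return {"max_node_load": cpu_a + cpu_b, "comm_cost": 0}

def split_cost(cpu_a, cpu_b, comm_unit_cost, traffic):
    # Load is balanced across nodes; inter-service traffic is billed.
    return {"max_node_load": max(cpu_a, cpu_b),
            "comm_cost": comm_unit_cost * traffic}

co = colocated_cost(cpu_a=6, cpu_b=5, comm_unit_cost=0.1, traffic=40)
sp = split_cost(cpu_a=6, cpu_b=5, comm_unit_cost=0.1, traffic=40)
print(co)  # {'max_node_load': 11, 'comm_cost': 0}
print(sp)  # {'max_node_load': 6, 'comm_cost': 4.0}
```

Which strategy wins depends on how heavily the objective weights node utilization against communication cost, which is exactly the tradeoff the abstract describes.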

  • Book Chapter
  • 10.1007/978-0-387-98144-4_2
Computer Storage and Arithmetic
  • Jan 1, 2009
  • James E Gentle

Data represent information at various levels. The form of data, whether numbers, characters, or picture elements, provides different perspectives. Data of whatever form are represented by groups of 0s and 1s, called bits from the words “binary” and “digits”. (The word was coined by John Tukey.) For representing simple text (that is, strings of characters with no special representation), the bits are usually taken in groups of eight, called bytes, or in groups of sixteen, and associated with a specific character according to a fixed coding rule. Because of the common association of a byte with a character, those two words are often used synonymously. For representing characters in bytes, “ASCII” (pronounced “askey”, from American Standard Code for Information Interchange) was the first standard code widely used. At first only English letters, Arabic numerals, and a few marks of punctuation had codes. Gradually over time more and more symbols were given codified representations. Also, because the common character sets differ from one language to another (both natural languages and computer languages), there are several modifications of the basic ASCII code set. When there is a need for more different characters than can be represented in a byte (2^8 = 256), codes that associate characters with larger groups of bits are necessary. For compatibility with the commonly used ASCII codes using groups of 8 bits, these codes usually use groups of 16 bits. These codes for “16-bit characters” are useful for representing characters in some Oriental languages, for example. The Unicode Consortium has developed a 16-bit standard, called Unicode, which is widely used for representing characters from a variety of languages. For any ASCII character, the Unicode representation uses eight leading 0s followed by the same eight bits as the ASCII representation.
An important consideration in the choice of a method to represent data is the way data are communicated within a computer and between the computer and peripheral components such as data storage units. Data are usually treated as a fixed-length sequence of bits. The basic grouping of bits in a computer is sometimes called a “word” or a “storage unit”. The lengths of words or storage units commonly used in computers are 32 or 64 bits.
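The encoding relationships described above can be demonstrated with a short Python sketch, since Python's `str`/`bytes` types follow the same ASCII and Unicode conventions:

```python
# ASCII: one byte per character, code values 0-127.
ascii_bytes = "ASCII".encode("ascii")
print(list(ascii_bytes))  # [65, 83, 67, 73, 73]

# A single byte can distinguish only 2^8 = 256 characters.
print(2 ** 8)  # 256

# Unicode's 16-bit encoding form (UTF-16) uses two-byte units; for an
# ASCII character the high byte is zero -- eight leading 0s followed
# by the same eight bits as the ASCII code.
utf16 = "A".encode("utf-16-be")
print(list(utf16))  # [0, 65]

# Characters outside ASCII get code points above 127.
print(ord("é"))  # 233
```

Running this shows, for example, that `"A"` is byte 65 in ASCII and the pair (0, 65) in big-endian UTF-16, exactly the "eight leading 0s" relationship the chapter describes.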

  • Conference Article
  • Cited by 1
  • 10.1109/iwofc48002.2019.9078468
The Impact of Non-linear NVM Devices on In-Memory Computing
  • Dec 1, 2019
  • Sai Zhang + 4 more

Deep learning has significantly improved the accuracy of large-scale visual and auditory recognition and classification tasks, at the cost of ever-increasing computational resources and storage capacity in hardware. As a result, data communication between the computing and storage units has become the bottleneck in Artificial Intelligence (AI) computation. Emerging resistive-NVM-based in-memory computing architectures have been considered a promising solution to address this issue. However, the non-linearity of NVM devices has a significant impact on computing accuracy. In this paper, a non-linear RRAM is modelled and implemented in various in-memory computing architectures. The results show severe accuracy losses caused by the non-linear reading/writing properties, mismatch, uncertainty, etc. Several promising solutions are also discussed in this paper.

  • Conference Article
  • Cited by 8
  • 10.1109/icwapr.2009.5207485
Infrared face recognition method based on blood perfusion image and Curvelet transformation
  • Jul 1, 2009
  • Zhi-Hua Xie + 3 more

In this paper, a fast infrared face recognition method using blood perfusion conversion and Curvelet transformation is proposed. First, to obtain good infrared face recognition performance from the biological features, thermal images are converted into the blood perfusion domain by a blood perfusion model. Second, the Curvelet transform has better directional and edge representation abilities than the widely used wavelet transformation and other classic transformations. Inspired by these attractive attributes of Curvelets in sparse representation of images, we decompose images into their Curvelet subbands to extract the principal representative features, which reduces computational complexity and storage requirements. Finally, the nearest-neighbor classifier is chosen to produce the recognition result. The experiments illustrate that, compared with traditional methods based on PCA, the proposed method has better performance and requires fewer computations and less memory.
