Enabling near-data processing in distributed object storage systems

Ian F. Adams, Neha Agrawal, Michael P. Mesnier

doi:10.1145/3465332.3470881

Abstract

Most general-purpose distributed storage systems are not designed with near-data processing (NDP) in mind. They do not respect semantic data boundaries when writing data; for example, a single record may be split across servers. This reduces NDP effectiveness by requiring data collation before computation. While semantic data awareness and NDP functions can be retroactively added to existing distributed storage, doing so is often complex and difficult in practice. We propose sharing storage system layout information with data writers so they can adjust data layouts to prevent data alignment issues regardless of the underlying architectures. Doing so simplifies NDP implementation by reducing the need for data reassembly, and reduces the need for complex storage system or application extensions. We demonstrate a hinting mechanism on both HDFS with computational block storage and an erasure-coded MinIO deployment, reducing data movement by up to 99% when querying CSV data with NDP co-located with the stored data. This was accomplished purely with client-side data alignment, with no modifications to the server-side write paths and no inter-node collation of data.
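The client-side alignment idea can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the `block_size` hint, and the choice of newline padding are all assumptions made for illustration. Here the storage system is assumed to expose a single boundary size in bytes, and the writer pads the stream so that no newline-terminated CSV record straddles a boundary. A reader processing one block locally would then need to skip the empty lines the padding introduces.

```python
def align_records(records, block_size, pad=b"\n"):
    """Pack newline-terminated records so none crosses a block boundary.

    `block_size` stands in for a layout hint from the storage system
    (hypothetical; a real hint might describe blocks or erasure-code shards).
    """
    out = bytearray()
    for rec in records:
        if len(rec) > block_size:
            raise ValueError("record is larger than one block")
        used = len(out) % block_size
        # If the record would spill into the next block, pad to the boundary.
        if used + len(rec) > block_size:
            out.extend(pad * (block_size - used))
        out.extend(rec)
    return bytes(out)


def split_blocks(data, block_size):
    """Cut the byte stream the way a layout-unaware store would."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def records_in_block(block):
    """Parse one block in isolation, skipping empty padding lines."""
    return [line for line in block.split(b"\n") if line]
```

With this layout, each block can be parsed independently, which is the property that lets a near-data function run without first collating fragments from other nodes. The cost is the padding itself, which grows as records approach the block size.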

Full Text