Abstract

Deep Learning methods are identified as a key opportunity for processing extreme-scale datasets. Processing such datasets efficiently requires the ability to store petabytes of data and to access them quickly. Hierarchical storage architectures are a promising technology for this: they provide high capacity while allowing faster access to frequently used data. Using them efficiently is hard, however, because the faster layers usually offer lower capacity. One way to overcome this bottleneck is staging, in which frequently used data are temporarily placed in a faster memory layer for quicker access. In this work, we evaluate four different staging techniques for two Deep Learning use cases that are very challenging for the underlying I/O system. We analyze and evaluate these methods on three different staging layers: local SSDs, local SSDs clustered into a parallel file system, and a dedicated storage server. We also evaluate the performance of staging data in DRAM. The best performance is reached with specialized solutions; still, we developed a technique called split staging that achieves comparable performance. Our results also show that performance often depends more on the data layout than on the storage layer.
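
To make the staging idea concrete, the following minimal sketch (not the paper's implementation; all paths and names are hypothetical) copies a dataset from a shared parallel file system to a node-local SSD before training, so that the many small random reads of a Deep Learning input pipeline hit the faster layer instead of the shared storage.

```python
import shutil
from pathlib import Path

# Hypothetical paths: a dataset on the shared parallel file system and a
# node-local SSD directory used as the staging target (illustrative only).
PFS_DATASET = Path("/pfs/project/train_data")
LOCAL_STAGE = Path("/local_ssd/stage/train_data")

def stage_dataset(src: Path, dst: Path) -> Path:
    """Copy the dataset to the faster local layer once, before training starts."""
    if dst.exists():
        return dst  # already staged on this node
    dst.parent.mkdir(parents=True, exist_ok=True)
    shutil.copytree(src, dst)  # bulk copy; real setups may use parallel copy tools
    return dst

if __name__ == "__main__":
    data_root = stage_dataset(PFS_DATASET, LOCAL_STAGE)
    # The training pipeline then reads from data_root instead of the parallel
    # file system, so random accesses are served by the local SSD.
    print(f"training reads from: {data_root}")
```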
