Increasing usage of <i>Solid-State Drives</i> (SSDs) has greatly boosted the performance of storage backends. SSDs perform many internal processes, such as out-of-place writes, wear-leveling, and garbage collection. These operations are complex and poorly documented, which makes it difficult to create accurate SSD simulators. Our survey indicates that, aside from requiring complex configuration, available SSD simulators do not support both <i>sync</i> and <i>discard</i> requests. Past performance models also ignore the long-term effect of I/O requests on SSD performance, which has been demonstrated to be significant. In this article, we utilize a machine-learning-based methodology that extracts history-aware features at low cost to train SSD performance models that predict request response times. A key goal of our work is to achieve real-time or near-real-time feature extraction and practical training times, so that our work can be incorporated into solutions that perform online or periodic characterization, such as adaptive storage algorithms. Thus, we extract features from individual read, write, <i>sync</i>, and <i>discard</i> I/O requests and use structures such as exponentially decaying counters to track past activity at <inline-formula><tex-math notation="LaTeX">$O(1)$</tex-math></inline-formula> memory and processing cost. To make our methodology accessible and usable in real-world online scenarios, we focus on machine learning models that can be trained quickly on a single machine. To greatly reduce processing and memory cost, we utilize feature selection to reduce the feature count by up to 63%, allowing a feature extraction rate of 313,000 requests per second using a single thread. Our dataset contains 580M requests taken from 35 workloads. We experiment with three families of machine learning models: a) decision trees, b) ensemble methods utilizing decision trees, and c) <i>Feedforward Neural Networks</i> (FNN).
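The exponentially decaying counter mentioned above can be illustrated with a minimal sketch. The abstract does not give the authors' exact implementation; the class name, half-life parameterization, and lazy-decay-on-access design below are our own assumptions, chosen to show how past I/O activity can be tracked with a single value and timestamp, i.e., O(1) memory and per-request processing cost.

```python
# Illustrative sketch (not the paper's implementation): an exponentially
# decaying counter tracking past I/O activity in O(1) memory and time.
class DecayingCounter:
    def __init__(self, half_life):
        self.half_life = half_life  # time after which the value halves (hypothetical parameter)
        self.value = 0.0            # decayed accumulated activity
        self.last_t = 0.0           # timestamp of the last update

    def add(self, t, amount=1.0):
        # Lazily decay the accumulated value for the elapsed time, then add.
        elapsed = t - self.last_t
        self.value *= 0.5 ** (elapsed / self.half_life)
        self.value += amount
        self.last_t = t

    def read(self, t):
        # Decay up to the query time without mutating state.
        elapsed = t - self.last_t
        return self.value * 0.5 ** (elapsed / self.half_life)

# Example: count recent write requests with a 100 ms half-life.
writes = DecayingCounter(half_life=0.1)
writes.add(t=0.00)
writes.add(t=0.05)
print(writes.read(t=0.10))  # -> ~1.207: older requests contribute less
```

A feature extractor could keep one such counter per request type (read, write, <i>sync</i>, <i>discard</i>) and per tracked quantity, giving history-aware features whose cost does not grow with trace length.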
Based on these experiments, the FNN achieves an average <inline-formula><tex-math notation="LaTeX">$R^2$</tex-math></inline-formula> score of 0.72, compared to 0.61 and 0.45 for the Random Forest and Bagging models, respectively, where <inline-formula><tex-math notation="LaTeX">$R^2 \in (-\infty, 1]$</tex-math></inline-formula> and a score of 1 indicates a perfect fit. However, while the Random Forest model has lower accuracy, it runs on general-purpose processing hardware and can be trained much faster, making it viable for use in online scenarios.
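For readers unfamiliar with the metric, the standard coefficient of determination underlying the scores above can be sketched as follows; this is the textbook definition, not code from the paper, and it shows why the score is at most 1 but unbounded below.

```python
# Standard R^2 (coefficient of determination): 1 - SS_res / SS_tot.
# A model worse than predicting the mean yields a negative score.
def r2_score(y_true, y_pred):
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

print(r2_score([1.0, 2.0, 3.0], [1.0, 2.0, 3.0]))  # perfect fit -> 1.0
print(r2_score([1.0, 2.0, 3.0], [3.0, 3.0, 3.0]))  # worse than the mean -> -1.5
```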