Dynamic workload orchestration is a central concern when working with heterogeneous computing infrastructures in the edge-cloud continuum. In this context, FPGA-based computing nodes can leverage their flexibility, performance, and energy efficiency, provided that proper resource management strategies are in place. Many state-of-the-art systems rely on proactive power management techniques and task scheduling decisions, which in turn require deep knowledge of the applications to be accelerated and of the actual response of the target reconfigurable fabrics when executing them. While acquiring this knowledge at design time was feasible in the past, when applications were mostly static task graphs that did not change at run time, the highly dynamic nature of current workloads in the edge-cloud continuum, where tasks can be deployed on any node at any time, has removed this possibility. As a result, deriving such information at run time to make informed decisions has become essential. This paper presents an infrastructure to build incremental ML models that provide run-time power consumption and performance estimations in FPGA-based reconfigurable multi-accelerator systems operating under dynamic workloads. The proposed infrastructure features a novel stop-and-restart, resource-aware mechanism to monitor and control the model training and evaluation stages during normal system operation, enabling low-overhead model updates to account for either unexpected acceleration requests (i.e., tasks not previously considered by the models) or model drift (e.g., fabric degradation). Experimental results show that the proposed approach induces a maximum additional error of 3.66% compared to a continuous training alternative, while incurring only a 4.49% execution time overhead as opposed to the 20.91% overhead of continuous training.
The proposed modeling strategy enables innovative scheduling approaches in reconfigurable systems, as exemplified by the conflict-aware scheduler introduced in this work, which achieves up to a 1.35x speedup when executing the experimental workload. Additionally, the proposed approach demonstrates superior adaptability compared to other methods in the literature, particularly in responding to significant workload changes and in mitigating the effects of model overfitting. The portability of the proposed modeling methodology and monitoring infrastructure is also shown through their application to both Zynq-7000 and Zynq UltraScale+ devices.