Abstract
The complex memory hierarchies of today's machines make it difficult to estimate the execution time of tasks: depending on where their data is placed in memory, tasks of the same type may end up with very different performance. Several scheduling heuristics have improved performance by taking memory-related properties such as data locality and cache sharing into account. However, tasks in certain applications, or in certain phases of an application, may take little or no advantage of these optimizations, and without understanding when such optimizations are effective we risk introducing unnecessary overhead at the runtime level. In previous work, we introduced TaskInsight, a technique to characterize how the memory behavior of an application is affected by different task schedulers through the analysis of data reuse across tasks. We now use this tool to dynamically trace the scheduling decisions of multithreaded applications through their execution and analyze how memory reuse can reveal when and why locality-aware optimizations are effective and how they impact performance. By applying TaskInsight to several of the Montblanc benchmarks, we demonstrate how we can detect the particular scheduling decisions that produced a variation in performance, and the underlying reasons. This flexible insight is key for both the programmer and the runtime to assign the optimal scheduling policy to particular executions or phases.
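
As a rough illustration of the kind of cross-task data-reuse analysis described above (a minimal sketch, not TaskInsight's actual implementation), the Python fragment below classifies each task's accesses as first touches, "fresh" reuses of data touched by a recently scheduled task, or "old" reuses of data touched much earlier and therefore likely evicted. The trace format, the window parameter and the name classify_reuse are assumptions made for illustration only.

    from collections import defaultdict

    def classify_reuse(task_trace, window):
        """Classify each task's accesses as 'new' (first touch),
        'fresh' (last touched within `window` preceding tasks, likely
        still cached) or 'old' (last touched longer ago, likely evicted).

        task_trace: (task_index, cache_line) pairs in the order the
        scheduler executed the tasks; `window` approximates how many
        tasks' working sets the shared cache can hold."""
        last_use = {}                     # cache_line -> index of last touching task
        stats = defaultdict(lambda: {"new": 0, "fresh": 0, "old": 0})
        for task, line in task_trace:
            prev = last_use.get(line)
            if prev is None:
                stats[task]["new"] += 1
            elif task - prev <= window:
                stats[task]["fresh"] += 1
            else:
                stats[task]["old"] += 1
            last_use[line] = task
        return stats

    # Toy example: task 2 reuses a line last touched by task 0.
    trace = [(0, 0x10), (0, 0x18), (1, 0x20), (2, 0x10)]
    print({t: dict(s) for t, s in classify_reuse(trace, window=1).items()})

Under a locality-aware schedule a larger share of accesses should fall into the "fresh" bucket; this is the intuition behind relating scheduling decisions to memory reuse.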
Highlights
Scheduling tasks in task-based applications has become significantly more difficult due to overall system complexity and deep shared memory hierarchies
The graph shows the percentage of the population as a function of slowdown, summarizing how many of the experiments have a slowdown larger than X%. Benchmarks such as fft, cholesky, reduction and n-body show high performance variation across a significant number of their configurations: 40% of the executions of fft show more than 60% performance difference when changing the scheduling policy; for reduction, 30% of the executions show over 30% performance difference; and for cholesky, 40% show differences of over 30%
By combining schedule-independent memory access profiling and schedule-specific hardware performance counter data we are able to identify which scheduling decisions impact performance, when they happen, and why they cause a problem (see the sketch after these highlights)
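
The combination named in the last highlight can be made concrete with a small sketch that joins a schedule-independent reuse profile (for example, the per-task "fresh"/"old" counts from the earlier sketch) with schedule-specific hardware counters. The per-task counter fields (cache_accesses, cache_misses), the thresholds and the name flag_suspect_tasks are hypothetical; the actual analysis relies on the platform's real performance counters.

    def flag_suspect_tasks(reuse_stats, counters,
                           old_fraction_threshold=0.5, miss_rate_threshold=0.05):
        """Report tasks where a large share of 'old' reuse (from the
        schedule-independent profile) coincides with a high measured miss
        rate (from schedule-specific counters)."""
        suspects = []
        for task, s in reuse_stats.items():
            reused = s["fresh"] + s["old"]
            if reused == 0 or task not in counters:
                continue
            old_fraction = s["old"] / reused
            c = counters[task]
            miss_rate = c["cache_misses"] / max(c["cache_accesses"], 1)
            if old_fraction >= old_fraction_threshold and miss_rate >= miss_rate_threshold:
                suspects.append((task, old_fraction, miss_rate))
        return suspects

    # Hypothetical inputs: per-task reuse counts and per-task counter samples.
    reuse = {2: {"new": 0, "fresh": 1, "old": 9}}
    counters = {2: {"cache_accesses": 1000, "cache_misses": 120}}
    print(flag_suspect_tasks(reuse, counters))   # -> [(2, 0.9, 0.12)]

Tasks flagged this way point to the scheduling decisions where a different execution order separated a task from the data it reuses, which is when locality-aware policies pay off.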
Summary
Scheduling tasks in task-based applications has become significantly more difficult due to overall system complexity and deep shared memory hierarchies. Developers of task-based applications often blame performance degradation on data locality and attempt to characterize their workload based on data reuse without considering the dynamic interaction between the scheduler and the caches [3, 10]. This is because there has been no way to obtain precise information on how data was reused throughout the execution of the application, such as how long it remained in the caches and how the scheduling decisions influenced this reuse. We show how applying TaskInsight to the widely adopted Montblanc benchmarks reveals deep insight into why scheduling changed the memory behavior of applications, which is key to understanding performance variation across different executions. We cover related previous work (Section 3) and conclude with remarks on how the TaskInsight analysis enables us to understand other behaviors across the benchmarks and schedulers (Conclusion)