Abstract

Many shared‐memory parallel irregular applications, such as sparse linear algebra and graph algorithms, depend on efficient fork‐join loop scheduling (LS), even though the work per loop iteration can vary greatly with the application and the input. Because LS is so important, many methods (e.g., workload‐aware self‐scheduling) and parameters (e.g., chunk size) have been explored to achieve reasonable performance, and many of these methods require expert knowledge of the application and input before runtime. This work proposes a new LS method that requires little to no expert knowledge yet achieves speedups close to those of tuned LS methods by self‐managing chunk size based on a throughput heuristic and using work‐stealing to recover from workload imbalances. The method, named iCh, is implemented in libgomp for testing. It is evaluated against OpenMP's guided, dynamic, and taskloop methods, as well as BinLPT and generic work‐stealing, on a set of applications comprising a synthetic benchmark, breadth‐first search, K‐Means, the molecular dynamics code LavaMD, and sparse matrix‐vector multiplication. On a 28‐thread Intel system, iCh is the only method that is always among the top three LS methods. On average across all applications, iCh is within of the best method and even outperforms the other LS methods for breadth‐first search and K‐Means.