When mapping C programs to hardware, high-level synthesis (HLS) tools reorder independent instructions, aiming to obtain a schedule that requires as few clock cycles as possible. However, when synthesizing multithreaded C programs, reordering opportunities are limited by the presence of atomic operations (“atomics”), the fundamental concurrency primitives in C. Existing HLS tools analyze and schedule each thread in isolation. In this article, we argue that thread-local analysis is overly conservative, especially since HLS compilers have access to the entire program. Hence, we propose a global analysis that exploits information about memory accesses by all threads when scheduling each thread. Implemented in the LegUp HLS tool, our analysis is sensitive to sequentially consistent (SC) and weak atomics and supports loop pipelining. Since the semantics of C atomics is complicated, we formally verify that our analysis correctly implements the C memory model using the Alloy model checker. Compared with thread-local analysis, our global analysis achieves a 2.3× average speedup on a set of lock-free data structures and data-flow patterns. We also apply our analysis to a larger application: a lock-free, streamed, and load-balanced implementation of Google's PageRank, where we see a 1.3× average speedup compared with the thread-local analysis.
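To make the reordering constraint concrete, the following is a minimal sketch in C11, not code from the article. The variables `payload`, `scratch`, and `ready`, and the functions `producer`, `consumer`, and `run_once`, are hypothetical names chosen for illustration, and POSIX threads are assumed to be available. It shows one situation in which an analysis that sees only one thread would have to preserve an ordering that a whole-program view could show to be unobservable.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>

/* Hypothetical shared state for illustration only. */
static int payload;       /* plain data handed from producer to consumer */
static int scratch;       /* written by the producer thread and by no one else */
static atomic_int ready;  /* flag published with an SC atomic store */

static void *producer(void *arg) {
    (void)arg;
    /* The consumer reads payload after observing ready == 1, so this
     * write must stay ahead of the atomic store under any analysis. */
    payload = 42;
    /* Looking at this thread alone, a scheduler cannot rule out that some
     * other thread reads scratch after observing ready == 1, so it would
     * keep this write ahead of the atomic store too. With the whole
     * program in view, no other thread accesses scratch, so the relative
     * order of this write and the store is not observable. */
    scratch = 7;
    atomic_store(&ready, 1);  /* sequentially consistent by default */
    return NULL;
}

static void *consumer(void *arg) {
    int *out = arg;
    /* Spin until the flag is set; the SC load synchronizes with the store. */
    while (atomic_load(&ready) == 0) {
    }
    *out = payload;  /* guaranteed to see 42 once ready == 1 was observed */
    return NULL;
}

/* Runs one producer/consumer handoff and returns the value the consumer saw. */
int run_once(void) {
    pthread_t p, c;
    int seen = 0;
    payload = 0;
    scratch = 0;
    atomic_store(&ready, 0);
    pthread_create(&c, NULL, consumer, &seen);
    pthread_create(&p, NULL, producer, NULL);
    pthread_join(p, NULL);
    pthread_join(c, NULL);
    return seen;
}
```

Calling `run_once()` from a `main` function returns 42 on every run, because the write to `payload` happens before the consumer's read through the atomic flag. Whether a particular HLS tool would actually exploit the freedom on `scratch` depends on its analysis; the sketch only illustrates the kind of information a global view makes available.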