Abstract

To capitalize on the growing abundance of multicore hardware, FPGA vendors have begun to parallelize the most compute intensive algorithms in their CAD software. However, parallelization is a painstaking and hence expensive process that limits the number of algorithms that can be cost-effectively parallelized. Transactional Memory (TM) promises an easier-to-use alternative to locks for critical sections in threaded code—allowing programmers to avoid deadlocks and data races, and also allowing critical sections to execute in parallel as long as they dynamically access independent data. In this paper, we present our work on using TM to parallelize simulated annealing-based placement for FPGAs. In particular, we use a software TM (TinySTM) to parallelize the placement phase of Versatile Place and Route (VPR) 5.0.2 [1]. With TM we very quickly produced a parallel and correct version of the software, allowing us to focus on incrementally tuning performance. We describe our experiences in tuning the TM system and CAD software, and the interesting algorithmic trade-offs that exist. In the end, we found that optimized transactional placement has the potential for scalable performance: our non-deterministic implementation achieves self-relative speedups over a single thread of 1.82x, 3.62x and 7.27x at 2, 4, and 8 threads respectively with little quality degradation. However, hardware support for TM is likely required to overcome the overheads of STM, as our implementation's single thread performance is 8x slower than sequential VPR.

Full Text
Published version (Free)

Talk to us

Join us for a 30 min session where you can share your feedback and ask us any queries you have

Schedule a call