Abstract

Customized computing has been of interest to the research community for over three decades. That interest has intensified in recent years as power and energy have become significant limiting factors for the computing industry. For example, the energy consumed by the datacenters of some large internet service providers is well over 10^9 kilowatt-hours. FPGA-based acceleration has shown 10–1000X performance/energy efficiency gains over general-purpose processors in many applications. However, programming FPGAs as computing devices remains a significant challenge. Most accelerators are designed using manual RTL coding. Recent progress in high-level synthesis (HLS) has improved programming productivity considerably: one can quickly implement functional blocks written in high-level programming languages such as C or C++ instead of RTL. But when using an HLS tool for accelerated computing, the programmer still faces many design decisions, such as the implementation choice for each module and the communication schemes between modules, and has to implement additional logic for data management, such as memory partitioning, data prefetching, and data reuse. Extensive source code rewriting is often required to achieve high-performance acceleration with existing HLS tools. In this talk, I shall present ongoing work at UCLA to enable further automation for customized computing. One effort is on automated compilation that combines source-code-level transformation for HLS with efficient generation of parameterized architecture templates. I shall highlight our progress on loop restructuring and code generation; memory partitioning; data prefetching and reuse; and combined module selection, duplication, and scheduling with communication optimization. These techniques allow the programmer to easily compile computation kernels to FPGAs for acceleration.
Another direction is to develop efficient runtime support for scheduling and transparent resource management to integrate FPGAs for datacenter-scale acceleration, which is becoming a reality (for example, Microsoft recently used over 1,600 servers with FPGAs to accelerate its search engine and reported very encouraging results). Our runtime system provides scheduling and resource management support at multiple levels, including the server-node level, the job level, and the datacenter level, so that programmers can use existing programming interfaces, such as MapReduce or Hadoop, for large-scale distributed computation.
