As we look forward to the exascale era, heterogeneous parallel machines with accelerators, such as GPUs, FPGAs, and upcoming on-chip accelerator cores, are expected to play a major role in the architecture of the largest systems in the world. While there is significant interest in accelerator-based architectures, much of this interest is an artifact of the hype associated with them. This special issue focuses on understanding the implications of accelerators for the architectures and programming environments of future systems. It seeks to ground accelerator research through studies of application kernels or whole applications on such systems, as well as tools and libraries that improve the performance or productivity of applications that use these systems.

For accelerator-based heterogeneous systems to become a truly successful high-performance computing (HPC) platform, it is important that we obtain a complete picture of HPC applications and understand the opportunities and challenges these architectures raise. We need to learn the characteristics of computational kernels and applications, and how different software stacks affect them, in order to guide the design of future accelerator-based HPC systems.

In this special issue, we present case studies on accelerating representative kernels and applications on emerging multicore and manycore systems, including Intel MIC (Many Integrated Core) and GPU architectures. We also demonstrate improved programming-model designs that scale the performance of applications running on HPC systems. Finally, we investigate algorithm designs for large-scale systems and study the power–performance tradeoffs of various optimization techniques on heterogeneous platforms.

In ‘‘Using MIC to accelerate graph traversal’’, Gao et al. describe a highly optimized breadth-first graph traversal algorithm designed for the MIC architecture.
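To make the kernel concrete, the sketch below shows the level-synchronous, frontier-based structure that breadth-first search typically takes on manycore hardware: each iteration expands one frontier, and it is that per-frontier loop that gets parallelized across accelerator threads. This is an illustrative sketch only, not Gao et al.'s implementation; a real MIC version would be written in C/OpenMP with atomic visited-checks.

```python
def bfs_levels(adj, src):
    """Level-synchronous BFS: process one frontier per iteration.

    adj maps each vertex to a list of its out-neighbors. On an
    accelerator, the inner frontier-expansion loop is the parallel
    region, and the check-and-set on `dist` must be atomic.
    """
    dist = {src: 0}
    frontier = [src]
    level = 0
    while frontier:
        level += 1
        next_frontier = []
        for v in frontier:            # parallelized across threads in practice
            for w in adj.get(v, []):
                if w not in dist:     # atomic check-and-set in a parallel version
                    dist[w] = level
                    next_frontier.append(w)
        frontier = next_frontier      # double-buffered frontier swap
    return dist

# Toy directed graph: 0 -> {1, 2}, 1 -> {3}, 2 -> {3}
adj = {0: [1, 2], 1: [3], 2: [3]}
print(bfs_levels(adj, 0))  # → {0: 0, 1: 1, 2: 1, 3: 2}
```

The frontier swap at the end of each level is the natural synchronization point between host and accelerator, which is one reason this formulation maps well to heterogeneous systems.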
The algorithm utilizes both the MIC accelerator and the host CPU, and thus exploits the full capability of the heterogeneous system. Graph traversal is an important kernel for big-data analysis, and we believe their optimized algorithm will help other researchers and practitioners in this area.

In ‘‘Comparison sorting on hybrid multicore architectures for fixed and variable length keys’’, Banerjee et al. present a hybrid comparison-based sorting algorithm that utilizes an NVIDIA GPU and an Intel i7 CPU. The algorithm explores ways to divide and conquer the overall problem, achieving a 20% gain over the best known comparison sorting algorithm. They also use a look-ahead-based approach to sort strings, obtaining around a 24% performance benefit over the best known solution. Sorting has long been a topic of immense research value, and we believe advances in sorting efficiency can have a tremendous impact on many types of applications.

In ‘‘Composing multiple StarPU applications over heterogeneous machines: a supervised approach’’, Hugo et al. propose an extension of StarPU, a runtime system specifically designed for heterogeneous architectures, that allows multiple parallel codes to run concurrently with minimal interference. They introduce a hypervisor that automatically expands or shrinks scheduling contexts (i.e., resource allocations) using feedback from the runtime system. Their mechanism can improve overall application runtime by as much as 34%.

In ‘‘Evaluating the multi-core and many-core architectures through accelerating the 3D LWC stencil’’, You et al. showcase how they accelerate the iterative stencil loops in wave-propagation forward modeling, a computational method widely used in oil and gas exploration. They experiment with architectures including Intel Sandy Bridge, NVIDIA Fermi C2070, NVIDIA Kepler K20x, and the Intel Xeon Phi coprocessor, employing numerous parallel strategies and optimization techniques.
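For readers unfamiliar with iterative stencil loops, the following minimal NumPy sketch shows the computational pattern involved: a 7-point 3D stencil applied repeatedly with double buffering, so each sweep reads one grid and writes another. The coefficients here are illustrative placeholders chosen to sum to one, not the actual 3D LWC scheme (which is higher order); the point is the memory-access pattern that the paper's optimizations target.

```python
import numpy as np

# Illustrative 7-point stencil coefficients (hypothetical; they sum to 1,
# so a constant field is a fixed point of the sweep).
C0 = 0.5
C1 = 1.0 / 12.0

def stencil_step(u):
    """One Jacobi-style sweep of a 7-point 3D stencil.

    Reads `u`, writes a fresh array (double buffering); boundary
    cells are copied through unchanged. In an optimized kernel this
    is where blocking, vectorization, and offload are applied.
    """
    out = u.copy()
    out[1:-1, 1:-1, 1:-1] = (
        C0 * u[1:-1, 1:-1, 1:-1]
        + C1 * (u[2:, 1:-1, 1:-1] + u[:-2, 1:-1, 1:-1]
                + u[1:-1, 2:, 1:-1] + u[1:-1, :-2, 1:-1]
                + u[1:-1, 1:-1, 2:] + u[1:-1, 1:-1, :-2]))
    return out

# A constant field is unchanged by the sweep (coefficients sum to 1).
u = np.ones((4, 4, 4))
assert np.allclose(stencil_step(u), 1.0)
```

Each output point touches six neighbors along the three axes, so the kernel is memory-bandwidth bound, which is why the same loop behaves so differently across Sandy Bridge, Fermi, Kepler, and Xeon Phi.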
They also conduct a cross-platform performance and power analysis.

In ‘‘Analyzing power efficiency of optimization techniques and algorithm design methods for applications on heterogeneous platforms’’, Ukidave et al. evaluate the power/performance efficiency of different optimization techniques and algorithm design methods on heterogeneous platforms.