Inter-core Communication Research Articles

The emergence of multicore architectures and the chip industry’s plan to roll out hundreds of cores per die sometime in the near future might have triggered the evolution of von Neumann architectures towards a parallel processing paradigm. The capability to have hundreds of cores per die is exciting, but how optimally we are able to utilize such a resource remains a challenge. Since there are no straightforward solutions we seek inspiration from relevant scientific processes. Cellular automata which are inherently decentralized and spatially extended structures provide a potential candidate among parallel processing alternatives. The availability of spatial parallelism on field programmable gate arrays make them the ideal platform to investigate cellular automata systems as potential parallel processing paradigms on multicore architectures. This article presents a massively parallel implementation for a floating-point-based cellular automata using special purpose hardware such as Field Programmable Gate Array (FPGAs). The challenge is to best map an application to the underlying many-core architecture and address issues such as inter-core communication, scalability, and flexibility both in terms of hardware and software. Maxwell — a 64-node FPGA supercomputer, is used for accelerator implementations that range from a single to a multiple FPGA-enabled system. A performance model is proposed and demonstrated to closely reproduce measured execution times. The performance model enables identification of the main sources of overhead and suggests improvements to the architecture and implementation of the lattice Boltzmann method and compute-bound cellular automata in general. Further, a 2 million cell 2DQ9 lattice Boltzmann method lattice with periodic boundary conditions, simulated using a multiple FPGA chip accelerator implementation, is presented. The performance model shows how the FPGA-enabled PC cluster is the preferred multiple FPGA organization over the multiple FPGA-based PC setup. Latency hiding is fully exploited for PC cluster-based system implementations and demonstrated using system profiling.

Read full abstract

This paper studies how to parallelize the emerging media mining workloads on existing small-scale multi-core processors and future large-scale platforms. Media mining is an emerging technology to extract meaningful knowledge from large amounts of multimedia data, aiming at helping end users search, browse, and manage multimedia data. Many of the media mining applications are very complicated and require a huge amount of computing power. The advent of multi-core architectures provides the acceleration opportunity for media mining. However, to efficiently utilize the multi-core processors, we must effectively execute many threads at the same time. In this paper, we present how to explore the multi-core processors to speed up the computation-intensive media mining applications. We first parallelize two media mining applications by extracting the coarse-grained parallelism and evaluate their parallel speedups on a small-scale multi-core system. Our experiment shows that the coarse-grained parallelization achieves good scaling performance, but not perfect. When examining the memory requirements, we find that these coarse-grained parallelized workloads expose high memory demand. Their working set sizes increase almost linearly with the degree of parallelism, and the instantaneous memory bandwidth usage prevents them from perfect scalability on the 8-core machine. To avoid the memory bandwidth bottleneck, we turn to exploit the fine-grained parallelism and evaluate the parallel performance on the 8-core machine and a simulated 64-core processor. Experimental data show that the fine-grained parallelization demonstrates much lower memory requirements than the coarse-grained one, but exhibits significant read-write data sharing behavior. Therefore, the expensive inter-thread communication limits the parallel speedup on the 8-core machine, while excellent speedup is observed on the large-scale processor as fast core-to-core communication is provided via a shared cache. Our study suggests that (1) extracting the coarse-grained parallelism scales well on small-scale platforms, but poorly on large-scale system; (2) exploiting the fine-grained parallelism is suitable to realize the power of large-scale platforms; (3) future many-core chips can provide shared cache and sufficient on-chip interconnect bandwidth to enable efficient inter-core communication for applications with significant amounts of shared data. In short, this work demonstrates proper parallelization techniques are critical to the performance of multi-core processors. We also demonstrate that one of the important factors in parallelization is the performance analysis. The parallelization principles, practice, and performance analysis methodology presented in this paper are also useful for everyone to exploit the thread-level parallelism in their applications.

Read full abstract

Inter-core Communication Research Articles

Related Topics

Articles published on Inter-core Communication

Overhead-aware energy optimization for real-time streaming applications on multiprocessor System-on-Chip

A portable, efficient inter-core communication scheme for embedded multicore platforms

CRITICAL-PATH DRIVEN ROUTERS FOR ON-CHIP NETWORKS

Energy- and Performance-Efficient Communication Framework for Embedded MPSoCs through Application-Driven Release Consistency

Cellular Automata Simulations on a FPGA cluster

Architectural support for thread communications in multi-core processors

A power-aware mapping approach to map IP cores onto NoCs under bandwidth and latency constraints

Hosting an object heap on manycore hardware

A concurrent dynamic analysis framework for multicore hardware

Verifying Communication Protocols Using Live Sequence Chart Specifications

Leveraging Local Intracore Information to Increase Global Performance in Block-Based Design of Systems-on-Chip

A Process-Variation Aware Technique for Tile-Based, Massive Multicore Processors

Clustered Pipelined Multithreading on Commodity Multi-Core Processors

Parallelization Strategies and Performance Analysis of Media Mining Applications on Multi-Core Processors

Topology-Based Performance Analysis and Optimization of Latency-Insensitive Systems

Performance scalability of decoupled software pipelining

Proactive Use of Shared L3 Caches to Enhance Cache Communications in Multi-Core Processors

Corona

Dual-thread Speculation: A Simple Approach to Uncover Thread-level Parallelism on a Simultaneous Multithreaded Processor

Joint throughput and energy optimization for pipelined execution of embedded streaming applications

Lead the way for us

Editage

Paperpal

R Discovery

Mind the Graph

Inter-core Communication Research Articles

Related Topics

Articles published on Inter-core Communication

Overhead-aware energy optimization for real-time streaming applications on multiprocessor System-on-Chip

A portable, efficient inter-core communication scheme for embedded multicore platforms

CRITICAL-PATH DRIVEN ROUTERS FOR ON-CHIP NETWORKS

Energy- and Performance-Efficient Communication Framework for Embedded MPSoCs through Application-Driven Release Consistency

Cellular Automata Simulations on a FPGA cluster

Architectural support for thread communications in multi-core processors

A power-aware mapping approach to map IP cores onto NoCs under bandwidth and latency constraints

Hosting an object heap on manycore hardware

A concurrent dynamic analysis framework for multicore hardware

Verifying Communication Protocols Using Live Sequence Chart Specifications

Leveraging Local Intracore Information to Increase Global Performance in Block-Based Design of Systems-on-Chip

A Process-Variation Aware Technique for Tile-Based, Massive Multicore Processors

Clustered Pipelined Multithreading on Commodity Multi-Core Processors

Parallelization Strategies and Performance Analysis of Media Mining Applications on Multi-Core Processors

Topology-Based Performance Analysis and Optimization of Latency-Insensitive Systems

Performance scalability of decoupled software pipelining

Proactive Use of Shared L3 Caches to Enhance Cache Communications in Multi-Core Processors

Corona

Dual-thread Speculation: A Simple Approach to Uncover Thread-level Parallelism on a Simultaneous Multithreaded Processor

Joint throughput and energy optimization for pipelined execution of embedded streaming applications