Memory Bandwidth Constraints Research Articles

The maximum common subgraph of two graphs is the largest possible common subgraph, i.e., the common subgraph with as many vertices as possible. Although this problem is very challenging, having long been proven NP-hard, its countless practical applications still motivate the search for exact solutions. This work discusses the possibility of extending an existing, very effective branch-and-bound procedure to parallel multi-core and many-core architectures. We analyze a parallel multi-core implementation that exploits a divide-and-conquer approach based on a thread pool, which does not degrade the original algorithmic efficiency and minimizes data structure replication. We also extend the original algorithm to parallel many-core GPU architectures using the CUDA programming framework, and we show how to handle the heavy workload imbalance and the massive data dependency. Then, we suggest new heuristics to reorder the adjacency matrix, to deal with “dead-ends”, and to randomize the search with automatic restarts. These heuristics can achieve significant speed-ups on specific instances, even if they may not be competitive with the original strategy on average. Finally, we propose a portfolio approach, which integrates all the different local search algorithms as component tools; rather than choosing the best tool for a given instance up-front, this portfolio makes the decision on-line. The proposed approach drastically limits memory bandwidth constraints and avoids other typical portfolio fragilities, as the CPU and GPU versions often show complementary efficiency and run on separate platforms. Experimental results support the claims and motivate further research to better exploit GPUs in embedded task-intensive and multi-engine parallel applications.
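
As an illustration of the branch-and-bound search and the thread-pool divide-and-conquer described above, the sketch below computes a maximum common induced subgraph. It assumes the induced variant of the problem and a McSplit-style partition bound (unmatched vertices grouped into classes by their adjacency to already-matched pairs); the abstract does not name the base procedure, so that choice and all names (`make_adj`, `max_common_induced_subgraph`, `workers`) are hypothetical. Only the top-level branches are handed to the pool, and workers share nothing but the incumbent solution. In CPython the global interpreter lock prevents real CPU parallelism for this code, so the pool shows the decomposition, not a speed-up.

```python
from concurrent.futures import ThreadPoolExecutor
import threading


def make_adj(n, edges):
    """Build an undirected adjacency list (list of sets) for vertices 0..n-1."""
    adj = [set() for _ in range(n)]
    for a, b in edges:
        adj[a].add(b)
        adj[b].add(a)
    return adj


def max_common_induced_subgraph(adj_g, adj_h, workers=4):
    """Return a largest list of (g_vertex, h_vertex) pairs inducing isomorphic subgraphs."""
    best = [[]]              # shared incumbent, guarded by `lock`
    lock = threading.Lock()

    def bound(domains, size):
        # Each class can contribute at most min(|G side|, |H side|) further pairs.
        return size + sum(min(len(gs), len(hs)) for gs, hs in domains)

    def refine(domains, v, w):
        # After matching v -> w, split every class by adjacency to v (in G)
        # and to w (in H); only vertices with the same label stay matchable.
        new = []
        for gs, hs in domains:
            for adjacent in (True, False):
                g2 = [x for x in gs if x != v and (x in adj_g[v]) == adjacent]
                h2 = [y for y in hs if y != w and (y in adj_h[w]) == adjacent]
                if g2 and h2:
                    new.append((g2, h2))
        return new

    def search(domains, mapping):
        with lock:
            if len(mapping) > len(best[0]):
                best[0] = list(mapping)
            incumbent = len(best[0])
        if bound(domains, len(mapping)) <= incumbent:
            return  # prune: this subtree cannot beat the incumbent
        # Branch on the class with the smallest larger side (a common heuristic).
        idx = min(range(len(domains)),
                  key=lambda i: max(len(domains[i][0]), len(domains[i][1])))
        gs, hs = domains[idx]
        v = min(gs)
        for w in hs:
            mapping.append((v, w))
            search(refine(domains, v, w), mapping)
            mapping.pop()
        # Last branch: leave v unmatched.
        rest = [x for x in gs if x != v]
        reduced = list(domains)
        if rest:
            reduced[idx] = (rest, hs)
        else:
            del reduced[idx]
        search(reduced, mapping)

    if not adj_g or not adj_h:
        return []
    gs, hs = list(range(len(adj_g))), list(range(len(adj_h)))
    root = [(gs, hs)]
    v = gs[0]  # the same vertex `search` would pick at the root
    # Divide and conquer: each top-level branch becomes an independent task
    # that owns its own mapping list and shares only the incumbent.
    tasks = [(refine(root, v, w), [(v, w)]) for w in hs]
    tasks.append(([(gs[1:], hs)] if len(gs) > 1 else [], []))
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for fut in [pool.submit(search, d, m) for d, m in tasks]:
            fut.result()  # propagate any exception raised in a worker
    return sorted(best[0])
```

For example, a triangle and a three-vertex path share at most one edge as an induced subgraph, so `max_common_induced_subgraph(make_adj(3, [(0, 1), (1, 2), (0, 2)]), make_adj(3, [(0, 1), (1, 2)]))` returns a two-pair mapping. Splitting only at the root keeps the tasks independent, but the branches can differ greatly in size, which is one face of the workload imbalance the abstract mentions.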

Processor architectures have taken a turn toward many-core processors, which integrate multiple processing cores on a single chip to increase overall performance, and there are no signs that this trend will stop in the near future. Many-core processors are harder to program than multicore and single-core processors because they require parallel or concurrent programs with high degrees of parallelism. Moreover, many-cores have to operate in a mode of strong scaling because of memory bandwidth constraints. In strong scaling, increasingly finer-grain parallelism must be extracted in order to keep all processing cores busy. Task dataflow programming models have a high potential to simplify parallel programming because they relieve the programmer from identifying precisely all intertask dependences when writing programs. Instead, the task dataflow runtime system detects and enforces intertask dependences during execution based on the description of the memory accessed by each task. The runtime constructs a task dataflow graph that captures all tasks and their dependences. Tasks are scheduled to execute in parallel, taking into account the dependences specified in the task graph. Several papers report significant overheads for task dataflow systems, which severely limit the scalability and usability of such systems. In this article, we study efficient schemes to manage task graphs and analyze their scalability. We assume a programming model that supports input, output, and in/out annotations on task arguments, as well as commutative in/out and reductions. We analyze the structure of task graphs and identify versions and generations as key concepts for efficient management of task graphs. Then, we present three schemes to manage task graphs building on graph representations, hypergraphs, and lists. We also consider a fourth, edgeless scheme that synchronizes tasks using integers. Analysis using microbenchmarks shows that the graph representation is not always scalable and that the edgeless scheme introduces the least overhead in nearly all situations.
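
To make the runtime's job concrete, the sketch below derives dependences from input, output, and in/out annotations by keeping, per memory object, its last writer and the readers since that write, then groups tasks into waves that could run in parallel. This is a generic explicit-graph baseline written for illustration, not any of the article's four schemes; commutative in/out and reductions are omitted, and the names `DependenceTracker`, `submit`, and `waves` are hypothetical.

```python
from collections.abc import Hashable, Iterable
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class _ObjectState:
    last_writer: Optional[Hashable] = None
    readers: List[Hashable] = field(default_factory=list)  # readers since the last write


class DependenceTracker:
    """Explicit task-graph construction from in / out / inout annotations."""

    def __init__(self):
        self.objects = {}  # memory object -> _ObjectState
        self.preds = {}    # task id -> set of predecessor ids, in submission order

    def submit(self, task_id, ins: Iterable = (), outs: Iterable = (), inouts: Iterable = ()):
        ins, outs, inouts = list(ins), list(outs), list(inouts)
        deps = set()
        for obj in ins + inouts:  # the task reads obj
            st = self.objects.setdefault(obj, _ObjectState())
            if st.last_writer is not None:
                deps.add(st.last_writer)  # read-after-write
        for obj in outs + inouts:  # the task writes obj
            st = self.objects.setdefault(obj, _ObjectState())
            deps.update(st.readers)  # write-after-read
            if st.last_writer is not None:
                deps.add(st.last_writer)  # write-after-write
        # Update per-object state only after all dependences are collected.
        for obj in ins:
            self.objects[obj].readers.append(task_id)
        for obj in outs + inouts:
            st = self.objects[obj]
            st.last_writer = task_id
            st.readers = []
        deps.discard(task_id)
        self.preds[task_id] = deps
        return deps

    def waves(self):
        """Group tasks into waves; tasks within one wave are mutually independent."""
        level = {}
        for t, ps in self.preds.items():  # submission order is a topological order
            level[t] = 1 + max((level[p] for p in ps), default=-1)
        grouped = {}
        for t, lv in level.items():
            grouped.setdefault(lv, []).append(t)
        return [grouped[lv] for lv in sorted(grouped)]
```

For instance, a writer `A` of `x`, two readers `B` and `C`, and an in/out task `D` on `x` give `D` the predecessors `{A, B, C}`, while an unrelated writer `E` of `y` lands in the first wave next to `A`. In this baseline the edge count grows with the number of readers per object, which is one plausible source of the graph-management overhead the abstract refers to.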

Articles published on Memory Bandwidth Constraints

Memory Bandwidth Efficient Design for Super-Resolution Accelerators With Structure Adaptive Fusion and Channel-Aware Addressing

The Maximum Common Subgraph Problem: A Parallel and Multi-Engine Approach

On the strong scalability of maritime CFD

Bandwidth-Aware On-Line Scheduling in SMT Multicores

Analysis of dependence tracking algorithms for task dataflow execution

Real-Time Computation of Local Neighborhood Functions in Application-Specific Instruction-Set Processors

FPGA Architecture for 2D Discrete Fourier Transform Based on 2D Decomposition for Large-sized Data

A Design Space Exploration Algorithm in Compiling Window Operation onto Reconfigurable Hardware
