Algorithms for the $l_1$-minimization ($l_1$-min) problem have been widely developed. The main components of most $l_1$-min algorithms are dense matrix-vector multiplications, $Ax$ and $A^Tx$, together with vector operations. We propose a novel warp-based implementation of the matrix-vector multiplication $Ax$ on the graphics processing unit (GPU), called the GEMV kernel, and a novel thread-based implementation of the matrix-vector multiplication $A^Tx$ on the GPU, called the GEMV-T kernel. In the GEMV kernel, a self-adaptive warp allocation strategy assigns the optimal number of warps to each matrix row. Similarly, the GEMV-T kernel uses a self-adaptive thread allocation strategy to assign the optimal number of threads to each matrix row. Two popular $l_1$-min algorithms, the fast iterative shrinkage-thresholding algorithm and the augmented Lagrangian multiplier method, are taken as examples. Based on the GEMV and GEMV-T kernels, we present two highly parallel $l_1$-min solvers on the GPU that exploit the technique of merging kernels and the sparsity of the solution of the $l_1$-min algorithms. Furthermore, we design a concurrent multiple-$l_1$-min solver on the GPU and optimize its performance using new GPU features such as the shuffle instruction and the read-only data cache. The experimental results validate the high efficiency and good performance of our methods.