Thread Management Research Articles

Simultaneous Multithreading (SMT) architectures have been proposed to better exploit on-chip parallelism, which is central to performance improvement in modern processors. SMT overcomes the limits of a single thread by fetching and executing instructions from multiple threads that share the core's resources. Long-latency operations, however, still cause inefficiency in SMT processors. When an instruction must wait for data from a lower level of the memory hierarchy, its dependent instructions cannot proceed and therefore keep occupying shared on-chip resources for many clock cycles. This introduces undesired inter-thread interference, which in turn degrades overall system throughput and average thread performance. In practice, instruction fetch policies are responsible for assigning thread priority at the fetch stage. Their goal is to distribute the shared resources among the threads of a core so that long-latency operations and other runtime thread behavior are handled with better performance. In this paper we propose RUCOUNT, an instruction fetch policy that considers the resource utilization of each individual thread during prioritization. The proposed policy observes instructions in the front-end stages of the pipeline as well as low-level data misses to summarize each thread's resource utilization. Higher priority is granted to the thread(s) using fewer resources, so that overall resources are distributed more efficiently in SMT processors. As a result, RUCOUNT has two unique features compared with other policies: it observes hardware resources comprehensively, and it monitors limited resource entries. Our experimental results show that, in terms of average performance, RUCOUNT is 20% better than ICOUNT, 10% better than Stall, 8% better than DG, and 3% better than DWarn.
Because its hardware overhead is comparable to that of ICOUNT and DWarn, our proposed instruction fetch policy RUCOUNT is superior among the studied policies.
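
To make the prioritization idea concrete, the following Python sketch contrasts ICOUNT-style ordering (fewest front-end instructions first) with a resource-utilization score that also counts pending data misses. The abstract does not give RUCOUNT's actual formula, so the `ThreadState` fields, the capacity normalization, and the function names are hypothetical assumptions chosen for illustration; this is not the authors' method.

```python
from dataclasses import dataclass


@dataclass
class ThreadState:
    tid: int
    frontend_insts: int      # instructions in decode/rename/queue stages (ICOUNT's counter)
    outstanding_misses: int  # pending low-level (e.g., L2) data misses


def icount_order(threads):
    """ICOUNT-style: fewer front-end instructions -> higher fetch priority.
    Ties are broken by thread id."""
    return [t.tid for t in sorted(threads, key=lambda t: (t.frontend_insts, t.tid))]


def utilization_order(threads, frontend_capacity, miss_capacity):
    """Hypothetical utilization score (an assumption, not RUCOUNT's formula):
    each counter is normalized by the capacity of the resource it occupies,
    and the lowest total score gets the highest fetch priority."""
    def score(t):
        return (t.frontend_insts / frontend_capacity
                + t.outstanding_misses / miss_capacity)
    return [t.tid for t in sorted(threads, key=lambda t: (score(t), t.tid))]


threads = [ThreadState(0, 10, 3), ThreadState(1, 14, 0)]
print(icount_order(threads))              # [0, 1]
print(utilization_order(threads, 32, 4))  # [1, 0]
```

With an assumed front-end capacity of 32 entries and 4 miss-tracking entries, thread 0 has fewer front-end instructions (10 vs. 14), so ICOUNT favors it. Its three pending misses, however, give it a score of 10/32 + 3/4 = 1.0625 versus 14/32 = 0.4375 for thread 1, so the utilization-based order favors thread 1 instead. This illustrates why observing low-level misses, and not only front-end occupancy, can change the fetch decision.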

Traditionally, runtime management, including CPU sharing and real-time scheduling, is provided by the runtime environment (typically an operating system) using hardware support such as timers and interrupts. However, because of the stringent performance requirements of network processors, neither OS nor hardware mechanisms are typically feasible or available. Mapping packet-processing tasks onto network processors involves complex trade-offs to maximize parallelism and pipelining. As the size of the code store and the complexity of application requirements grow, network processors are being programmed with heterogeneous threads that may execute code belonging to different tasks on a given micro-engine. Moreover, most network applications are streaming applications that are typically processed in a pipelined fashion, so the tasks on different micro-engines are pipelined to maximize throughput. The tasks themselves may have different runtime performance demands. In this article, we focus on network processors whose hardware can schedule threads only in a round-robin fashion and which provide no OS assistance. We show that it is very difficult and inefficient for the programmer to meet runtime-management constraints by coding them statically. Because hardware or OS solutions are infeasible (even in the near future), we take a compiler approach. We propose a complete compiler solution that automatically inserts the explicit context switch (ctx) instructions provided by the network processor, so that thread execution is better controlled at runtime to meet each thread's constraints. We present two approaches that control a program's runtime behavior with different applicability and overheads. We show that this approach is feasible and that it opens new application domains that require heterogeneous thread programming. Such approaches would become important for multicore processors in general.
Finally, our experiments show that the runtime constraints are enforced nearly ideally with minimal runtime degradation and small code growth.
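
The effect of inserting ctx instructions under a round-robin-only scheduler can be illustrated with a small simulation. The sketch below is a simplified model under stated assumptions: each thread is reduced to a list of run lengths (in cycles) between consecutive ctx instructions, the hardware switches to the next thread in round-robin order only at a ctx, and context switches cost zero cycles. The `insert_ctx` helper is a hypothetical stand-in for a compiler pass; a real pass works on program code and control flow rather than a single cycle count, and threads on real micro-engines also swap out on memory references, which this model ignores.

```python
from collections import deque


def insert_ctx(total_cycles, max_run):
    """Hypothetical compiler pass: split straight-line work of total_cycles
    into segments of at most max_run cycles, i.e., place a ctx instruction
    after every max_run cycles of work."""
    if max_run <= 0:
        raise ValueError("max_run must be positive")
    segments = []
    remaining = total_cycles
    while remaining > 0:
        seg = min(max_run, remaining)
        segments.append(seg)
        remaining -= seg
    return segments


def round_robin(threads):
    """Non-preemptive round-robin: a thread runs until its next ctx, then the
    hardware switches to the next thread in order (zero switch cost assumed).
    Returns, per thread, the longest wait between consecutive segments;
    the first wait is measured from t = 0."""
    queue = deque((tid, deque(segs)) for tid, segs in enumerate(threads) if segs)
    now = 0
    last_end = [0] * len(threads)
    max_wait = [0] * len(threads)
    while queue:
        tid, segs = queue.popleft()
        max_wait[tid] = max(max_wait[tid], now - last_end[tid])
        now += segs.popleft()          # run until the next ctx
        last_end[tid] = now
        if segs:                       # thread still has work: back of the line
            queue.append((tid, segs))
    return max_wait


# Thread 0: a 100-cycle task; thread 1: a latency-sensitive task (3 x 5 cycles).
print(round_robin([[100], [5, 5, 5]]))                # [0, 100]
print(round_robin([insert_ctx(100, 20), [5, 5, 5]]))  # [5, 20]
```

Without any ctx in thread 0, thread 1 waits 100 cycles before it first runs; placing a ctx every 20 cycles bounds that wait at 20 cycles. A smaller `max_run` tightens the bound but inserts more ctx instructions, which in practice means more code growth and switching overhead, the trade-off the abstract's results on runtime degradation and code growth refer to.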

Articles published on Thread Management

Computing on many cores

Mth: Codesigned Hardware/Software Support for Fine Grain Threads

Exploiting Data-Parallelism on Multicore and SMT Systems for Implementing the Fractal Image Compressing Problem

A Novel Wavefront-Based High Parallel Solution for HEVC Encoding

NestedMP: Enabling cache-aware thread mapping for nested parallel shared memory applications

Scheduling a Video Transcoding Server to Save Energy

A resource utilization based instruction fetch policy for SMT processors

Understanding energy behaviors of thread management constructs

Scheduling and thread management with RTEMS

The Gamma-Ray Imaging Framework

Shader Instruction-Based Multi-Thread Management Technique for GPGPU

Parallel implementation of background subtraction algorithms for real-time video processing on a supercomputer platform

Dynamically dispatching speculative threads to improve sequential execution

Multithreading on reconfigurable hardware: An architectural approach

A lightweight and extensible Complex Event Processing system for sense and respond applications

Performance Analysis of Parallel Matrix Multiplication on a Multi-Core Computer Using Java Threads

Compiler-Supported Thread Management for Multithreaded Network Processors

Lightweight Chip Multi-Threading (LCMT): Maximizing Fine-Grained Parallelism On-Chip

Parallel Multithreaded Processing for Data Set Summarization on Multicore CPUs

A moving threads processor architecture MTPA
