Instruction-level Parallelism Research Articles

This research addresses the evaluation of CPU rendering performance by introducing RenderBench, a method for constructing benchmark test suites. The method combines CPU microarchitecture features with rendering task characteristics to assess CPU performance across a range of rendering tasks. Following the principles of representativeness and comprehensiveness, the constructed suite covers diverse rendering tasks and scenarios. Through data sampling and analysis, the study focuses on the microarchitecture-independent features of rendering programs: instruction-level parallelism (ILP), instruction mix, branch prediction capability, register dependency distance, data flow stride, and memory reuse distance. The findings reveal significant variation in rendering programs across these features. For instance, rendering programs demonstrate a high level of ILP, with an average ILP256 value of 5.70, surpassing benchmarks such as MiBench and the NAS Parallel Benchmarks. Rendering programs also exhibit distinct characteristics in instruction mix, branch prediction capability, register dependency distance, data flow stride, and memory reuse distance. Applying the RenderBench method yielded a scalable and highly representative benchmark test suite, which facilitates the exploration of CPU performance bottlenecks in rendering tasks and offers guidance for optimizing CPU rendering performance.
The application of ensemble learning models, such as random forest, XGBoost, and ExtraTrees, reveals the significant influence of features such as floating-point computation, memory access patterns, and register usage on the performance of CPU rendering programs. These insights offer guidance for performance optimization and underscore the importance of feature selection and algorithm choice. In summary, the feature importance rankings in this study provide directions for optimizing and enhancing CPU rendering program performance.
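
To make the ILP256 figure quoted above more concrete, the following is a minimal sketch of a windowed ILP estimate. It assumes that ILP256 denotes the instructions-per-cycle rate of an idealised processor limited only by true register dependencies within a 256-instruction window. The abstract does not define the metric, so this reading, the trace format, and the name `windowed_ilp` are all assumptions made for illustration. The sketch also simplifies by using consecutive, non-overlapping windows and unit instruction latency.

```python
from typing import List, Sequence, Tuple

# Hypothetical trace format: (destination register, source registers).
Instr = Tuple[str, Sequence[str]]


def windowed_ilp(trace: List[Instr], window: int = 256) -> float:
    """Estimate ILP as instructions / cycles under an idealised model.

    Assumptions (not taken from the source): unit latency, unlimited
    functional units, only read-after-write register dependencies, and
    consecutive non-overlapping windows of `window` instructions.
    """
    if not trace:
        return 0.0
    total_cycles = 0
    for start in range(0, len(trace), window):
        ready = {}  # register -> cycle in which its latest value is produced
        depth = 0   # critical-path length of this window, in cycles
        for dest, srcs in trace[start:start + window]:
            # An instruction issues one cycle after its slowest producer.
            cycle = 1 + max((ready.get(s, 0) for s in srcs), default=0)
            ready[dest] = cycle
            depth = max(depth, cycle)
        total_cycles += depth
    return len(trace) / total_cycles


if __name__ == "__main__":
    independent = [("r1", []), ("r2", []), ("r3", []), ("r4", [])]
    chain = [("r1", []), ("r2", ["r1"]), ("r3", ["r2"]), ("r4", ["r3"])]
    print(windowed_ilp(independent))  # four independent instructions
    print(windowed_ilp(chain))        # fully serial dependency chain
```

Under this model, a trace of fully independent instructions reaches an ILP equal to the window occupancy, while a serial dependency chain is pinned at 1. A larger window can only expose more parallelism, which is why such metrics are usually reported together with the window size.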

Efficient implementations of software-masked designs are both an important goal and a significant challenge for security against Side Channel Analysis (SCA) attacks. In this paper we discuss the shortfall between generic C implementations and optimized (inline-) assembly versions while providing a large spectrum of efficient and generic masked implementations for any order, and demonstrate cryptographic algorithms and masking gadgets with reference to the state of the art. Our main goal is to show the main performance gaps we can expect between different implementations and to suggest how to harness the underlying hardware efficiently, a daunting task across the various masking orders and masking algorithms (multiplication, refreshing, etc.). This paper focuses on implementations targeting wide-vector bitsliced designs, such as the ISAP algorithm. We explore concrete instances of implementations utilizing processors enabled by wide-vector extensions of the AMD64 Instruction Set Architecture (ISA), namely the SSE2/3/4.1, AVX2 and AVX-512 Single Instruction Multiple Data (SIMD) extensions. These extensions mainly enable efficient memory-level parallelism and provide a gradual reduction in computation time as a function of the level of extension and the hardware support for instruction-level parallelism. For the first time we provide a complete open-source repository of such gadgets tailored for these extensions, for various gadget types and for all orders. We evaluate the disparities between generic high-level language masking implementations for optimized (inline-) assembly and conventional single-execution-path data-path architectures such as the ARM architecture. We underscore the crucial trade-off between storing the state in data memory and keeping it in the register file (RF).
This trade-off relates specifically to masked designs, and is particularly difficult to resolve because it requires inline-assembly manipulations and is not natively supported by compilers. Moreover, as the masking order (d) increases and the state grows, data-memory read/write accesses for state handling must increase, since the RF is simply not large enough. This requires careful optimization, which depends to a considerable extent on the underlying algorithm being implemented. We discuss how full utilization of the SSE extensions is not always possible, i.e., when d is not a power of two, and pinpoint the optimal values of d as well as the very sub-optimal values that aggressively under-utilize the hardware. More generally, this paper presents several different fully generic masked implementations for any order and multiple highly optimized (inline-) assembly instances which are quite generic (for a wide spectrum of ISAs and extensions), and provides very specific implementations targeting specific extensions. The goal is to promote open-source availability, research, improvement and implementations relating to SCA security and masked designs. The building blocks and methodologies provided here are portable and can be easily adapted to other algorithms.
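
As a functional illustration of what a masking gadget of order d computes, the sketch below splits each value into d + 1 Boolean shares and performs a masked AND using the classic ISW multiplication. The abstract does not say which gadgets the repository implements, so ISW is an assumption chosen for illustration, and the names `share`, `unshare`, and `masked_and` are hypothetical. Each 64-bit Python integer stands in for one bitsliced machine word. Python offers no constant-time or leakage guarantees, so this only demonstrates functional correctness, not side-channel resistance.

```python
import secrets
from functools import reduce
from operator import xor

WORD_BITS = 64                  # one bitsliced word; an assumption for this sketch
MASK = (1 << WORD_BITS) - 1


def share(value: int, d: int) -> list:
    """Split `value` into d + 1 Boolean shares whose XOR equals `value`."""
    shares = [secrets.randbits(WORD_BITS) for _ in range(d)]
    shares.append(reduce(xor, shares, value & MASK))
    return shares


def unshare(shares: list) -> int:
    """Recombine shares by XOR (for checking only; never done in masked code)."""
    return reduce(xor, shares, 0)


def masked_and(a: list, b: list) -> list:
    """ISW-style masked AND: output shares XOR to unshare(a) & unshare(b)."""
    n = len(a)
    c = [a[i] & b[i] for i in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            r = secrets.randbits(WORD_BITS)  # fresh randomness per share pair
            c[i] ^= r
            # The bracketing keeps the random word XORed in before the
            # second cross term, as in the ISW construction.
            c[j] ^= (r ^ (a[i] & b[j])) ^ (a[j] & b[i])
    return c


if __name__ == "__main__":
    x, y = 0xDEADBEEF, 0x0F0F0F0F
    for d in (1, 2, 3):
        z = masked_and(share(x, d), share(y, d))
        assert unshare(z) == x & y
    print("masked AND matches plain AND for d = 1, 2, 3")
```

The double loop touches every pair of shares, so both the work and the fresh randomness grow quadratically with d, and all d + 1 shares of every operand must stay live throughout. This is consistent with the register-pressure concern raised in the abstract above.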

Articles published on Instruction-level Parallelism

Ditching the Queue: Optimizing Coprocessor Utilization with Out-of-Order CPUs on Compact Systems on Chip

Improving performance of simultaneous multithreading CPUs using autonomous control of speculative traces

LUAEMA: A Loop Unrolling Approach Extending Memory Accessing for Vector Very-Long-Instruction-Word Digital Signal Processor with Multiple Register Files

Instruction Level Parallelism and Memory Synchronization

High-performance computing: Transitioning from Instruction-Level Parallelism to heterogeneous hybrid architectures

Optimizing VLIW Instruction Scheduling via a Two-Dimensional Constrained Dynamic Programming

Flip: Data-centric Edge CGRA Accelerator

Investigating performance metrics for container-based HPC environments using x86 and OpenPOWER systems

On the interactions between ILP and TLP with hardware transactional memory

RenderBench: The CPU Rendering Benchmark Suite Based on Microarchitecture-Independent Characteristics

Consistency Constraints for Mapping Dataflow Graphs to Hybrid Dataflow/von Neumann Architectures

MaskSIMD-lib: on the performance gap of a generic C optimized assembly and wide vector extensions for masked software with an Ascon-p test case

Software Pipelining for Quantum Loop Programs

Optimal placement of PMU for enhancing power system monitoring

SpecBox: A Label-Based Transparent Speculation Scheme Against Transient Execution Attacks

Simulation of Pipelined MIPS Floating-Point Units using Node-RED

A Survey on Memory-centric Computer Architectures

G-RMOS: GPU-accelerated Riemannian Metric Optimization on Surfaces

Toward Accurate and Fast Summation

An Application Specific Vector Processor for Efficient Massive MIMO Processing

