Single Instruction Multiple Data Architecture Research Articles

Massive MIMO (Multiple Input Multiple Output) systems impose significant processing burdens along with strict latency requirements. The combination of large-scale antenna arrays and wide bandwidth requirements for next-generation wireless systems creates an exponential increase in front-end to back-end data. Balancing processing latency and reliability is critical for baseband processing tasks such as QAM detection. While linear detection algorithms have low computational complexity, their use in Massive MIMO scenarios causes heavy degradation in error performance. Nonlinear detection methods such as Maximum Likelihood (ML) and Sphere Decoding have good error performance, but they suffer from high, variable, and uncontrollable computational complexity. For such cases, the K-best QAM detection algorithm can provide the required control over system performance while maintaining near-ML error performance. In this paper, hard-output as well as soft-output K-best QAM detection is implemented on a CPU by utilizing multiple cores combined with vector processing. Similarly, hard-output detection is implemented on a GPU by leveraging the SIMD (Single Instruction, Multiple Data) architecture and warp-based execution model. The processing time per bit and the energy consumption per bit of the CPU and GPU implementations are compared across QAM constellation densities and MIMO array sizes. The GPU implementation shows up to 5× improvement in processing latency per bit and up to 120× improvement in energy consumption per bit over the CPU implementation for typical QAM constellations such as 4-, 16-, and 64-QAM. The GPU implementation also shows up to 125× improvement over the CPU implementation in energy consumption per bit for larger MIMO configurations such as 24 × 24 and 32 × 32. Finally, the soft-output detector is combined with an LDPC (Low-Density Parity Check) decoder to obtain the FER (Frame Error Rate) performance for the CPU implementation. The FER is then combined with the frame processing latency to form a Goodput metric that demonstrates the tradeoff between latency and reliability.
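To make the K-best idea in the abstract above concrete, the following is a minimal pure-Python sketch of hard-output K-best detection. It is not the paper's CPU or GPU implementation. It assumes the channel has already been QR-decomposed, so the detector sees an upper-triangular system z = R x + noise, and the names `qam_constellation` and `k_best_detect` are hypothetical helpers introduced only for illustration.

```python
import heapq
from itertools import product


def qam_constellation(m_side):
    """Square QAM points with odd-integer coordinates (m_side=2 gives 4-QAM)."""
    levels = [2 * i - (m_side - 1) for i in range(m_side)]
    return [complex(re, im) for re, im in product(levels, levels)]


def k_best_detect(R, z, constellation, k):
    """Hard-output K-best search on an upper-triangular system z = R x + noise.

    R is an n x n upper-triangular matrix given as a list of rows, and z is a
    list of n complex values. Returns the symbol vector with the smallest
    accumulated distance among the survivors.
    """
    n = len(z)
    # Each survivor is (accumulated_distance, symbols for layers level+1..n-1).
    survivors = [(0.0, [])]
    for level in range(n - 1, -1, -1):
        candidates = []
        for dist, tail in survivors:
            # Interference contributed by the layers that are already decided.
            interference = sum(
                R[level][level + 1 + j] * s for j, s in enumerate(tail)
            )
            for sym in constellation:
                residual = z[level] - interference - R[level][level] * sym
                candidates.append((dist + abs(residual) ** 2, [sym] + tail))
        # Keep only the K candidates with the smallest partial distance.
        survivors = heapq.nsmallest(k, candidates, key=lambda c: c[0])
    return survivors[0][1]


if __name__ == "__main__":
    R = [[2, 1 + 1j], [0, 3]]
    x = [1 - 1j, -1 + 1j]
    z = [sum(R[i][j] * x[j] for j in range(2)) for i in range(2)]
    print(k_best_detect(R, z, qam_constellation(2), k=2))
```

Each tree level evaluates at most K times the constellation size candidates, so the work per detection is fixed in advance. That bounded, regular workload is what distinguishes K-best from sphere decoding, whose search effort varies with the channel and noise.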

Multimedia applications have become increasingly important in daily computing. These applications are composed of heterogeneous regions of code that mix data-level parallelism (DLP) and instruction-level parallelism (ILP). A standard solution for a multimedia coprocessor integrates single-instruction multiple-data (SIMD) engines into architectures that exploit ILP at compile time, such as very long instruction word (VLIW) and transport triggered architecture (TTA). However, the ILP regions fail to scale with the increased vector length needed to achieve high performance in the DLP regions. Furthermore, the register-to-register nature of SIMD instructions limits current SIMD engines in handling memory alignment, data reorganization, and control flow. Many supporting instructions, such as data permutations, address generation, and loop branches, are required to aid the execution of the real SIMD computation instructions. To mitigate these problems, we propose optimized SIMD engines that combine VLIW or TTA processing with unified scalar and long vector computations as well as efficient SIMD hardware for real computation. Our new architecture is based on TTA and is called the multimedia coprocessor (MCP). This architecture includes the following features: (1) a simple coprocessor structure with 8-way TTA, (2) cost-effective SIMD hardware capable of performing floating-point operations, (3) long vector capabilities built upon existing SIMD hardware, with a single register file and processor data path for both scalar operands and vector elements, and (4) an optimized SIMD architecture that addresses the SIMD limitations. Our experimental evaluations show that MCP can outperform conventional SIMD techniques by an average of 39% and 12% in performance for multimedia kernels and applications, respectively.

Articles published on Single Instruction Multiple Data Architecture

Parallel Implementation of K-Best Quadrature Amplitude Modulation Detection for Massive Multiple Input Multiple Output Systems

Serial and parallel kernelization of Multiple Hitting Set parameterized by the Dilworth number, implemented on the GPU

Vectorizing and distributing number‐theoretic transform to count Goldbach partitions on Arm‐based supercomputers

Parallelized Kalman-Filter-Based Reconstruction of Particle Tracks on Many-Core Architectures with the CMS Detector

Highly Parallel Vector Radix FFT for High Throughput Video Applications

A Work Efficient Parallel Algorithm for Exact Euclidean Distance Transform

CNFET-Based High Throughput SIMD Architecture

Revised simplex algorithm for linear programming on GPUs with CUDA

Integrated Exploration Methodology for Data Interleaving and Data-to-Memory Mapping on SIMD Architectures

A Multi-functional Multi-precision 4D Dot Product Unit with SIMD Architecture

An FPGA-Based SIMD Architecture for Video Compression with Scalable Throughput

The SIMD accelerator for business analytics on the IBM z13

Divergent Branch Threads Compaction for Efficient SIMD Control Flow

A Low-Energy Wide SIMD Architecture with Explicit Datapath

A binary algorithm with low divergence for modular inversion on SIMD architectures

Concurrent warp execution: improving performance of GPU-likely SIMD architecture by increasing resource utilization

Eclat Algorithm for FIM on CPU-GPU co-operative & parallel environment

Implementation of LTE system on an SDR platform using CUDA and UHD

Efficient multimedia coprocessor with enhanced SIMD engines for exploiting ILP and DLP

A flexible algorithm for calculating pair interactions on SIMD architectures
