High-end CPU Research Articles

This article addresses the security and efficiency of software implementations of symmetric block ciphers. For implementations on low-end CPUs (8/16/32-bit microcontrollers), the main requirement is increased resistance to power analysis attacks. For implementations on high-end CPUs (x86, ARM Cortex-A), the main requirement is eliminating vulnerability to timing and cache attacks. The authors use a bitslice approach to implement block ciphers securely; its potential advantages are high speed and low demand for computing resources. However, known bitslicing methods have a significant limitation: they work only with deterministic S-Boxes or with arbitrary S-Boxes of smaller sizes. The paper proposes a new heuristic method for the bitsliced representation of cryptographic 8×8 S-Boxes that contain randomly generated values, which cannot be described by algebraic expressions. The method decomposes the truth table that describes the S-Box into two parts. One part of the table forms logical masks, and the other is split into bit vectors. An exhaustive search is used to find a logical description of these vectors. Once descriptions of all vectors have been found, the two parts of the table are combined into one using logical operations. The method targets software implementation in the logical basis {AND, OR, XOR, NOT} and minimizes arbitrary 8×8 S-Boxes in that basis. It can be implemented with standard logical instructions on any 8/16/32/64-bit processor. Logical SIMD instructions from the SSE, AVX, and AVX-512 extensions of x86-64 processors can also be used, which gives high performance because of the long registers.
The authors developed software that searches for bitsliced representations of a given S-Box and automatically generates C++ code for it based on SSE, AVX, and AVX-512 instructions. The effectiveness of the method was investigated on the S-Boxes of known block ciphers, in particular the Ukrainian encryption standard "Kalyna". The developed algorithm was found to require almost half as many gates for the bitsliced description of an arbitrary S-Box as the best known algorithm (370 gates versus 680). For ciphers that use two or four S-Box tables, joint minimization can bring the count down to 330 or 300 gates per table, respectively.

Keywords: bitslicing; S-Box; logical minimization; SIMD; x86-64 CPU; software implementation; block ciphers.
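To make the bitslice idea in the abstract above concrete, here is a minimal sketch in C++. It is not the authors' method or their generated code: the 3-bit S-box `kToySbox`, the `Slices` layout, and all function names are hypothetical and were chosen only because this toy table has a short description in the basis {AND, XOR, NOT}. The sketch evaluates 64 S-box lookups at once using plain 64-bit logical instructions and no data-dependent memory access, which is the property that matters against timing and cache attacks.

```cpp
#include <array>
#include <cstdint>

// Hypothetical toy 3-bit S-box, picked only because it has a short logical
// description; it is not taken from Kalyna or any cipher named in the abstract.
constexpr std::array<std::uint8_t, 8> kToySbox = {0, 3, 6, 1, 5, 4, 2, 7};

// Bitsliced state: bit j of slice s[i] holds bit i of the j-th input value.
using Slices = std::array<std::uint64_t, 3>;

// Transpose 64 three-bit values into three 64-bit slices.
Slices pack(const std::array<std::uint8_t, 64>& values) {
    Slices s{};  // zero-initialized
    for (int j = 0; j < 64; ++j)
        for (int i = 0; i < 3; ++i)
            s[i] |= static_cast<std::uint64_t>((values[j] >> i) & 1u) << j;
    return s;
}

// Inverse of pack(): rebuild the 64 three-bit values from the slices.
std::array<std::uint8_t, 64> unpack(const Slices& s) {
    std::array<std::uint8_t, 64> values{};
    for (int j = 0; j < 64; ++j)
        for (int i = 0; i < 3; ++i)
            values[j] |= static_cast<std::uint8_t>(((s[i] >> j) & 1u) << i);
    return values;
}

// Logical description of kToySbox: y_i = x_i XOR (NOT x_{i+1} AND x_{i+2}),
// indices taken modulo 3. Each operator acts on all 64 inputs in parallel.
Slices sbox_bitsliced(const Slices& x) {
    return {x[0] ^ (~x[1] & x[2]),
            x[1] ^ (~x[2] & x[0]),
            x[2] ^ (~x[0] & x[1])};
}

// Cross-check the bitsliced circuit against the lookup table.
bool bitsliced_matches_table() {
    std::array<std::uint8_t, 64> in{};
    for (int j = 0; j < 64; ++j) in[j] = static_cast<std::uint8_t>(j % 8);
    const auto out = unpack(sbox_bitsliced(pack(in)));
    for (int j = 0; j < 64; ++j)
        if (out[j] != kToySbox[in[j]]) return false;
    return true;
}
```

This toy circuit uses nine logical operations for a 3-bit table. An 8×8 S-box with random contents has no such compact formula, which is why the abstract reports gate counts in the hundreds and relies on a heuristic search. Replacing `std::uint64_t` with a 128-, 256-, or 512-bit SIMD register type would widen the same circuit to more parallel lookups.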

One of the main characteristics of High-performance Computing (HPC) applications is that they are becoming increasingly demanding in both performance and power, pushing HPC systems to their limits. Existing HPC systems have not yet reached exascale performance, mainly because of power limitations. Extrapolating from today’s top HPC systems, about 100–200 MW would be required to sustain an exaflop level of performance. A promising solution for tackling power limitations is the deployment of energy-efficient reconfigurable resources, in the form of Field-programmable Gate Arrays (FPGAs), tightly integrated with conventional CPUs. However, current FPGA tools and programming environments are optimized for accelerating a single application, or even a single task, on a single FPGA device. In this work, we present UNILOGIC (Unified Logic), a novel HPC-tailored parallel architecture that efficiently incorporates FPGAs. UNILOGIC adopts the Partitioned Global Address Space (PGAS) model and extends it to include hardware accelerators, i.e., tasks implemented on the reconfigurable resources. The main advantages of UNILOGIC are that (i) the hardware accelerators can be accessed directly by any processor in the system, and (ii) the hardware accelerators can access any memory location in the system. In this way, the proposed architecture offers a unified environment where all the reconfigurable resources can be seamlessly used by any processor or operating system. The UNILOGIC architecture also provides hardware virtualization of the reconfigurable logic so that the hardware accelerators can be shared among multiple applications or tasks. The FPGA layer of the architecture is implemented by splitting its reconfigurable resources into (i) a static partition, which provides the PGAS-related communication infrastructure, and (ii) fixed-size, dynamically reconfigurable slots that can be programmed and accessed independently or combined to support both fine-grain and coarse-grain reconfiguration.
Finally, the UNILOGIC architecture has been evaluated on a custom prototype that consists of two 1U chassis, each of which includes eight interconnected daughter boards, called Quad-FPGA Daughter Boards (QFDBs). Each QFDB supports four tightly coupled Xilinx Zynq Ultrascale+ MPSoCs as well as 64 Gigabytes of DDR4 memory, so the prototype features a total of 64 Zynq MPSoCs and 1 Terabyte of memory. We tuned and evaluated the UNILOGIC prototype using both low-level (baremetal) performance tests and two popular real-world HPC applications, one compute-intensive and one data-intensive. Our evaluation shows that UNILOGIC is 2.5 to 400 times faster and 46 to 300 times more energy efficient than conventional parallel systems that use only high-end CPUs, and that it outperforms GPUs by a factor of 3 to 6 in time to solution and 10 to 20 in energy to solution.

Articles published on High-end CPU

Parallel Implementation of Lightweight Secure Hash Algorithm on CPU and GPU Environments

The discovery of a third breakdown: phenomenon, characterization and applications

Homomorphic Encryption on GPU

LOCATOR: Low-power ORB accelerator for autonomous cars

HXDP

Accelerating kNN search in high dimensional datasets on FPGA by reducing external memory access

A Heuristic Method for Bitsliced Representation of Randomly Generated 8×8 Cryptographic S-Boxes

A Method for Accelerating the Inference Process of FPGA-based LSTM for Biometric Systems

A Novel FPGA-Based Intent Recognition System Utilizing Deep Recurrent Neural Networks

High-throughput, accurate Monte Carlo simulation on CPU hardware for PET applications

The SPEEDY Family of Block Ciphers

SKT

Ascon v1.2: Lightweight Authenticated Encryption and Hashing

SkePU 3: Portable High-Level Programming of Heterogeneous Systems and HPC Clusters

PolyDL

Exploring Means to Enhance the Efficiency of GPU Bitmap Index Query Processing

PipeArch

UNILOGIC

Cost-optimized heterogeneous FPGA architecture for non-iterative hologram generation

A 7.3 M Output Non-Zeros/J, 11.7 M Output Non-Zeros/GB Reconfigurable Sparse Matrix–Matrix Multiplication Accelerator
