Machine Learning Workloads Research Articles

Processing-in-Memory (PIM) has been widely explored for accelerating data-intensive machine learning computation, which mainly consists of general matrix multiplication (GEMM), because it mitigates the burden of data movement and exploits ultra-high memory parallelism. Both mainstream types of PIM, analog and digital, have been used to accelerate machine learning workloads in numerous prior works. Currently, digital PIM is increasingly favored because of its broader computing support and its avoidance of errors caused by intrinsic non-idealities, e.g., process variation. Nevertheless, it still lacks optimizations that account for the characteristics of GEMM computation, including a more efficient data layout and scheduling, and the ability to handle the sparsity of activations at the bit level. To boost the performance and efficiency of digital SRAM PIM, we propose an architecture called VSPIM that performs computation in a bit-serial fashion, with unique support for a vector-scalar computing pattern. The novelties of VSPIM can be summarized as follows: 1) it supports bit-serial scalar-vector computing via parallel bit-broadcasting; 2) it refines the GEMM mapping strategy and computing pattern to enhance performance and efficiency; 3) powered by the introduced scalar-vector operation, it leverages the bit-sparsity of activations to halt unnecessary computation, maximizing efficiency and throughput. Our comprehensive evaluation on multiple representative neural networks shows that, compared to the state-of-the-art SRAM-based digital PIM design (Neural Cache), VSPIM boosts performance and energy efficiency by up to 8.87× and 4.81×, respectively, with negligible area overhead.
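The scalar-vector pattern and bit-level sparsity described in this abstract can be illustrated with a small software model. This is only a sketch of the general idea, not the VSPIM hardware or its mapping strategy: the function names `bit_serial_scalar_vector_mac` and `gemv_bit_serial` are hypothetical, activations are assumed to be unsigned integers of a fixed bit width, and the list comprehension stands in for the parallel lanes of a memory array.

```python
def bit_serial_scalar_vector_mac(acc, scalar, vector, bits=8):
    """Add scalar * vector into acc, processing one scalar bit per step.

    Assumption: `scalar` is an unsigned activation that fits in `bits` bits.
    Returns the updated accumulator and the number of steps actually executed.
    """
    steps = 0
    for b in range(bits):
        if (scalar >> b) & 1 == 0:
            # Bit-level sparsity: a zero bit contributes nothing, so skip it.
            continue
        # "Broadcast" bit b to every lane: each lane adds its weight shifted by b.
        acc = [a + (w << b) for a, w in zip(acc, vector)]
        steps += 1
    return acc, steps


def gemv_bit_serial(activations, weights, bits=8):
    """Compute activations^T * weights as a sequence of scalar-vector operations.

    `weights` is a list of rows, one row per activation (shape K x N).
    """
    acc = [0] * len(weights[0])
    total_steps = 0
    for a, row in zip(activations, weights):
        acc, steps = bit_serial_scalar_vector_mac(acc, a, row, bits)
        total_steps += steps
    return acc, total_steps


if __name__ == "__main__":
    acts = [3, 0, 5]                      # hypothetical 8-bit activations
    wts = [[1, 2], [3, 4], [5, 6]]        # hypothetical weights
    out, steps = gemv_bit_serial(acts, wts)
    print(out, steps, "of", len(acts) * 8, "dense bit steps")
```

In this model the number of executed steps equals the number of set bits across all activations, which is why sparse activations (many zero bits) reduce the work compared with a dense bit-serial schedule.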

Extreme edge devices, or Internet-of-Things (IoT) nodes, require both ultra-low-power (ULP) always-on (AON) processing and the ability to perform on-demand sampling and processing. Moreover, supporting IoT applications such as voice recognition and machine monitoring requires the ability to execute a wide range of machine learning (ML) workloads. This poses a hardware (HW) design challenge: building flexible processors that operate in the ULP regime. This article presents TinyVers, a tiny versatile ULP ML system-on-chip (SoC) that enables enhanced intelligence at the extreme edge. TinyVers exploits dataflow reconfiguration to enable multi-modal support, and aggressive on-chip power management for duty cycling to enable smart sensing applications. The SoC combines a reduced instruction set computer-V (RISC-V) host processor, a 17 tera-operations-per-second-per-watt (TOPS/W) dataflow-reconfigurable ML accelerator, a 1.7 µW deep-sleep wake-up controller (WuC), and an embedded magnetoresistive random access memory (eMRAM) for boot code and ML parameter retention. The SoC can perform up to 17.6 giga-operations per second (GOPS) while consuming between 1.7 µW and 20 mW. Multiple ML workloads targeting diverse applications are mapped onto the SoC to showcase its flexibility and efficiency. All the models achieve 1–2 TOPS/W of energy efficiency with a power consumption below 230 µW in continuous operation. In a duty-cycling use case for machine monitoring, this power is reduced to below 10 µW.
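The effect of duty cycling on average power can be sketched with a simple two-state model. This is an illustrative calculation only, not the measurement method of the article: the helper names and the 3% duty cycle are hypothetical, the active power uses the abstract's 230 µW upper bound, the sleep power uses the 1.7 µW deep-sleep figure, and wake-up transition energy is ignored.

```python
def average_power_uw(active_uw, sleep_uw, duty):
    """Time-weighted average power for a two-state (active/sleep) duty cycle."""
    if not 0.0 <= duty <= 1.0:
        raise ValueError("duty must be a fraction between 0 and 1")
    return duty * active_uw + (1.0 - duty) * sleep_uw


def max_duty_for_budget(active_uw, sleep_uw, budget_uw):
    """Largest active fraction that keeps the average power within budget_uw."""
    if active_uw <= sleep_uw:
        raise ValueError("active power must exceed sleep power")
    duty = (budget_uw - sleep_uw) / (active_uw - sleep_uw)
    return min(1.0, max(0.0, duty))


if __name__ == "__main__":
    ACTIVE_UW = 230.0   # upper bound for continuous operation (from the abstract)
    SLEEP_UW = 1.7      # deep-sleep power (from the abstract)
    DUTY = 0.03         # hypothetical: active 3% of the time
    print(average_power_uw(ACTIVE_UW, SLEEP_UW, DUTY))
    print(max_duty_for_budget(ACTIVE_UW, SLEEP_UW, 10.0))
```

Under these assumptions, the average stays below 10 µW only when the chip is active for a few percent of the time, which is consistent with the sleep state dominating a duty-cycled workload.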

Articles published on Machine Learning Workloads

VSPIM: SRAM Processing-in-Memory DNN Acceleration via Vector-Scalar Operations

A Better Sense Amplifier Improves the Resilience in Compute-In-Memory and Row Hammer

Bendable non-silicon RISC-V microprocessor.

S-Tune: SOT-MTJ manufacturing parameters tuning for securing the next generation of computing

A Comprehensive Review of Processing-in-Memory Architectures for Deep Neural Networks

Flexible Deployment of Machine Learning Inference Pipelines in the Cloud–Edge–IoT Continuum

Ferroelectric capacitors and field-effect transistors as in-memory computing elements for machine learning workloads

Cost-effective Cloud Architectures for Large-scale Machine Learning Workloads

Carbon Footprint Reduction for Sustainable Data Centers in Real-Time

Scalable and Efficient Orchestration of Machine Learning Workloads on DSPs with Multi-level Memory Architecture

An Empirical Evaluation of Columnar Storage Formats

SAMBA: Sparsity Aware In-Memory Computing Based Machine Learning Accelerator

TinyVers: A Tiny Versatile System-on-Chip With State-Retentive eMRAM for ML Inference at the Extreme Edge

TPCx-AI - An Industry Standard Benchmark for Artificial Intelligence and Machine Learning Systems

Characterization of Timing-based Software Side-channel Attacks and Mitigations on Network-on-Chip Hardware

Technical Perspective: Conjunctive Queries with Comparisons

Cerebras Architecture Deep Dive: First Look Inside the Hardware/Software Co-Design for Deep Learning

Low-Rank Gradient Descent for Memory-Efficient Training of Deep In-Memory Arrays

Deep Learning-Based Job Placement in Distributed Machine Learning Clusters With Heterogeneous Workloads

Dynamic GPU power capping with online performance tracing for energy efficient GPU computing using DEPO tool
