Deep Learning Accelerators Research Articles

With the proliferation of low-cost sensors and the Internet of Things, the rate of producing data far exceeds the compute and storage capabilities of today’s infrastructure. Much of this data takes the form of time series, and in response, there has been increasing interest in the creation of time series archives in the past decade, along with the development and deployment of novel analysis methods to process the data. The general strategy has been to apply a plurality of similarity search mechanisms to various subsets and subsequences of time series data to identify repeated patterns and anomalies; however, the computational demands of these approaches render them incompatible with today’s power-constrained embedded CPUs. To address this challenge, we present FA-LAMP, an FPGA-accelerated implementation of the Learned Approximate Matrix Profile (LAMP) algorithm, which predicts the correlation between streaming data sampled in real-time and a representative time series dataset used for training. FA-LAMP lends itself as a real-time solution for time series analysis problems such as classification. We present the implementation of FA-LAMP on both edge- and cloud-based prototypes. On the edge devices, FA-LAMP integrates accelerated computation as close as possible to IoT sensors, thereby eliminating the need to transmit and store data in the cloud for posterior analysis. On the cloud-based accelerators, FA-LAMP can execute multiple LAMP models on the same board, allowing simultaneous processing of incoming data from multiple data sources across a network. LAMP employs a Convolutional Neural Network (CNN) for prediction. This work investigates the challenges and limitations of deploying CNNs on FPGAs using the Xilinx Deep Learning Processor Unit (DPU) and the Vitis AI development environment. We expose several technical limitations of the DPU, while providing a mechanism to overcome them by attaching custom IP block accelerators to the architecture.
We evaluate FA-LAMP using a low-cost Xilinx Ultra96-V2 FPGA as well as a cloud-based Xilinx Alveo U280 accelerator card and measure their performance against a prototypical LAMP deployment running on a Raspberry Pi 3, an Edge TPU, a GPU, a desktop CPU, and a server-class CPU. In the edge scenario, the Ultra96-V2 FPGA improved performance and energy consumption compared to the Raspberry Pi; in the cloud scenario, the server CPU and GPU outperformed the Alveo U280 accelerator card, while the desktop CPU achieved comparable performance; however, the Alveo card offered an order of magnitude lower energy consumption compared to the other four platforms. Our implementation is publicly available at https://github.com/aminiok1/lamp-alveo.

Hardware acceleration of Artificial Intelligence (AI) workloads has gained widespread popularity with its potential to deliver unprecedented performance and efficiency. An important challenge remains in how AI accelerators are programmed to sustain high utilization without impacting end-user productivity. Prior software optimizations start with an input graph and focus on node-level optimizations, viz. dataflows and hierarchical tiling, and graph-level optimizations such as operation fusion. However, little effort has been devoted to inter-node on-chip scratchpad memory (SPM) management in Deep Learning (DL) accelerators, whose significance is bolstered by the recent trends in complex network topologies and the emergence of eager execution in DL frameworks. We characterize and show that there exists up to a 5.2× performance gap in DL inference to be bridged using SPM management and propose OnSRAM, a novel SPM management framework integrated with the compiler runtime of a DL accelerator. We develop two variants, viz. OnSRAM-Static, which works on static graphs to identify data structures that can be lucratively held on-chip based on their size, liveness and significance, and OnSRAM-Eager, which targets an eager execution model (no graph) and uses a history-based speculative scheme to hold/discard data structures. We integrate OnSRAM with TensorFlow and analyze it on multiple accelerator configurations. Across a suite of 12 image, object, and language networks, on a 3 TFLOP system with a 2 MB SPM and 32 GBps external memory bandwidth, OnSRAM-Static and OnSRAM-Eager achieve 1.02–4.8× and 1.02–3.1× reductions in inference latency (batch size of 1), respectively, over a baseline with no SPM management. In terms of energy savings, we observe average reductions of 1.51× (up to 4.1×) and 1.23× (up to 2.9×) for the static and eager execution scenarios, respectively.
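To give a feel for why off-chip traffic matters on the configuration quoted above, the following is a rough back-of-the-envelope sketch and not part of OnSRAM itself. It assumes "3 TFLOP" means a peak of 3e12 FLOP/s, "32 GBps" means 32e9 bytes/s, and "2 MB" means 2 MiB; the function names are hypothetical and chosen only for illustration.

```python
# Back-of-the-envelope machine balance for the system quoted in the abstract.
# All three constants are readings of the abstract's figures (assumptions).
PEAK_FLOPS = 3e12            # "3 TFLOP system", read as 3e12 FLOP/s
EXT_BW_BYTES = 32e9          # "32 GBps", read as 32e9 bytes/s
SPM_BYTES = 2 * 1024 ** 2    # "2 MB SPM", read as 2 MiB


def machine_balance(peak_flops: float, bandwidth: float) -> float:
    """FLOPs the accelerator can execute per byte fetched from external memory."""
    return peak_flops / bandwidth


def transfer_time(num_bytes: float, bandwidth: float) -> float:
    """Seconds needed to move num_bytes over a link of the given bandwidth."""
    return num_bytes / bandwidth


balance = machine_balance(PEAK_FLOPS, EXT_BW_BYTES)
refill_s = transfer_time(SPM_BYTES, EXT_BW_BYTES)

print(f"machine balance: {balance:.2f} FLOP/byte")
print(f"time to refill the SPM once: {refill_s * 1e6:.1f} microseconds")
```

Under these assumptions, an operation needs on the order of 90 or more FLOPs per externally fetched byte to keep the compute units busy, which suggests why keeping reused tensors on-chip between graph nodes could pay off.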

Articles published on Deep Learning Accelerators

A Novel Mixed Precision Distributed TPU GAN for Accelerated Learning Curve

In-Memory Computing for Machine Learning and Deep Learning

DeepEdgeSoC: End-to-end deep learning framework for edge IoT devices

FPGA-based Acceleration of Time Series Similarity Prediction: From Cloud to Edge

DLA-H: A Deep Learning Accelerator for Histopathologic Image Classification

Benchmarking edge computing devices for grape bunches and trunks detection using accelerated object detection single shot multibox deep learning models

Feasibility Analysis and Implementation of Adaptive Dynamic Reconfiguration of CNN Accelerators

Compute-in-Memory Technologies and Architectures for Deep Learning Workloads

AIDA: Associative In-Memory Deep Learning Accelerator

Architecting Decentralization and Customizability in DNN Accelerators for Hardware Defect Adaptation

OnSRAM: Efficient Inter-Node On-Chip Scratchpad Management in Deep Learning Accelerators

HyCA: A Hybrid Computing Architecture for Fault-Tolerant Deep Learning

An 8.9–71.3 TOPS/W Deep Learning Accelerator for Arbitrarily Quantized Neural Networks

Topologically Protected All‐Optical Memory (Adv. Electron. Mater. 10/2022)

A Reconfigurable Convolution-in-Pixel CMOS Image Sensor Architecture

A Fine-Grained Modeling Approach for Systolic Array-Based Accelerator

LiteCON: An All-photonic Neuromorphic Accelerator for Energy-efficient Deep Learning

An Efficient Deep Learning Accelerator Architecture for Compressed Video Analysis

Int-Monitor: a model triggered hardware trojan in deep learning accelerators

Higher order neural processing with input-adaptive dynamic weights on MoS2 memtransistor crossbars
