Tensor Cores Research Articles

In recent years, significant strides in Artificial Intelligence (AI) have led to various practical applications, primarily centered around training and deployment of deep neural networks (DNNs). These applications, however, require considerable computational resources, predominantly reliant on modern Graphics-Processing Units (GPUs). Yet, the quest for larger and faster DNNs has spurred the creation of specialized AI chips and efficient Machine-Learning (ML) software tools like TensorFlow and PyTorch have been developed for striking a balance between usability and performance. Simultaneously, the field of computational neuroscience shares a similar quest for increased computational power to simulate more extensive and detailed brain models, while also keeping usability high. Although GPUs have also entered this field, programming complexity remains high, resulting in cumbersome simulations. Inspired by AI progress, we introduce a workflow for easily accelerating brain simulations using TensorFlow and evaluate the performance of various, cutting-edge AI chips – including the Graphcore Intelligence-Processing Unit (IPU), GroqChip, Nvidia GPU with Tensor Cores, and Google Tensor-Processing Unit (TPU) – when simulating a biologically detailed as well as simpler brain models. Our model simulations explore the architectural tradeoffs of a modern-day CPU and these four AI platforms by varying computational density, memory requirements and floating-point numerical accuracy. Results show that the GroqChip achieves the best performance for small networks, yet is unable to simulate large-scale networks. At the scale of mammalian brains, the GPU, IPU and TPU achieve speedups ranging from 29x to 1,208x times over CPU runtimes. Remarkably, the TPU sets a new record for the largest, real-time simulation of the inferior-olivary nucleus in the brain. Reduced-accuracy floating-point implementations make some simulation results unreliable for brain research, notably for the GroqChip. Consequently, this work underscores the potential of ML libraries for accelerating brain simulations as well as the critical role of AI-chip numerical accuracy for biophysically realistic brain models.

Read full abstract

The demand for computation driven by machine learning and deep learning applications has experienced exponential growth over the past five years (Sevilla et al 2022 2022 International Joint Conference on Neural Networks (IJCNN) (IEEE) pp 1-8), leading to a significant surge in computing hardware products. Meanwhile, this rapid increase has exacerbated the memory wall bottleneck within mainstream Von Neumann architectures (Hennessy and Patterson et al 2011 Computer architecture: a quantitative approach (Elsevier)). For instance, NVIDIA graphical processing units (GPUs) have gained nearly a 200x increase in fp32 computing power, transitioning from P100 to H100 in the last five years (NVIDIA Tesla P100 2023 (www.nvidia.com/en-us/data-center/tesla-p100/); NVIDIA H100 Tensor Core GPU 2023 (www.nvidia.com/en-us/data-center/h100/)), accompanied by a mere 8x scaling in memory bandwidth. Addressing the need to mitigate data movement challenges, process-in-memory designs, especially resistive random-access memory (ReRAM)-based solutions, have emerged as compelling candidates (Verma et al 2019 IEEE Solid-State Circuits Mag. 11 43–55; Sze et al 2017 Proc. IEEE 105 2295–329). However, this shift in hardware design poses distinct challenges at the design phase, given the limitations of existing hardware design tools. Popular design tools today can be used to characterize analog behavior via SPICE tools (PrimeSim HSPICE 2023 (www.synopsys.com/implementation-and-signoff/ams-simulation/primesim-hspice.html)), system and logical behavior using Verilog tools (VCS 2023 (www.synopsys.com/verification/simulation/vcs.html)), and mixed signal behavior through toolbox like CPPSIM (Meninger 2023 (www.cppsim.org/Tutorials/wideband_fracn_tutorial.pdf)). Nonetheless, the design of in-memory computing systems, especially those involving non-CMOS devices, presents a unique need for characterizing mixed-signal computing behavior across a large number of cells within a memory bank. This requirement falls beyond the scope of conventional design tools. In this paper, we bridge this gap by introducing the ReARTSim framework—a GPU-accelerated mixed-signal transient simulator for analyzing ReRAM crossbar array. This tool facilitates the characterization of analog circuit and device behavior on a large scale, while also providing enhanced simulation performance for complex algorithm analysis, sign-off, and verification.

Read full abstract

Tensor Cores Research Articles

Related Topics

Articles published on Tensor Cores

RT-GNN: Accelerating Sparse Graph Neural Networks by Tensor-CUDA Kernel Fusion

GPU Parallel Computing Architectures : Unlocking the Power of Parallelism for High-Performance Applications

120 GOPS Photonic tensor core in thin-film lithium niobate for inference and in situ training

Tensor Network State Algorithms on AI Accelerators.

Acceleration of Tensor-Product Operations with Tensor Cores

Investigating Register Cache Behavior: Implications for CUDA and Tensor Core Workloads on GPUs

Efficient Mixed-Precision Matrix Factorization of the Inverse Overlap Matrix in Electronic Structure Calculations with AI-Hardware and GPUs.

Interactive Volume Visualization via Multi-Resolution Hash Encoding Based Neural Representation.

Partial coherence enhances parallelized photonic computing

Sparse Tensor Decomposition of Multi-Sensory Data for Fault Localization in Rotating Machinery Health Monitoring

Tricking AI chips into simulating the human brain: A detailed performance analysis

Enhancing RAG Systems: A Survey of Optimization Strategies for Performance and Scalability

Generating Multi-Depth 3D Holograms Using a Fully Convolutional Neural Network.

Accelerating range minimum queries with ray tracing cores

POAS: a framework for exploiting accelerator level parallelism in heterogeneous environments

DGEMM on integer matrix multiplication unit

ConvKyber: Unleashing the Power of AI Accelerators for Faster Kyber with Novel Iteration-based Approaches

ReARTSim: an ReRAM ARray Transient Simulator with GPU optimized runtime acceleration

GTCO: Graph and Tensor Co-Design for Transformer-Based Image Recognition on Tensor Cores

Atomic Structure and Dynamics of Unusual and Wide‐Gap Phase‐Change Chalcogenides: A GeTe2 Case

Lead the way for us

Editage

Paperpal

R Discovery

Mind the Graph

Tensor Cores Research Articles

Related Topics

Articles published on Tensor Cores

RT-GNN: Accelerating Sparse Graph Neural Networks by Tensor-CUDA Kernel Fusion

GPU Parallel Computing Architectures : Unlocking the Power of Parallelism for High-Performance Applications

120 GOPS Photonic tensor core in thin-film lithium niobate for inference and in situ training

Tensor Network State Algorithms on AI Accelerators.

Acceleration of Tensor-Product Operations with Tensor Cores

Investigating Register Cache Behavior: Implications for CUDA and Tensor Core Workloads on GPUs

Efficient Mixed-Precision Matrix Factorization of the Inverse Overlap Matrix in Electronic Structure Calculations with AI-Hardware and GPUs.

Interactive Volume Visualization via Multi-Resolution Hash Encoding Based Neural Representation.

Partial coherence enhances parallelized photonic computing

Sparse Tensor Decomposition of Multi-Sensory Data for Fault Localization in Rotating Machinery Health Monitoring

Tricking AI chips into simulating the human brain: A detailed performance analysis

Enhancing RAG Systems: A Survey of Optimization Strategies for Performance and Scalability

Generating Multi-Depth 3D Holograms Using a Fully Convolutional Neural Network.

Accelerating range minimum queries with ray tracing cores

POAS: a framework for exploiting accelerator level parallelism in heterogeneous environments

DGEMM on integer matrix multiplication unit

ConvKyber: Unleashing the Power of AI Accelerators for Faster Kyber with Novel Iteration-based Approaches

ReARTSim: an ReRAM ARray Transient Simulator with GPU optimized runtime acceleration

GTCO: Graph and Tensor Co-Design for Transformer-Based Image Recognition on Tensor Cores

Atomic Structure and Dynamics of Unusual and Wide‐Gap Phase‐Change Chalcogenides: A GeTe2 Case