FPGA-Accelerated Analytics: From Single Nodes to Clusters

Abstract

In this monograph, we survey recent research on using reconfigurable hardware accelerators, namely Field Programmable Gate Arrays (FPGAs), to accelerate analytical processing. Such accelerators are being adopted as a way of overcoming the recent stagnation in CPU performance because they can implement algorithms differently from traditional CPUs, breaking traditional trade-offs. As such, it is timely to discuss their benefits in the context of analytical processing, both as an accelerator within a single-node database and as part of distributed data analytics pipelines. We present guidelines for accelerator design in both scenarios, as well as examples of integration within full-fledged relational databases. We do so through the prism of recent research projects that explore how emerging compute-intensive operations in databases can benefit from FPGAs. Finally, we highlight future research challenges in programmability and integration, and cover architectural trends that are propelling the rapid adoption of accelerators in datacenters and the cloud.

Similar Papers
  • Research Article
  • Cite Count: 16
  • 10.1016/j.crad.2017.12.015
Staging nodal metastases in nasopharyngeal carcinoma: which method should be used to measure nodal dimension on MRI?
  • Apr 7, 2018
  • Clinical Radiology
  • Q.-Y Ai + 7 more

  • Research Article
  • Cite Count: 6
  • 10.1093/comjnl/bxq017
Energy-Aware Distributed QR Decomposition on Wireless Sensor Nodes
  • Feb 22, 2010
  • The Computer Journal
  • S Abdelhak + 4 more

Wireless sensor networks (WSNs) are starting to mature into the next generation where they can be used for adaptive filtering and signal processing, breaking away from the current generation of microcontroller applications. The tasks involved, however, are computationally intensive and strain the energy resources of any single computational sensor node. Moreover, most sensor nodes do not have the computational resources to complete many of these tasks repeatedly. Hence, exploring distributed processing on WSNs becomes a necessity to enable such computational load to be processed in real time. In this work, a new distributed QR decomposition algorithm on WSNs is developed and implemented. QR decomposition has prominent applications in adaptive filtering, which is essential for many WSN applications, such as target tracking and beamforming. The contributions of this work can be summarized as follows: (i) developing a new scalable tile-based distributed QR decomposition algorithm, (ii) distributing the least-squares problem based on the proposed distribution of the QR decomposition, (iii) developing resource-aware task allocation and mapping and (iv) developing a simple decentralized transmission scheduling scheme to guarantee efficient operation. This work demonstrates that distributed processing on WSNs paves the way for larger computations beyond the capabilities of a single node. This is accomplished while decreasing the energy per node and increasing the speed of the computation versus the implementation on a single node. The experiments, on a test bed of TelosB sensor nodes, prove that the proposed distributed algorithm enables higher computational capabilities while reducing the energy per node by up to 91.93% and speeding up the computation by up to 79.29% compared with running the QR decomposition on a single node, thus laying the foundation for energy-feasible real-time in-network processing.

  • Research Article
  • Cite Count: 6
  • 10.1007/s11390-013-1334-4
SR-MAC: A Low Latency MAC Protocol for Multi-Packet Transmissions in Wireless Sensor Networks
  • Mar 1, 2013
  • Journal of Computer Science and Technology
  • Hong-Wei Tang + 3 more

Event detection is one of the major applications of wireless sensor networks (WSNs). Most existing medium access control (MAC) protocols are optimized mainly for the situation in which an event generates only one packet on a single sensor node. When an event generates multiple packets on a single node, the performance of these MAC protocols degrades rapidly. In this paper, we present a new synchronous duty-cycle MAC protocol called SR-MAC for event detection applications in which multiple packets are generated on a single node. SR-MAC introduces a new scheduling mechanism that reserves a few time slots during the SLEEP period for the nodes to transmit multiple packets. With this approach, SR-MAC can schedule multiple packets generated by an event on a single node to be forwarded over multiple hops in one operational cycle without collision. We use event delivery latency (EDL) and event delivery ratio (EDR) to measure the event detection capability of the SR-MAC protocol. Detailed ns-2 simulations show that SR-MAC can achieve lower EDL, higher EDR and higher network throughput with guaranteed energy efficiency compared with R-MAC, DW-MAC and PR-MAC.

  • Research Article
  • Cite Count: 80
  • 10.1016/0090-8258(90)90428-n
Prognostic significance of single versus multiple lymph node metastases in cervical carcinoma stage IB
  • Nov 1, 1990
  • Gynecologic Oncology
  • D.J Tinga + 3 more

  • Research Article
  • Cite Count: 19
  • 10.1103/physrevlett.122.046402
Unpaired Weyl Nodes from Long-Ranged Interactions: Fate of Quantum Anomalies.
  • Feb 1, 2019
  • Physical Review Letters
  • Tobias Meng + 1 more

We study the effect of long-ranged interactions on Weyl semimetals. Such interactions can give rise to unpaired Weyl nodes, which we demonstrate by explicitly constructing a system with just a single node, a situation that is fundamentally forbidden by fermion doubling in noninteracting band structures. Adding a magnetic field, we investigate the fate of the chiral anomaly. Remarkably, as long as a system exhibits a single Weyl node in the absence of magnetic fields, arbitrarily weak fields qualitatively restore the lowest Landau level structure of a noninteracting Weyl semimetal. This underlines the universality of the chiral anomaly in the context of Weyl semimetals. We furthermore demonstrate how the topologically protected Fermi-arc surface states are modified by long-ranged interactions.

  • Conference Article
  • Cite Count: 18
  • 10.1109/hpec.2019.8916378
Scalable Inference for Sparse Deep Neural Networks using Kokkos Kernels
  • Sep 1, 2019
  • J Austin Ellis + 1 more

Over the last decade, hardware advances have led to the feasibility of training and inference for very large deep neural networks. Sparsified deep neural networks (DNNs) can greatly reduce memory costs and increase throughput of standard DNNs, if loss of accuracy can be controlled. The IEEE HPEC Sparse Deep Neural Network Graph Challenge serves as a testbed for algorithmic and implementation advances to maximize computational performance of sparse deep neural networks. We base our sparse network for DNNs, KK-SpDNN, on the sparse linear algebra kernels within the Kokkos Kernels library. Using the sparse matrix-matrix multiplication in Kokkos Kernels allows us to reuse a highly optimized kernel. We focus on reducing the single node and multi-node runtimes for 12 sparse networks. We test KK-SpDNN on Intel Skylake and Knights Landing architectures and see 120-500x improvement on single node performance over the serial reference implementation. We run in data-parallel mode with MPI to further speed up network inference, ultimately obtaining an edge processing rate of 1.16e+12 on 20 Skylake nodes. This translates to a 13x speed up on 20 nodes compared to our highly optimized multithreaded implementation on a single Skylake node.

  • Conference Article
  • Cite Count: 6
  • 10.1109/icdcs.2004.1281627
On the confidential auditing of distributed computing systems
  • Jan 1, 2004
  • Y Shen + 3 more

We propose a confidential logging and auditing service for distributed information systems. We propose a cluster-based TTP (trusted third party) architecture for the event log auditing services, so that no single TTP node can have full knowledge of the logs, and thus no single node can misuse the log information without being detected. On the basis of a relaxed form of secure distributed computing paradigms, one can implement a confidential auditing service so that the auditor can retrieve certain aggregated system information, e.g. the number of transactions, the total volume, the event traces, etc., without having to access the full log data. Similar to the peer relationship of routers that provide global network routing services, the mutually supported, mutually monitored cluster TTP architecture allows independent systems to collaborate in network-wide auditing without compromising their private information.

  • Abstract
  • 10.1182/blood-2019-126504
Modulation of the IL-6/STAT3 Signaling Axis in CD4+ T Cells As a Potential Immune Mechanism of Action of Azacytidine in High-Risk Myelodysplastic Syndromes
  • Nov 13, 2019
  • Blood
  • Eleftheria Lamprianidou + 17 more

  • Conference Article
  • Cite Count: 2
  • 10.1109/icecet52533.2021.9698764
Performance analysis of different distribution of Python and TensorFlow to efficiently utilize CPU on HPC Cluster
  • Dec 9, 2021
  • Krishan Gopal Gupta + 3 more

In recent years, Artificial Intelligence (AI) and Deep Learning (DL) research and development (R&D) work has picked up among academia, the research community and industry. Training a Deep Neural Network (DNN) model requires huge amounts of data and computing resources, especially Graphics Processing Unit (GPU) accelerators, along with a DL framework like TensorFlow and a programming language like Python. Different distributions of the framework and programming language are available and optimized for GPU and Central Processing Unit (CPU). Many articles and studies have been published on DNN training on GPUs. However, only a few articles and studies are available on DNN training on CPUs, especially in HPC clusters. This paper presents a performance analysis and comparison of different distributions of Python and TensorFlow to verify which combination runs optimally on the available CPU-only nodes of an HPC cluster. We used the ResNet50 [1], ResNet101 [1] and Inceptionv3 [2] neural network models of tf_cnn_benchmarks [3] for performance comparison. We further tune the best identified software combination using distributed training techniques across single and multiple nodes. We did a performance comparison based on different processors and architectures. We were able to show up to 7x performance improvement using Intel Distribution for TensorFlow on a single node and up to 15.7x speedup on 16 nodes on different CPU architectures.

  • Research Article
  • Cite Count: 2
  • 10.1088/1742-6596/180/1/012040
Communication-optimal iterative methods
  • Jul 1, 2009
  • Journal of Physics: Conference Series
  • J Demmel + 3 more

Data movement, both within the memory system of a single processor node and between multiple nodes in a system, limits the performance of many Krylov subspace methods that solve sparse linear systems and eigenvalue problems. Here, s iterations of algorithms such as CG, GMRES, Lanczos, and Arnoldi perform s sparse matrix-vector multiplications and Ω(s) vector reductions, resulting in a growth of Ω(s) in both single-node and network communication. By reorganizing the sparse matrix kernel to compute a set of matrix-vector products at once and reorganizing the rest of the algorithm accordingly, we can perform s iterations by sending O(log P) messages instead of Ω(s·log P) messages on a parallel machine, and reading the on-node components of the matrix A from DRAM to cache just once on a single node instead of s times. This reduces communication to the minimum possible. We discuss both algorithms and an implementation of GMRES on a single node of an 8-core Intel Clovertown. Our implementations achieve significant speedups over the conventional algorithms.

  • Research Article
  • 10.1145/3729491
Combating Chirp Interference for Multi-target LoRa Localization
  • Jun 9, 2025
  • Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies
  • Qiling Xu + 5 more

The long-range and low-power properties of LoRa facilitate its rapid deployment in many location-based services. However, existing LoRa-based localization techniques assume that the received signal is solely from a single node, without any concurrent transmissions from other LoRa nodes. This is because concurrent transmissions lead to mutual interference that inevitably distorts the estimated channel state information (CSI), resulting in significant localization errors. Although interference caused by concurrent transmissions has been studied and addressed in LoRa communication, none of these methods are effective for LoRa localization. This is because localization relies on distinct features and encounters different challenges compared to LoRa communication. To address this fundamental limitation, we propose CLoc, the first LoRa-based multi-target localization method, which is capable of localizing multiple LoRa nodes simultaneously under concurrent transmissions. Through comprehensive analysis, CLoc classifies the interference into two categories based on the chirp slope, i.e., inter-slope interference (different slopes) and co-slope interference (same slope), and identifies their fundamental impacts on CSI errors. CLoc designs dedicated methods that smartly leverage LoRa chirp characteristics to address CSI distortion caused by inter-slope interference, and tackle CSI ambiguity and errors caused by co-slope interference, thereby enabling accurate CSI estimation. We implement the prototype of CLoc with USRP B210 and commodity LoRa nodes. Evaluations under different settings demonstrate that CLoc achieves median localization errors of 3.3 m in a 293,250 m² outdoor area and 3.5 m in a 6,750 m² indoor area, reducing the localization errors by up to 90.6% compared with the state-of-the-art single LoRa node localization method.

  • Conference Article
  • Cite Count: 8
  • 10.1109/icon.2005.1635636
Addressing some challenges in autonomic monitoring in self-managing networks
  • Jan 1, 2005
  • R Chaparadza + 2 more

Self-monitoring is one of the key expected capabilities of an autonomic system. An autonomic system can be a single network node or the entire network as an entity, having the ability to automatically adjust its behaviour based on the conditions in which the system and its components work. Self-monitoring for the purposes of self-configuration, cooperative event detection by a number of systems in a network, knowledge/information distribution, service-diagnosis and self-protection requires a number of challenges/questions to be addressed. In this paper, we present the concepts behind self-monitoring as a capability of a single node and as a capability of a network that is considered as an entity. We also discuss the challenges and questions to be addressed when designing and deploying self-monitoring mechanisms for a single node and for an entire network, and we present some solutions to these challenges. Because self-monitoring is broad, we limit our focus to self-monitoring applied to inbound/outbound protocol-specific traffic at some point(s). As part of the solution to some of the challenges we try to address, we introduce the concept of on-demand monitoring (ODM) of protocol-specific traffic.

  • Conference Article
  • Cite Count: 19
  • 10.1109/itw54588.2022.9965757
Susceptibility of Age of Gossip to Timestomping
  • Nov 1, 2022
  • Priyanka Kaswan + 1 more

We consider a fully connected network consisting of a source that maintains the current version of a file, n nodes that use asynchronous gossip mechanisms to disseminate fresh information in the network, and an adversary who infects the packets at a target node through data timestamp manipulation, with the intent to replace circulation of fresh packets with outdated packets in the network. We show that a single infected node increases the expected age of a fully connected network from O(log n) to O(n). Further, we show that the optimal behavior for an adversary is to reset the timestamps of all outgoing packets to the current time and of all incoming packets to an outdated time. Additionally, if the adversary allows the infected node to accept a small fraction of incoming packets from the network, then a large network can manage to curb the spread of stale files coming from the infected node and pull the network age back to O(log n). Lastly, we show that if an infected node contacts only a single node instead of all nodes of the network, the system age can still be degraded to O(n). These results show that the fully connected nature of a network can be both a benefit and a detriment for information freshness; full connectivity, while enabling fast dissemination of information, also enables fast dissipation of adversarial inputs.

  • Conference Article
  • Cite Count: 14
  • 10.1109/icassp.2018.8462179
Distributed Large Neural Network with Centralized Equivalence
  • Jan 1, 2018
  • Xinyue Liang + 3 more

In this article, we develop a distributed algorithm for learning a large neural network that is deep and wide. We consider a scenario where the training dataset is not available at a single processing node, but distributed among several nodes. We show that a recently proposed large neural network architecture called progressive learning network (PLN) can be trained in a distributed setup with centralized equivalence. That means we would get the same result if the data were available at a single node. Using a distributed convex optimization method called alternating-direction-method-of-multipliers (ADMM), we perform training of PLN in the distributed setup.

  • Research Article
  • Cite Count: 75
  • 10.1016/s0022-5347(17)49112-9
Implications of Volume of Nodal Metastasis in Patients with Adenocarcinoma of the Prostate
  • Apr 1, 1985
  • The Journal of Urology
  • Joseph A Smith + 1 more
