ARM CPU Research Articles

Knowledge of environmental depth is essential in multiple robotics and computer vision tasks, in both terrestrial and underwater scenarios. Moreover, the hardware on which this technology runs, generally IoT and embedded devices, is limited in terms of power consumption, so models with a low energy footprint are required. Recent works aim at enabling depth perception from single RGB images using deep architectures, such as convolutional neural networks and vision transformers, which are generally unsuitable for real-time inference on low-power embedded hardware. Moreover, such architectures are trained to estimate depth maps mainly on terrestrial scenarios due to the scarcity of underwater depth data. To this end, we present two lightweight architectures based on optimized MobileNetV3 encoders and a specifically designed decoder to achieve fast inference and accurate estimation on embedded devices, a feasibility study on predicting depth maps in underwater scenarios, and an energy assessment to measure the effective energy consumption during inference. Specifically, we propose the MobileNetV3S75 configuration for inference on the 32-bit ARM CPU and the MobileNetV3LMin configuration for the 8-bit Edge TPU hardware. In underwater settings, the proposed design achieves estimations comparable to state-of-the-art methods with fast inference. Moreover, we show statistically that the architecture of the models has an impact on the energy footprint, in terms of the Watts required by the device during inference. The proposed architectures are therefore a promising approach for real-time monocular depth estimation, offering the best trade-off between inference performance, estimation error and energy consumption, with the aim of improving environment perception for underwater drones, lightweight robots and the Internet of Things.

Visual Odometry (VO) systems are widely used to determine the position and orientation of a robot or camera in an unknown environment. They are deployed on resource-constrained platforms, such as drones and Virtual Reality (VR) or Augmented Reality (AR) headsets. VO systems that harness modern Systems-on-Chip (SoCs) with integrated Field Programmable Gate Arrays (FPGAs) have the potential to improve overall system performance. This paper explores the FPGA acceleration of sparse VO kernels using High-level Synthesis (HLS), as this kind of VO system is designed for use with low-power SoCs. We show that both the computational overhead and the data transfer overhead between the CPU cores of the SoC and the accelerators on the FPGA need to be optimized to obtain better end-to-end performance. This is a result of the additional data movement incurred when using an FPGA accelerator, and also of the sparse computational nature of the kernels involved, which have predictable or random memory access patterns. However, state-of-the-art HLS tools are not yet able to perform the required optimizations automatically, because they usually assume that the kernels to be accelerated have dense computational patterns with regular memory access. In this paper we propose three potentially generic methods to reduce the data transfer between the CPU and the customised hardware kernels on the FPGA: (a) approximation based on domain-specific knowledge, (b) image compression, and (c) on-the-fly computation. We present a case study applying these methods to SVO, a state-of-the-art sparse VO system with a semi-direct front-end. We demonstrate that the proposed methods reduce data transfer overhead to achieve better end-to-end performance, and that they can be applied not only with standard Xilinx HLS tools but also with other state-of-the-art HLS tools, such as HeteroFlow.
Compared to the baseline performance of the original SVO software on an Arm CPU, our proposed methods help the HLS and HeteroFlow designs achieve speedups of 2.4x and 2.14x, respectively, without noticeable accuracy loss. The HLS and HeteroFlow designs also achieve 1.85x and 1.89x improvements in energy efficiency, respectively, on the SoC system used. Compared to the SVO software baseline running on the Intel Xeon CPU, our proposed methods help the HLS and HeteroFlow designs achieve 8.2x and 8.3x improvements in energy efficiency, respectively.

Articles published on ARM CPU

A generic deep learning architecture optimization method for edge device based on start-up latency reduction

Operational experience and R&D results using the Google Cloud for High-Energy Physics in the ATLAS experiment

Pre-Training Clustering Models to Summarize Vietnamese Texts

Porting LHAASO WFCTA simulation job to ARM computing cluster

Analyzing Vectorized Hash Tables across CPU Architectures

Characterizing the Performance of Accelerated Jetson Edge Devices for Training Deep Learning Models

Application Experiences on a GPU-Accelerated Arm-based HPC Testbed.

Lightweight and Energy-Aware Monocular Depth Estimation Models for IoT Embedded Devices: Challenges and Performances in Terrestrial and Underwater Scenarios.

Exploring Sparse Visual Odometry Acceleration With High-Level Synthesis

Fast Computation of RFD-Like Descriptors in Four Orientations

A Lightweight Network Deployed on ARM Devices for Hand Gesture Recognition

Streaming Accelerator Design for Regular Expression on CPU+FPGA Embedded System

Evaluation of gem5 for performance modeling of ARM Cortex-R based embedded SoCs

Accelerating Video Captioning on Heterogeneous System Architectures

A Configurable and Real-Time Multi-Frequency 3D Image Signal Processor for Indirect Time-of-Flight Sensors

GROMACS in the Cloud: A Global Supercomputer to Speed Up Alchemical Drug Design.

EC-ECC: Accelerating Elliptic Curve Cryptography for Edge Computing on Embedded GPU TX2

ALPINE: Analog In-Memory Acceleration with Tight Processor Integration for Deep Learning

Direct Numerical Simulation of Stratified Turbulent Flows and Passive Tracer Transport on HPC Systems: Comparison of CPU Architectures

GSM Based Automatic Irrigation System
