Tesla V100 GPU Research Articles

Adaptive radiation treatment (ART) for locally advanced pancreatic cancer (LAPC) requires consistently accurate segmentation of the extremely mobile gastrointestinal (GI) organs at risk (OAR), including the stomach, duodenum, and large and small bowel. Also, due to the lack of sufficiently accurate and fast deformable image registration (DIR), accumulated dose to the GI OARs is currently only approximated, further limiting the ability to more precisely adapt treatments. The aim was to develop a 3-D Progressively refined joint Registration-Segmentation (ProRSeg) deep network to deformably align and segment treatment fraction magnetic resonance images (MRIs), and then to evaluate segmentation accuracy, registration consistency, and feasibility for OAR dose accumulation. ProRSeg was trained using five-fold cross-validation with 110 T2-weighted MRI acquired at five treatment fractions from 10 different patients, taking care that scans from the same patient were not placed in both training and testing folds. Segmentation accuracy was measured using the Dice similarity coefficient (DSC) and the Hausdorff distance at the 95th percentile (HD95). Registration consistency was measured using the coefficient of variation (CV) in displacement of OARs. Statistical comparisons to other deep learning and iterative registration methods were done using the Kruskal-Wallis test, followed by pair-wise comparisons with Bonferroni correction applied for multiple testing. Ablation tests and accuracy comparisons against multiple methods were done. Finally, applicability of ProRSeg to segment cone-beam CT (CBCT) scans was evaluated on a publicly available dataset of 80 scans using five-fold cross-validation. ProRSeg processed 3D volumes (128 × 192 × 128) in 3 s on an NVIDIA Tesla V100 GPU. Its segmentations were significantly more accurate ( ) than compared methods, achieving a DSC of 0.94±0.02 for liver, 0.88±0.04 for large bowel, 0.78±0.03 for small bowel and 0.82±0.04 for stomach-duodenum from MRI.
ProRSeg achieved a DSC of 0.72±0.01 for small bowel and 0.76±0.03 for stomach-duodenum from the public CBCT dataset. ProRSeg registrations resulted in the lowest CV in displacement (stomach-duodenum : 0.75%, : 0.73%, and : 0.81%; small bowel : 0.80%, : 0.80%, and : 0.68%; large bowel : 0.71%, : 0.81%, and : 0.75%). ProRSeg-based dose accumulation accounting for intra-fraction (pre-treatment to post-treatment MRI scan) and inter-fraction motion showed that the organ dose constraints were violated in four patients for stomach-duodenum and in three patients for small bowel. Study limitations include the lack of independent testing and of ground truth phantom datasets to measure dose accumulation accuracy. ProRSeg produced more accurate and consistent GI OAR segmentation and DIR of MRI and CBCTs compared to multiple methods. Preliminary results indicate feasibility for OAR dose accumulation using ProRSeg.
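
Two of the evaluation ingredients named in this abstract, the Dice similarity coefficient and the patient-wise separation of cross-validation folds, can be illustrated with a minimal Python sketch. It is not the authors' code: the function names `dice` and `patient_folds`, the flat 0/1 mask representation, and the round-robin assignment of patients to folds are all assumptions made for illustration.

```python
def dice(mask_a, mask_b):
    """Dice similarity coefficient of two binary masks.

    Masks are flat sequences of 0/1 voxel labels of equal length
    (an assumed representation; real volumes would be 3-D arrays).
    """
    if len(mask_a) != len(mask_b):
        raise ValueError("masks must have the same number of voxels")
    overlap = sum(1 for a, b in zip(mask_a, mask_b) if a and b)
    size_sum = sum(1 for a in mask_a if a) + sum(1 for b in mask_b if b)
    # Convention chosen here: two empty masks count as a perfect match.
    return 1.0 if size_sum == 0 else 2.0 * overlap / size_sum


def patient_folds(scan_to_patient, k):
    """Split scans into k folds so that no patient spans two folds.

    scan_to_patient maps a scan id to its patient id. Patients are
    assigned to folds round-robin (an assumption; any grouping that
    keeps a patient's scans together would satisfy the constraint).
    """
    patients = sorted(set(scan_to_patient.values()))
    fold_of = {p: i % k for i, p in enumerate(patients)}
    folds = [[] for _ in range(k)]
    for scan, patient in sorted(scan_to_patient.items()):
        folds[fold_of[patient]].append(scan)
    return folds


if __name__ == "__main__":
    print(dice([1, 1, 0, 0], [1, 0, 0, 0]))
    print(patient_folds({"s1": "A", "s2": "A", "s3": "B", "s4": "C"}, 2))
```

Grouping by patient before splitting is what keeps scans of one patient out of both the training and testing side of a fold; splitting at the scan level would not guarantee that.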

Dynamic graph traversals (DGTs) are currently widely used in many important application domains, especially in this big-data era that urgently demands high-performance graph processing and analysis. Unlike static graph traversals, DGTs in real-world application scenarios require not only fast traversal acceleration itself but also, more importantly, a runtime strategy that can effectively accommodate the ever-evolving nature of the graph structure updates followed by a diverse range of graph traversal algorithms. Because of these special features, state-of-the-art designs on conventional compute-centric architectures (e.g., CPU and GPU) struggle to provide sufficient acceleration for DGT processing due to the dominating irregular memory access patterns in graph traversal algorithms and inefficient platform-specific update mechanisms. In this article, we explore the algorithmic features and runtime requirements of real-world DGTs and identify their unique opportunities for acceleration on the recent Micron Automata Processor (AP), an in-situ memory-centric pattern-matching architecture. These features include the natural mapping from traversal algorithms’ path exploration pattern to classic non-deterministic finite automata processing, AP’s architectural and compilation support for DGTs’ evolving traversal operations, and its inherent hardware fitness. However, despite these benefits, enabling highly efficient DGT execution on AP is non-trivial and faces several major challenges. To tackle them, we propose DynamAP, the first AP framework design that enables fast processing for general DGTs. DynamAP is oblivious to periodic traversal algorithm changes and can address the significant overhead caused by frequent graph updates and AP recompilation through our novel hybrid macro designs and associated efficient updating strategies.
We evaluate DynamAP against the current DGT designs on a CPU, GPU, and AP with a range of widely adopted DGT algorithms and real-world graphs. For a single update request, our DynamAP achieves an average speedup of 21.3x (up to 39.2x) over the state-of-the-art implementation on host-AP architecture, and average speedups of 9.2x (up to 14.7x) and 1.7x (up to 2.8x) over two highly optimized DGT design frameworks on a 64-GB Intel(R) Xeon CPU and a 32-GB NVIDIA Tesla V100 GPU, respectively. DynamAP also maintains high performance and resource utilization for high graph update ratios, and can significantly benefit natural graphs that present a high average vertex degree.
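
The mapping between path exploration and non-deterministic finite automata mentioned in this abstract can be pictured with a small software sketch: a level-synchronous traversal keeps a set of simultaneously active vertices and advances all of them in one step, much as an NFA advances its set of active states. The Python below is only an illustration of that analogy under simple assumptions; the name `traversal_levels` is hypothetical, and the graph update is handled by naively re-running the traversal, which is not DynamAP's hybrid macro design or its update strategy.

```python
def traversal_levels(edges, source):
    """Level-synchronous traversal over a directed edge list.

    The 'active' set plays the role of an NFA's active states: every
    active vertex activates its unvisited successors in the same step.
    Returns a dict mapping each reachable vertex to its hop distance.
    """
    successors = {}
    for u, v in edges:
        successors.setdefault(u, set()).add(v)

    level = {source: 0}
    active = {source}
    depth = 0
    while active:
        depth += 1
        next_active = set()
        for u in active:
            for v in successors.get(u, ()):
                if v not in level:
                    level[v] = depth
                    next_active.add(v)
        active = next_active
    return level


if __name__ == "__main__":
    edges = [(0, 1), (1, 2), (0, 3)]
    print(traversal_levels(edges, 0))
    # A graph update (edge insertion) followed by a fresh traversal.
    edges.append((3, 4))
    print(traversal_levels(edges, 0))
```

Recomputing from scratch after every update is the cost that a dynamic traversal framework tries to avoid; the sketch makes that baseline explicit rather than modeling any optimization.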

Articles published on Tesla V100 GPU

GPU accelerated Staggered Update Procedure (SUP)

Codebase release 1.1 for GAPS

An algorithm to parallelise parton showers on a GPU

Automated reporting of cervical biopsies using artificial intelligence.

IDNet: Information Decomposition Network for Fast Panoptic Segmentation.

CHIP-KNNv2: A Configurable and High-Performance K-Nearest Neighbors Accelerator on HBM-based FPGAs

Accelerating Graph Convolutional Networks Through a PIM-Accelerated Approach

GPU implementation of the discrete unified gas kinetic scheme for low-speed isothermal flows

Layered mixed-precision training: A new training method for large-scale AI models

MPU: Memory-centric SIMT Processor via In-DRAM Near-bank Computing

Progressively refined deep joint registration segmentation (ProRSeg) of gastrointestinal organs at risk: Application to MRI and cone-beam CT.

Efficient parallelization for 3D-3V sparse grid Particle-In-Cell: Single GPU architectures

Heterogeneous acceleration algorithms for shallow cumulus convection scheme over GPU clusters

Acceleration of the data-parallel lower-upper relaxation time-integration method on GPU for an unstructured CFD solver

High-speed implementation of rainbow table method on heterogeneous multi-device architecture

Remarn: A Reconfigurable Multi-threaded Multi-core Accelerator for Recurrent Neural Networks

Segment Linking: A Highly Parallelizable Track Reconstruction Algorithm for HL-LHC

Real-time masked face classification and head pose estimation for RGB facial image via knowledge distillation

DynamAP: Architectural Support for Dynamic Graph Traversal on the Automata Processor

Critique of “MemXCT: Memory-Centric X-Ray CT Reconstruction With Massive Parallelization” by SCC Team From Georgia Tech
