Reduce Memory Access Research Articles

Purpose: (1) To perform phase space (PS) based source modeling for Tomotherapy and Varian TrueBeam 6 MV Linacs, (2) to examine the accuracy and performance of the ARCHER Monte Carlo code on a heterogeneous computing platform with Many Integrated Core coprocessors (MIC, aka Xeon Phi) and GPUs, and (3) to explore software micro-optimization methods.

Methods: The patient-specific sources of the Tomotherapy and Varian TrueBeam Linacs were modeled using the PS approach. For the helical Tomotherapy case, the PS data were calculated in our previous study (Su et al. 2014, Medical Physics 41(7)). For the single-view Varian TrueBeam case, we derived the PS data analytically from the raw patient-independent PS data in the IAEA database, partial geometry information for the jaw and MLC, and the fluence map. The phantom was generated from DICOM images. The Monte Carlo simulation was performed with the ARCHER-MIC and ARCHER-GPU codes, which were benchmarked against a modified parallel DPM code. Software micro-optimization was conducted systematically and focused on SIMD vectorization of tight for-loops and on data prefetch, with the ultimate goal of increasing 512-bit register utilization and reducing memory access latency.

Results: Dose calculation was performed for two clinical cases, a Tomotherapy-based prostate cancer treatment and a TrueBeam-based left breast treatment. ARCHER was verified against the DPM code. The statistical uncertainty of the dose to the PTV was less than 1%. In double precision, the total wall time of the multithreaded CPU code on an X5650 CPU was 339 seconds for the Tomotherapy case and 131 seconds for the TrueBeam case, while on three 5110P MICs it was reduced to 79 and 59 seconds, respectively. The single-precision GPU code on a K40 GPU took 45 seconds for the Tomotherapy dose calculation.

Conclusion: We have extended ARCHER, the MIC- and GPU-based Monte Carlo dose engine, to Tomotherapy and TrueBeam dose calculations.
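The speedups implied by the wall times reported above can be worked out directly. The short Python sketch below uses only the double-precision CPU and MIC figures from the abstract; the variable and function names are illustrative and do not come from the ARCHER code. The 45-second GPU time is left out because it was measured in single precision and is not directly comparable.

```python
# Wall times in seconds, as reported in the abstract (double precision).
cpu_times = {"Tomotherapy": 339, "TrueBeam": 131}  # multithreaded code, X5650 CPU
mic_times = {"Tomotherapy": 79, "TrueBeam": 59}    # three 5110P MICs


def speedup(baseline_seconds, accelerated_seconds):
    """Ratio of baseline wall time to accelerated wall time."""
    return baseline_seconds / accelerated_seconds


speedups = {case: speedup(cpu_times[case], mic_times[case]) for case in cpu_times}

for case, factor in speedups.items():
    print(f"{case}: {factor:.2f}x")
# Tomotherapy: 4.29x
# TrueBeam: 2.22x
```

These ratios compare one CPU against three coprocessors, so they describe the platform as configured in the study rather than a per-device speedup.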

With the advance of memory technologies, multiple types of memory, such as various kinds of non-volatile memory (NVM), SRAM, and DRAM, allow flexible configurations that trade off performance, energy, and cost. Data allocation is one of the most important tasks for improving the performance of systems with multiple types of memory. Previous studies of the data allocation problem assume worst-case (fixed) data-access frequencies. However, an allocation derived from the worst case usually performs poorly most of the time. In this paper, we model the problem with probabilities and design efficient algorithms that give an optimal-cost data allocation with a guaranteed probability. We propose the DAGP algorithm, which produces a set of feasible data allocation solutions that achieve the minimum access time or cost guaranteed with a given probability. We also propose a polynomial-time algorithm, MCS, to solve this problem. Experiments show that our technique significantly reduces access cost compared with the technique that considers only the worst-case scenario. For example, compared with the optimal result obtained under worst-case assumptions, DAGP reduces memory access cost by 9.92% on average when the guaranteed probability is set to 0.9. Moreover, for 90 percent of cases, memory access time is reduced by 12.47% on average. Compared with a greedy algorithm, DAGP and MCS reduce memory access cost by 78.92% and 44.69% on average, respectively, when the guaranteed probability is set to 0.9.
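To make the idea of a cost "guaranteed with a given probability" concrete, here is a minimal Python sketch. It is not the DAGP or MCS algorithm from the abstract: it simply enumerates every feasible allocation by brute force on a toy instance. The memory types, per-access costs, capacities, and access-count distributions are all invented for illustration, and the sketch assumes that the access counts of different data items are independent.

```python
from itertools import product

# Hypothetical memory types: cost per access and capacity (number of items).
MEMORIES = {
    "SRAM": {"cost": 1, "capacity": 1},
    "DRAM": {"cost": 4, "capacity": 2},
}

# Hypothetical access-count distributions: item -> [(count, probability), ...].
ACCESSES = {
    "a": [(10, 0.9), (100, 0.1)],
    "b": [(40, 0.5), (50, 0.5)],
}


def cost_distribution(allocation):
    """Distribution of total access cost, assuming independent items."""
    dist = {0: 1.0}
    for item, mem in allocation.items():
        unit = MEMORIES[mem]["cost"]
        nxt = {}
        for total, p in dist.items():
            for count, q in ACCESSES[item]:
                key = total + unit * count
                nxt[key] = nxt.get(key, 0.0) + p * q
        dist = nxt
    return dist


def guaranteed_cost(dist, prob):
    """Smallest cost C such that P(total cost <= C) >= prob."""
    cumulative = 0.0
    for cost in sorted(dist):
        cumulative += dist[cost]
        if cumulative >= prob - 1e-12:  # tolerance for floating-point sums
            return cost
    return max(dist)


def feasible_allocations():
    """Yield every item-to-memory mapping that respects the capacities."""
    items = list(ACCESSES)
    for choice in product(MEMORIES, repeat=len(items)):
        if all(choice.count(m) <= MEMORIES[m]["capacity"] for m in MEMORIES):
            yield dict(zip(items, choice))


def best_allocation(prob):
    """Brute-force search for the allocation with the lowest guaranteed cost."""
    return min(
        feasible_allocations(),
        key=lambda alloc: guaranteed_cost(cost_distribution(alloc), prob),
    )


worst_case = best_allocation(1.0)   # {'a': 'SRAM', 'b': 'DRAM'}
relaxed = best_allocation(0.9)      # {'a': 'DRAM', 'b': 'SRAM'}
print(worst_case, guaranteed_cost(cost_distribution(worst_case), 0.9))  # cost 210
print(relaxed, guaranteed_cost(cost_distribution(relaxed), 0.9))        # cost 90
```

In this toy instance the worst-case-optimal allocation puts item `a` in SRAM because of its rare 100-access case, yet with probability 0.9 the alternative allocation costs at most 90 instead of 210. This mirrors the abstract's point that worst-case allocations tend to perform poorly most of the time. Enumeration is exponential in the number of items, which is why polynomial-time methods are needed in practice.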

Articles published on Reduce Memory Access

NUMA-Aware Thread Scheduling for Big Data Transfers over Terabits Network Infrastructure

A Fast MPEG’s CDVS Implementation for GPU Featured in Mobile Devices

Energy-Aware Data Allocation With Hybrid Memory for Mobile Cloud Systems

Scalable Bandwidth Shaping Scheme via Adaptively Managed Parallel Heaps in Manycore-Based Network Processors

Lossless image compression algorithm and hardware architecture for bandwidth reduction of external memory

LA-LLC: Inter-Core Locality-Aware Last-Level Cache to Exploit Many-to-Many Traffic in GPGPUs

TU-AB-BRC-10: Modeling of Radiotherapy Linac Source Terms Using ARCHER Monte Carlo Code: Performance Comparison of GPU and MIC Computing Accelerators

Memory-efficient table look-up optimized algorithm for context-based adaptive variable length decoding in H.264/advanced video coding

Instruction scheduling heuristic for an efficient FFT in VLIW processors with balanced resource usage

Design and Implementation of an Optical Flow Estimator for Moving Object Detection in Advanced Driver Assistance Systems

Data Allocation for Hybrid Memory With Genetic Algorithm

Architecture and data migration methodology for L1 cache design with hybrid SRAM and volatile STT-RAM configuration

Hybrid Main Memory for High Bandwidth Multi-Core System

Multiple clone row DRAM

A Vocabulary Forest Object Matching Processor With 2.07 M-Vector/s Throughput and 13.3 nJ/Vector Per-Vector Energy for Full-HD 60 fps Video Object Recognition

Data Allocation with Minimum Cost under Guaranteed Probability for Multiple Types of Memories

Implementation of Multi-GPU Based Lattice Boltzmann Method for Flow Through Porous Media

Fast bitwise pattern-matching algorithm for DNA sequences on modern hardware

Selective Cache Line Replication Scheme in Shared Last Level Cache

An Overview of H.264 Hardware Encoder Architectures Including Low-Power Features
