Cache Prefetching Research Articles

R-trees are traditionally optimized for I/O performance by making the tree node the same size as a disk page. More recently, researchers have proposed cache-conscious R-tree variants that optimize CPU cache performance in main-memory environments: the node is several cache lines wide, and the MBR keys are compressed so that more entries fit in a node. However, the two node sizes differ greatly (tens to hundreds of bytes for cache-optimized nodes versus several to tens of kilobytes for disk-optimized nodes), so disk-optimized R-trees show poor cache performance while cache-optimized R-trees show poor disk I/O performance. In this paper, we propose the PR-tree (Prefetching R-tree), an R-tree optimized for both the CPU cache and disk I/O. For cache performance, PR-tree nodes are wider than a cache line, and the CPU's prefetch instruction is used to reduce the number of cache misses. For I/O performance, the tree nodes are laid out within disk pages so that little space is wasted. We present a detailed analysis of the cache misses incurred by range queries, and we enumerate all reasonable in-page leaf and nonleaf node sizes and in-page tree heights to determine the tree parameters that give the best cache and I/O performance.
Compared with the disk-optimized R-tree, the PR-tree improves cache performance by a factor of 3.5-15.1 for one-by-one insertions, 6.5-15.1 for deletions, 1.3-1.9 for range queries, and 2.7-9.7 for k-nearest neighbor queries. In all experiments, I/O performance degraded only slightly.
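
The node-sizing trade-off in this abstract can be illustrated with a little arithmetic. The sketch below is not the paper's cost model: it only tabulates, for node widths of one to eight cache lines, how many entries fit in a node and how many bytes of a disk page are left unused. The 64-byte cache line, 4096-byte page, 20-byte entry, and 8-byte node header are all assumed values chosen for illustration, and `node_layouts` is a hypothetical helper name.

```python
CACHE_LINE = 64    # bytes per cache line (assumed)
PAGE_SIZE = 4096   # bytes per disk page (assumed)
ENTRY_SIZE = 20    # bytes per entry: 2-D MBR of four 4-byte coordinates + 4-byte pointer (assumed)
NODE_HEADER = 8    # bytes of per-node bookkeeping (assumed)


def node_layouts(max_lines=8):
    """Tabulate fanout and page waste for nodes 1..max_lines cache lines wide.

    Returns tuples of (cache lines, node bytes, entries per node,
    nodes per page, wasted bytes per page).
    """
    rows = []
    for lines in range(1, max_lines + 1):
        node_bytes = lines * CACHE_LINE
        fanout = (node_bytes - NODE_HEADER) // ENTRY_SIZE
        nodes_per_page = PAGE_SIZE // node_bytes
        wasted = PAGE_SIZE - nodes_per_page * node_bytes
        rows.append((lines, node_bytes, fanout, nodes_per_page, wasted))
    return rows


if __name__ == "__main__":
    for row in node_layouts():
        print("lines=%d node=%dB fanout=%d nodes/page=%d waste=%dB" % row)
```

Under these assumptions, wider nodes raise the fanout, but widths that do not divide the page evenly leave part of each page unused, which is why a search over candidate node sizes is needed at all.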

Memory reference instructions that miss in the cache are a major obstacle to fast program execution. Cache prefetching is an effective way to reduce the latency caused by cache misses. However, overly aggressive prefetching pollutes the cache and cancels out the benefit of prefetching. To reduce this pollution, this study presents an active prefetch filtering scheme that consults a filter table to decide dynamically whether each prefetch should be issued. For precise filtering, we propose an evicted-address referencing scheme in which the filter directly compares the current prefetch address with the addresses of earlier unnecessary prefetches stored in the filter table. In addition, a small exclusive prefetch cache, dedicated to prefetched data, is used so that unnecessarily prefetched blocks are evicted more often, which improves the accuracy of dynamic filtering. The exclusive prefetch cache also keeps useful demand data from being displaced by prefetched data, and because direct address referencing improves filtering accuracy, the prefetch cache itself holds mostly useful prefetched data, which greatly reduces the number of cache misses.
Experiments on commonly used general-purpose and multimedia benchmark programs show that the proposed scheme reduces the average cache miss ratio by 13.3% and achieves better filtering accuracy than conventional schemes.
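
The filtering idea in this abstract can be sketched as a small software model. This is a simplified illustration, not the authors' hardware design: the FIFO replacement policy, the table and cache sizes, and the names `PrefetchFilter`, `prefetch`, and `demand` are all assumptions made for the example. A prefetched block that leaves the prefetch cache without ever being referenced has its address recorded in the filter table, and later prefetches to a recorded address are dropped.

```python
from collections import OrderedDict, deque


class PrefetchFilter:
    """Toy model: exclusive prefetch cache plus a table of evicted addresses."""

    def __init__(self, cache_blocks=4, table_entries=8):
        self.cache = OrderedDict()                # block address -> referenced yet?
        self.cache_blocks = cache_blocks          # prefetch cache capacity (assumed)
        self.table = deque(maxlen=table_entries)  # addresses of useless prefetches
        self.filtered = 0                         # number of prefetches dropped

    def prefetch(self, addr):
        """Return True if the prefetch is issued, False if it is filtered out."""
        if addr in self.table:
            self.filtered += 1
            return False
        if addr in self.cache:
            return True                           # already resident, nothing to do
        if len(self.cache) >= self.cache_blocks:
            victim, used = self.cache.popitem(last=False)  # FIFO eviction (assumed)
            if not used:
                self.table.append(victim)         # remember the useless prefetch
        self.cache[addr] = False
        return True

    def demand(self, addr):
        """Return True if a demand access hits in the prefetch cache."""
        if addr in self.cache:
            self.cache[addr] = True               # the prefetch proved useful
            return True
        return False


if __name__ == "__main__":
    pf = PrefetchFilter(cache_blocks=2, table_entries=4)
    pf.prefetch(100)
    pf.prefetch(200)
    pf.demand(200)           # block 200 is used, block 100 never is
    pf.prefetch(300)         # evicts unused block 100 into the filter table
    print(pf.prefetch(100))  # the repeat prefetch of 100 is filtered
```

Keeping prefetched blocks in their own small cache means unused blocks are evicted quickly, so the table learns which addresses to filter sooner; this mirrors the motivation given in the abstract, though the real scheme operates in hardware.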

Articles published on Cache Prefetching

Loop-Based Instruction Prefetching to Reduce the Worst-Case Execution Time

Data Cache Prefetching With Dynamic Adaptation

Optimizing Instruction Prefetching to Improve Worst-Case Performance for Real-Time Applications

Analyzing the worst-case execution time for instruction caches with prefetching

Runtime Engine for Dynamic Profile Guided Stride Prefetching

Server-Based Data Push Architecture for Multi-Processor Environments

Improving hash join performance through prefetching

WCET analysis of instruction caches with prefetching

Code size reduction by compressing repeated instruction sequences

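An Energy-Efficient, High-Performance Data Cache Architecture Using the Tag History of Cache Addresses (캐시 주소의 태그 이력을 활용한 에너지 효율적 고성능 데이터 캐시 구조)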

Power-efficient prefetching for embedded processors

Block-aware instruction set architecture

Prefetch R-tree: A Multidimensional Index Structure Optimized for Disk and CPU Cache (Prefetch R-tree: 디스크와 CPU 캐시에 최적화된 다차원 색인 구조)

Code compression for embedded VLIW processors using variable-to-fixed coding

A combined DMA and application-specific prefetching approach for tackling the memory latency bottleneck

Optimization of lattice QCD codes for the AMD Opteron processor

Dynamic memory optimization using pool allocation and prefetching

A Cache Optimized Multidimensional Index in Disk-Based Environments

A Level 1 Cache Structure for Single-Chip Processors with Limited Chip Size (칩의 크기가 제한된 단일칩 프로세서를 위한 레벨 1 캐시구조)

An Active Prefetch Filtering Scheme Using an Exclusive Prefetch Cache (선인출 전용 캐시를 이용한 적극적 선인출 필터링 기법)
