Abstract

Managing virtual memory is an expensive operation, and it becomes even more expensive on virtualized servers. Processing a TLB miss on a virtualized x86 server requires a two-dimensional page walk that can involve 6x more page-table lookups, and hence 6x more memory references, than a native page walk. Much of the recent research on the subject therefore starts from the assumption that TLB miss processing in virtual environments is significantly more expensive than on native servers. However, we show that with the latest software stack on modern x86 processors, most of these page-table lookups are satisfied by internal paging-structure caches and the L1/L2 data caches, so the actual virtualization overhead of TLB miss processing is a modest fraction of the overall time spent processing TLB misses.

In this paper, we present a detailed accounting of TLB miss processing costs on virtualized x86 servers for an exhaustive set of workloads, in particular two very demanding industry-standard workloads. We show that an implementation of the TPC-C workload that actively uses 475 GB of memory on a 72-CPU Haswell-EP server spends 20% of its time processing TLB misses when the application runs in a VM. Although this is a non-trivial amount, it is only 4.2% higher than the TLB miss processing cost on bare metal. The multi-VM VMmark benchmark spends 12.3% of its time processing TLB misses, but only 4.3% of that can be attributed to virtualization overheads. We show that even for the heaviest workloads, a well-tuned application that uses large pages on a recent OS release with a modern hypervisor running on the latest x86 processors sees only minimal degradation from the additional overhead of two-dimensional page walks on a virtualized server.
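As a point of reference for the 6x figure, the following is an illustrative back-of-the-envelope accounting (our sketch, assuming the standard x86-64 configuration of 4-level paging in both the guest and the host, not a figure taken from the paper's measurements): each of the four guest page-table references is a guest-physical address that must itself be translated by a full host walk, and the final guest-physical address of the data requires one more host walk.

\[
\underbrace{n_g \times (n_h + 1)}_{\text{guest levels, each nested}} + \underbrace{n_h}_{\text{final gPA}}
= 4 \times (4 + 1) + 4 = 24
\quad\text{memory references, vs. } 4 \text{ natively} \;\Rightarrow\; 24/4 = 6\times
\]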
