Abstract

More and more modern processors support non-contiguous SIMD data accesses. However, translating such instructions has been overlooked in the Dynamic Binary Translation (DBT) area. For example, in the popular QEMU dynamic binary translator, guest memory instructions with strides are emulated by sequences of scalar instructions, leaving significant room for performance improvement when SIMD instructions are available on the host machine. Structured loads/stores, such as VLDn/VSTn in ARM NEON, are one type of strided SIMD data access instruction. They are widely used in signal processing, multimedia, mathematical, and 2D matrix transposition applications. Efficient translation of such structured loads/stores is therefore a critical issue when migrating ARM executables to other ISAs. It is also quite challenging, since not only is the translation of structured loads/stores non-trivial, but the mapping of SIMD registers between guest and host is also complicated. This paper presents the design of translating structured loads/stores in DBT, including target code generation, efficient SIMD register mapping, and optimizations for reducing data permutations. The proposed register mapping mechanisms and optimizations are not limited to structured loads/stores; they can be extended to handle ordinary SIMD instructions. This paper also evaluates how different factors affect translation performance and code size, including guest SIMD register length, strides, and use cases for structured loads. On a set of OpenCV benchmarks, our QEMU-based system achieves a maximum speedup of 5.03x, with an average improvement of 2.87x. On a set of BLAS benchmarks, it obtains a maximum speedup of 2.22x and an average improvement of 1.78x.
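
As an illustration of the gap the abstract describes (not taken from the paper itself), the following minimal C sketch contrasts the scalar de-interleaving loop a DBT might emit for a guest VLD3 with the single NEON structured load that performs the same stride-3 access; the function names are hypothetical.

```c
/* Hypothetical sketch: stride-3 de-interleaving of packed RGB pixels. */
#include <arm_neon.h>
#include <stdint.h>

/* Scalar emulation: one element copied per iteration, roughly how a DBT
 * expands a guest VLD3 when no host SIMD mapping is used. */
static void deinterleave_rgb_scalar(const uint8_t *src,
                                    uint8_t *r, uint8_t *g, uint8_t *b)
{
    for (int i = 0; i < 16; i++) {
        r[i] = src[3 * i + 0];
        g[i] = src[3 * i + 1];
        b[i] = src[3 * i + 2];
    }
}

/* Structured load: vld3q_u8 reads 48 interleaved bytes and splits them
 * into three 16-byte vectors with a single VLD3 instruction. */
static void deinterleave_rgb_neon(const uint8_t *src,
                                  uint8_t *r, uint8_t *g, uint8_t *b)
{
    uint8x16x3_t pix = vld3q_u8(src);   /* stride-3 de-interleaving load */
    vst1q_u8(r, pix.val[0]);
    vst1q_u8(g, pix.val[1]);
    vst1q_u8(b, pix.val[2]);
}
```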
