Abstract
The abstract vector processing unit (VPU) is a virtual VPU that represents a set of real VPUs. It has an idealised instruction set and the constraints common to the VPUs that it represents. Together, the idealised instruction set and common constraints allow programs to be portable and to perform efficiently on the real VPUs being represented. The abstract VPU suits programmers who want the performance advantages of using a VPU directly without being restricted to a particular VPU. This paper presents an abstract VPU implementation, the Vector Virtual Machine (VVM). VVM is designed to represent desktop VPUs, such as AltiVec, and to support the creation of a generic, vectorised, machine-vision library. The constraints common to desktop VPUs that VVM adopts are short vectors, fixed vector sizes, and fast access to aligned memory addresses only. To support the creation of a generic, vectorised, machine-vision library, VVM has traits, templated vectors, and constant-scalar-count vectors, and it uses consistent functions for scalars and vectors where possible. Because VVM vectors have a constant scalar count, each VVM vector can consist of more than one scalar or VPU vector. Since type conversions are expensive in VPU programs, VVM supports only explicit type conversions. Ideally, an abstract VPU should have no overheads; that is, it should be zero-cost. Function overloading and expression templates were tested to investigate whether a zero-cost implementation of VVM can be created using only Standard C++ and Apple's GCC 3.1 20021003 compiler. Results show that, in this environment, zero-cost can be achieved only when processing VVM vectors that contain a single scalar or VPU vector. For expressions involving one to five additions, function overloading was at worst 3.6% slower than hand-coded programs when processing VVM vectors with one scalar or VPU vector. Unfortunately, for VVM vectors with four VPU vectors, function overloading was at worst 23.0% slower.
For the same kinds of expressions, the expression templates implementation was up to 888.3% slower. It was comparable to hand-coded programs only for expressions involving a single addition between VVM vectors that contain four VPU vectors. When a VPU is available, only char VVM vectors will consist of one VPU vector. In scalar mode, all VVM vectors will consist of one scalar. Thus, with Apple's GCC 3.1 20021003, a VVM implementation based on function overloading has minimal overheads only for applications where most operations involve char types. This implementation's performance is adequate for creating a generic, vectorised, machine-vision library because most image processing operations operate on char types.
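The two implementation strategies compared above can be illustrated with a minimal sketch. This is not the paper's VVM code: the names `Vec`, `Leaf`, `Sum`, `lazy`, `lazy_add`, and `evaluate` are hypothetical, and a plain array stands in for the scalars or VPU vectors that a constant-scalar-count vector would hold.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical stand-in for a VVM vector: it always holds N scalars,
// however many hardware vectors that would take on a real VPU.
template <typename T, std::size_t N>
struct Vec {
    typedef T value_type;
    T data[N];

    T operator[](std::size_t i) const {
        assert(i < N);
        return data[i];
    }
};

// Strategy 1: function overloading.
// Each '+' computes and returns a complete temporary vector, so
// a + b + c loops over the data twice.
template <typename T, std::size_t N>
Vec<T, N> operator+(const Vec<T, N>& a, const Vec<T, N>& b) {
    Vec<T, N> r;
    for (std::size_t i = 0; i < N; ++i) r.data[i] = a.data[i] + b.data[i];
    return r;
}

// Strategy 2: expression templates.
// Leaf refers to an existing vector without copying it.
template <typename T, std::size_t N>
struct Leaf {
    typedef T value_type;
    const Vec<T, N>* v;

    T operator[](std::size_t i) const { return (*v)[i]; }
};

// Sum records "lhs + rhs" in its type and computes an element only on request.
template <typename L, typename R>
struct Sum {
    typedef typename L::value_type value_type;
    L lhs;
    R rhs;

    value_type operator[](std::size_t i) const { return lhs[i] + rhs[i]; }
};

// Wraps a vector as an expression leaf (the vector must outlive the leaf).
template <typename T, std::size_t N>
Leaf<T, N> lazy(const Vec<T, N>& v) {
    Leaf<T, N> leaf = { &v };
    return leaf;
}

// Builds an expression node; no arithmetic happens here.
template <typename L, typename R>
Sum<L, R> lazy_add(const L& l, const R& r) {
    Sum<L, R> s = { l, r };
    return s;
}

// Evaluates the whole expression tree in a single loop over the elements.
template <typename T, std::size_t N, typename E>
Vec<T, N> evaluate(const E& e) {
    Vec<T, N> r;
    for (std::size_t i = 0; i < N; ++i) r.data[i] = e[i];
    return r;
}
```

With three `Vec<int, 4>` values `a`, `b`, and `c`, the overloaded form is written `a + b + c`, while the expression-template form is `evaluate<int, 4>(lazy_add(lazy_add(lazy(a), lazy(b)), lazy(c)))`. Both produce the same elements; whether a given compiler reduces either form to the speed of a hand-written loop is the question the measurements above address.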