Design of a fast inner product processor

S P Smith, H C Torng

doi:10.1109/arith.1985.6158974

Abstract

This paper presents the design of a fast inner product processor with appreciably reduced latency and cost. The inner product processor is implemented with a tree of carry-propagate or carry-save adders; this tree is obtained by incorporating three innovations into the conventional multiply/add tree: (1) The leaf multipliers are expanded into adder subtrees, thus achieving an O(log Nb) latency, where N denotes the number of elements in a vector and b the number of bits in each element. (2) The partial products to be summed in producing an inner product are reordered according to their "minimum alignments", bringing approximately a 20% saving in hardware. (3) The reordering also shortens the carry propagation chain in the final propagation stage by 2 log b − 1 positions, significantly reducing the latency further. A form of the Baugh and Wooley algorithm is adopted to implement two's complement notation with changes only in peripheral hardware.
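The O(log Nb) latency claim in innovation (1) can be illustrated with a small software model. The sketch below is not the authors' hardware design: it assumes unsigned b-bit operands (so it leaves out the Baugh and Wooley two's complement handling), uses ordinary integer addition in place of carry-save or carry-propagate adders, and does not model the "minimum alignment" reordering. The function names `partial_products`, `tree_sum`, and `inner_product` are hypothetical. It only shows that flattening all N·b partial products into one balanced adder tree gives a depth of about log2(Nb) levels.

```python
def partial_products(xs, ys, b):
    """Return all N*b shifted partial products for unsigned b-bit operands.

    Each term is x_i * (bit j of y_i) * 2**j, i.e. what a leaf multiplier
    would otherwise sum internally. Zero terms are kept so the tree shape
    depends only on N and b, as it would in fixed hardware.
    """
    terms = []
    for x, y in zip(xs, ys):
        for j in range(b):
            terms.append((x << j) if (y >> j) & 1 else 0)
    return terms


def tree_sum(terms):
    """Add terms pairwise, level by level.

    Returns (total, levels), where levels counts the adder stages on the
    longest path. Assumes at least one term.
    """
    levels = 0
    while len(terms) > 1:
        nxt = [terms[i] + terms[i + 1] for i in range(0, len(terms) - 1, 2)]
        if len(terms) % 2:
            nxt.append(terms[-1])  # odd term passes through to the next level
        terms = nxt
        levels += 1
    return terms[0], levels


def inner_product(xs, ys, b):
    """Inner product of two equal-length vectors via a single flat adder tree."""
    return tree_sum(partial_products(xs, ys, b))


total, levels = inner_product([3, 5, 7, 2], [4, 1, 6, 9], 4)
```

For this example with N = 4 and b = 4, the model sums 16 partial products in 4 levels, which matches log2(Nb) = log2(16).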
