Abstract

This paper presents the design of a fast inner product processor, with appreciably reduced latency and cost The inner product processor is implemented with a tree of carry propagate or carry save adders; this tree is obtained with the incorporation of three innovations in the conventional multiply/add tree: (1) The leaf-multipliers are expanded into adder subtress, thus achieving an O(logNb) latency, where N denotes the number of elements in a vector and b the number of bits in each element (2) The partial products, to be summed in producing an inner product, are reordered according to their "minimum alignments", bringing approximately a 20% saving in hardware. (3) The reordering also truncates the carry propagation chain in the final propagation stage by 2 log b − 1 positions, significantly reducing the latency further. A form of the Baugh and Wooley algorithm is adopted to implement two's complement notation with changes only in peripheral hardware.

Full Text
Paper version not known

Talk to us

Join us for a 30 min session where you can share your feedback and ask us any queries you have

Schedule a call