The ring array processor: A multiprocessing peripheral for connectionist applications

Nelson Morgan, James Beck, Phil Kohn, Jeff Bilmes, Eric Allman, Joachim Beer

doi:10.1016/0743-7315(92)90067-w

Abstract

We have designed and implemented a Ring Array Processor (RAP) for fast implementation of our continuous speech recognition training algorithms, which are currently dominated by layered “neural” network calculations. The RAP is a multi-DSP system with a low-latency ring interconnection scheme using programmable gate array technology and a significant amount of local memory per node (4–16 Mbytes of dynamic memory and 256 Kbytes of fast static RAM). Theoretical peak performance is 128 MFLOPS/board. A working system with 20 nodes has been used for our research at rates of 200–300 million connections per second for probability evaluation, and at roughly 30–60 million connection updates per second for training. A fully functional system with 40 nodes has also been benchmarked at roughly twice these rates. While practical considerations such as workstation address space restrict current implementations to 64 nodes, the architecture scales to about 16,000 nodes. For problems with 2 units per processor, communication and control overhead would reduce peak performance on the error back-propagation algorithm to about 50% of a linear speedup. This report describes the motivation for the RAP and shows how the architecture matches the target algorithm. We further describe some of the key features of the hardware and software design.
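As an illustration of how a layered network calculation can map onto a ring of processors, the following is a minimal sketch in Python. It is a software simulation under stated assumptions, not the RAP implementation: the function name `ring_matvec`, the even partitioning of units across nodes, and the shift-by-one communication pattern are all hypothetical choices made for this example. Each simulated node owns the weights of a subset of output units and one slice of the input vector; at every step it accumulates the contribution of the slice it currently holds, then passes that slice to its neighbor.

```python
def ring_matvec(weights, inputs, num_nodes):
    """Simulate y = W x on a ring of num_nodes nodes.

    Assumed (hypothetical) layout: node p owns a contiguous block of
    output units (rows of W) and starts with one contiguous slice of x.
    """
    n_out, n_in = len(weights), len(inputs)
    assert n_out % num_nodes == 0 and n_in % num_nodes == 0
    rows_per = n_out // num_nodes   # output units per node
    cols_per = n_in // num_nodes    # input values per slice

    # Each node holds (slice index, slice values); initially its own slice.
    held = [(p, inputs[p * cols_per:(p + 1) * cols_per])
            for p in range(num_nodes)]
    # Running sums for the output units owned by each node.
    partial = [[0.0] * rows_per for _ in range(num_nodes)]

    for _ in range(num_nodes):
        # Compute phase: every node uses the slice it currently holds.
        for p in range(num_nodes):
            src, chunk = held[p]
            base = src * cols_per
            for r in range(rows_per):
                row = weights[p * rows_per + r]
                partial[p][r] += sum(row[base + c] * chunk[c]
                                     for c in range(cols_per))
        # Communication phase: node p passes its slice to node (p + 1) % num_nodes.
        held = [held[(p - 1) % num_nodes] for p in range(num_nodes)]

    # Concatenate per-node results in unit order.
    return [y for node in partial for y in node]


if __name__ == "__main__":
    W = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]]
    x = [1, 0, -1, 2]
    print(ring_matvec(W, x, num_nodes=2))
```

After `num_nodes` steps every node has seen every input slice exactly once, so the concatenated result equals the ordinary matrix-vector product. In this sketch, with only a few units per node the compute phase shrinks while the per-step communication cost remains, which is one way to read the overhead figure quoted above for 2 units per processor.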

Full Text