The semiconductor industry roadmap projects that advances in VLSI technology will permit more than one billion transistors on a chip by the year 2010. The MIT Raw microprocessor is a proposed architecture that strives to exploit these chip-level resources by implementing thousands of tiles, each comprising a processing element and a small amount of memory, coupled by a static two-dimensional interconnect. A compiler partitions fine-grain instruction-level parallelism across the tiles and statically schedules intertile communication over the interconnect. Because Raw microprocessors fully expose their internal hardware structure to the software, they can be viewed as a gigantic FPGA with coarse-grained tiles in which software orchestrates communication over static interconnections. One open challenge in Raw architectures is to determine their optimal grain size and balance. The grain size is the area of each tile and the balance is the proportion of area in each tile devoted to memory, processing, communication, and off-chip global I/O. If the total chip area is fixed, higher processing power per tile requires large tiles and hence reduces the total number of tiles on the chip. This paper presents SimpleFit, a novel analytical framework that designers can use to reason about the design space of Raw microprocessors. Our model is also generalizable to multiprocessors on a chip. Based on an architectural model, an application model, and a VLSI cost analysis, the framework computes the performance of applications and uses an optimization process to identify designs that will execute these applications most cost-effectively, Although the optimal machine configurations obtained vary for different applications, problem sizes, and budgets, the general trends for various applications are similar. Accordingly, for the applications studied, assuming a one billion logic transistor equivalent area, we recommend building a Raw chip with approximately 1,000 tiles, 30 words/cycle global I/O, 20 Kbytes of local memory per tile, three to four words/cycle local communication bandwidth, and single-issue processors. This configuration will give performance near the global optimum for most applications.
Read full abstract