Small Amount Of Memory Research Articles

The semiconductor industry roadmap projects that advances in VLSI technology will permit more than one billion transistors on a chip by the year 2010. The MIT Raw microprocessor is a proposed architecture that strives to exploit these chip-level resources by implementing thousands of tiles, each comprising a processing element and a small amount of memory, coupled by a static two-dimensional interconnect. A compiler partitions fine-grain instruction-level parallelism across the tiles and statically schedules intertile communication over the interconnect. Because Raw microprocessors fully expose their internal hardware structure to the software, they can be viewed as a gigantic FPGA with coarse-grained tiles in which software orchestrates communication over static interconnections. One open challenge in Raw architectures is to determine their optimal grain size and balance. The grain size is the area of each tile, and the balance is the proportion of area in each tile devoted to memory, processing, communication, and off-chip global I/O. If the total chip area is fixed, higher processing power per tile requires larger tiles and hence reduces the total number of tiles on the chip. This paper presents SimpleFit, a novel analytical framework that designers can use to reason about the design space of Raw microprocessors. Our model is also generalizable to multiprocessors on a chip. Based on an architectural model, an application model, and a VLSI cost analysis, the framework computes the performance of applications and uses an optimization process to identify designs that will execute these applications most cost-effectively. Although the optimal machine configurations obtained vary for different applications, problem sizes, and budgets, the general trends for various applications are similar.
Accordingly, for the applications studied, assuming a one billion logic transistor equivalent area, we recommend building a Raw chip with approximately 1,000 tiles, 30 words/cycle global I/O, 20 Kbytes of local memory per tile, three to four words/cycle local communication bandwidth, and single-issue processors. This configuration will give performance near the global optimum for most applications.
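
The grain-size trade-off described in this abstract can be illustrated with a minimal sketch. It is not the SimpleFit model: the function names, the abstract "area units", and every number below are assumptions chosen only to show that, under a fixed chip area, a larger tile means fewer tiles.

```python
# Illustrative only: hypothetical area units, not values from the paper.

def tile_area(proc, mem, comm, io):
    """Grain size: total area of one tile, summed over its four components."""
    return proc + mem + comm + io


def tiles_on_chip(chip_area, proc, mem, comm, io):
    """Number of whole, identical tiles that fit in a fixed chip area."""
    area = tile_area(proc, mem, comm, io)
    if area <= 0:
        raise ValueError("tile area must be positive")
    return chip_area // area


def balance(proc, mem, comm, io):
    """Balance: fraction of a tile's area devoted to each component."""
    area = tile_area(proc, mem, comm, io)
    return {
        "processing": proc / area,
        "memory": mem / area,
        "communication": comm / area,
        "io": io / area,
    }


CHIP_AREA = 1_000_000  # assumed fixed budget, in arbitrary area units

# Two hypothetical tile designs that differ only in processing area.
small = dict(proc=300, mem=400, comm=200, io=100)
large = dict(proc=1300, mem=400, comm=200, io=100)

print(tiles_on_chip(CHIP_AREA, **small))  # 1000
print(tiles_on_chip(CHIP_AREA, **large))  # 500
print(balance(**small)["memory"])         # 0.4
```

In this toy setting, spending more area on processing per tile halves the tile count. A framework such as the one the abstract describes would additionally need an application performance model to decide which of the two designs is more cost-effective.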

The distribution of resources among processors, memory and caches is a crucial question faced by designers of large-scale parallel machines. If a machine is to solve problems with a certain data set size, should it be built with a large number of processors each with a small amount of memory, or a smaller number of processors each with a large amount of memory? How much cache memory should be provided per processor for cost-effectiveness? And how do these decisions change as larger problems are run on larger machines? In this paper, we explore the above questions based on the characteristics of five important classes of large-scale parallel scientific applications. We first show that all the applications have a hierarchy of well-defined per-processor working sets, whose size, performance impact and scaling characteristics can help determine how large different levels of a multiprocessor's cache hierarchy should be. Then, we use these working sets together with certain other important characteristics of the applications—such as communication to computation ratios, concurrency, and load balancing behavior—to reflect upon the broader question of the granularity of processing nodes in high-performance multiprocessors. We find that very small caches whose sizes do not increase with the problem or machine size are adequate for all but two of the application classes. Even in the two exceptions, the working sets scale quite slowly with problem size, and the cache sizes needed for problems that will be run in the foreseeable future are small. We also find that relatively fine-grained machines, with large numbers of processors and quite small amounts of memory per processor, are appropriate for all the applications.

Articles published on Small Amount Of Memory

Fibonacci and Galois representations of feedback-with-carry shift registers

Squeezer: An efficient algorithm for clustering categorical data

New directions in traffic measurement and accounting

Registration of Multiple Acoustic Range Views for Underwater Scene Reconstruction

SVM-based face verification with feature set of small size

SimpleFit: a framework for analyzing design trade-offs in Raw architectures

Minimax real-time heuristic search

Orientation lightmaps for photon tracing in complex environments

Efficient real-time correlator for CDMA2000 searcher

A combined input and output queued packet switched system based on PRIZMA switch on a chip technology

Speech processing coder, decoder and command recognizer

An efficient document clustering algorithm and its application to a document browser

Auto-associative segmentation for real-time object recognition in realistic outdoor images

A multiresolution approach for page segmentation

Identification scheme based on quadratic residue

Diagnostic fault simulation for synchronous sequential circuits

The numerical simulation of Gaussian cross-correlated wind velocity fluctuations by means of a hybrid model

A universal parallel computer architecture

Working sets, cache sizes, and node granularity issues for large-scale multiprocessors

PC-based system for transparent fluid film monitoring
