After a long period of intense research conducted mainly in academia and supported by government agencies, parallel processing has moved to industry and now derives the bulk of its support from the commercial world. This move has shifted the emphasis from record-breaking performance to price-performance ratios and sustained speed of application execution. The current parallel architectures are fast and economically sound. As a result, there is a strong trend towards widening the application base of parallel processing, both in hardware and in software. At the same time, rapidly developing technology keeps changing the balance among the costs of computing, communication and programming, and new balances often lead to new approaches.

On the hardware side, the prevailing tendency is to use off-the-shelf, commercially available components (processors and interconnection switches) that benefit from the rapid pace of technological advancement fueled by Moore's Law. The other tendency is the convergence of different architectural approaches as the successful ones spread to new systems. Workstations interconnected by fast networks approach the performance of special-purpose parallel machines. Shared-memory machines with multilevel caches and sophisticated prefetching strategies execute programs with efficiency similar to that of distributed-memory machines. At the hardware level, the essential feature of this quickly changing landscape is the disparity in the rates at which network bandwidth, processor speed and memory access time improve; they are listed here in descending order of their rate of improvement. All-optical networks and interconnects are changing the balance on the networking side. Throughput now tends to be limited by processor speed and software overheads rather than by network bandwidth, as it was in the past. Network latency, on the other hand, remains fundamentally limited by the speed of light and the distance that the transferred data must travel. In addition, processor speed is improving faster than memory access time, because advances in memory technology are used mainly to increase chip capacity rather than speed. The buffering used to mask these speed differences has led to the multilevel memory hierarchy, in which registers, primary cache, secondary cache and main memory are typical layers with progressively lower speed but larger capacity.

One result of these trends is the growing importance of data locality for the performance of computer systems, where the architectural details dictate the structure of the most efficient object code, as the simple sketch below illustrates. However, hardware speed is improving faster than programming efficiency, so programmer time is becoming more expensive relative to the cost of hardware. Expecting a programmer to find the optimal run-time structure by hand runs contrary to this trend. As a result, programming trends favor portability and reuse of software, which require abstraction from the architectural details of the computer. Hence, compilers and run-time systems must be responsible for tuning portable software to a particular architecture, and research on automatic optimization of data locality at both compile time and run time has been growing in importance. The increasing complexity of interactions among processor, memory hierarchy and network in parallel systems must be encapsulated in a proper abstract model capable of providing a universal representation of parallel algorithms.
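As a simple illustration of why data locality matters on such a hierarchy, consider the following C sketch. It is illustrative only: the matrix size N and the use of clock() for timing are arbitrary choices, and the actual ratio between the two timings depends on the machine. The program sums the same matrix twice, once in row-major order, which matches the layout C uses and keeps consecutive accesses within the same cache lines, and once in column-major order, which strides across full rows and therefore keeps missing the caches.

/* Illustrative sketch: the effect of traversal order on a machine
   with a multilevel memory hierarchy.  N is an arbitrary choice. */
#include <stdio.h>
#include <time.h>

#define N 2048

static double a[N][N];   /* C stores this array row by row */

int main(void)
{
    double sum;
    clock_t t0, t1;

    /* Row-major traversal: consecutive accesses fall in the same
       cache line, so most loads hit the primary or secondary cache. */
    sum = 0.0;
    t0 = clock();
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            sum += a[i][j];
    t1 = clock();
    printf("row-major    sum=%g  time=%.3fs\n",
           sum, (double)(t1 - t0) / CLOCKS_PER_SEC);

    /* Column-major traversal of the same data: each access strides
       by a full row, defeating the caches and exposing the latency
       of main memory. */
    sum = 0.0;
    t0 = clock();
    for (int j = 0; j < N; j++)
        for (int i = 0; i < N; i++)
            sum += a[i][j];
    t1 = clock();
    printf("column-major sum=%g  time=%.3fs\n",
           sum, (double)(t1 - t0) / CLOCKS_PER_SEC);

    return 0;
}

The loop interchange that turns the slow version into the fast one is exactly the kind of locality-improving transformation that optimizing compilers and run-time systems are expected to apply automatically, so that portable source code need not encode such architectural details.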
The widening user base relies on the standardization of parallel programming tools. Parallel programmers face a daunting challenge, especially with increasingly large and complex applications.