Abstract

We examine multiprocessor runtime support for fine-grained, irregular directed acyclic graphs (DAGs) such as those that arise from sparse-matrix triangular solves. We conduct our experiments on the CM-5, whose lower latencies and active-message support allow us to achieve unprecedented speedups for a general multiprocessor. Where as previous implementations have maximum speedups of less than 4 on even simple banded matrices, we are able to obtain scalable performance on extremely small and irregular problems. On a matrix with only 5300 rows, we are able to achieve scalable performance with a speedup of 34 for 128 processors, resulting in an absolute performance of over 33 million double-precision floating point operations per second. We achieve these speedups with non-matrix-specific methods which are applicable to any DAG. We compare a range of run-time preprocessed and dynamic approaches on matrices from the Harwell-Boeing benchmark set. Although precomputed data distributions and execution schedules produce the best performance, we find that it is challenging to keep their cost low enough to make them worthwhile on small, fine-grained problems. Additionally, we find that a policy of frequent network polling can reduce communication overhead by a factor of three over the standard CM-5 policies. We present a detailed study of runtime overheads and demonstrate that send and receive processor overhead still dominate these applications on the CM-5. We conclude that these applications would highly benefit from architectural support for low-overhead communication.

Full Text
Published version (Free)

Talk to us

Join us for a 30 min session where you can share your feedback and ask us any queries you have

Schedule a call