Destination Processors Research Articles

Dynamic data redistribution is used to enhance data locality and algorithm performance by reducing interprocessor communication in many parallel scientific applications on distributed memory multicomputers. Since the redistribution is performed at runtime, there is a performance tradeoff between the efficiency of the new data decomposition for a subsequent phase of an algorithm and the cost of redistributing data among processors. In this paper, we present a processor replacement scheme to minimize the cost of interprocessor data exchange during runtime. The main idea of the proposed technique is to develop a replacement function for reordering logical processors in the destination phase. Based on the replacement function, a realigned sequence of destination processors can be derived and is then used to perform data decomposition in the receiving phase. Together with transposition schemes for local matrices and compressed CRS vectors, this allows interprocessor communication to be eliminated during runtime. A significant advantage of this approach is that, for special cases, the realignment of data can be performed without any interprocessor communication. The second contribution of the present technique is that the complicated generation of communication sets can be simplified by applying local matrix transposition. Consequently, the indexing cost can be reduced significantly. The proposed techniques can be applied in both dense and sparse applications. A generalized symmetric redistribution algorithm is also presented in this work. The theoretical analysis proves that up to (p-1)/p of the data transmission cost can be saved. For general cases, the symmetric redistribution algorithm saves 1/p of the communication overhead compared with the traditional method. Experimental results also show that the proposed techniques provide superior performance in most data redistribution instances.
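The idea of reordering logical destination processors can be illustrated with a small sketch. This is not the replacement function from the paper, which the abstract does not specify; it is a brute-force stand-in that assumes a one-dimensional array, block-cyclic layouts on both sides, and hypothetical helper names (`owner`, `traffic_matrix`, `best_relabeling`). It counts how many elements each physical processor already holds for each logical destination, then searches for the logical-to-physical assignment that leaves the most data in place.

```python
from itertools import permutations


def owner(i, block, nprocs):
    """Processor that owns global index i under a block-cyclic(block) layout."""
    return (i // block) % nprocs


def traffic_matrix(n, src_block, dst_block, nprocs):
    """counts[s][d] = elements held by physical processor s that logical destination d needs."""
    counts = [[0] * nprocs for _ in range(nprocs)]
    for i in range(n):
        counts[owner(i, src_block, nprocs)][owner(i, dst_block, nprocs)] += 1
    return counts


def best_relabeling(counts):
    """Exhaustively pick the logical-to-physical mapping that keeps the most data local.

    Illustrative only: the search is factorial in the number of processors.
    """
    nprocs = len(counts)
    best_map, best_local = None, -1
    for perm in permutations(range(nprocs)):
        # perm[d] is the physical processor that plays logical destination d.
        local = sum(counts[perm[d]][d] for d in range(nprocs))
        if local > best_local:
            best_map, best_local = perm, local
    return best_map, best_local


if __name__ == "__main__":
    counts = traffic_matrix(n=12, src_block=4, dst_block=2, nprocs=3)
    identity_local = sum(counts[p][p] for p in range(3))
    mapping, relabeled_local = best_relabeling(counts)
    print("local elements with identity mapping:", identity_local)
    print("best mapping:", mapping, "local elements:", relabeled_local)
```

For 12 elements redistributed from block size 4 to block size 2 on 3 processors, the identity mapping keeps 4 elements local, while swapping logical destinations 1 and 2 keeps 6 local, so fewer elements have to cross the network.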

The block-cyclic data distribution is commonly used to organize array elements over the processors of a coarse-grained distributed memory parallel computer. In many scientific applications, the data layout must be reorganized at run-time in order to enhance locality and reduce remote memory access overheads. In this paper we present a general framework for developing array redistribution algorithms. Using this framework, we have developed efficient algorithms that redistribute an array from one block-cyclic layout to another. Block-cyclic redistribution consists of index set computation, wherein the destination locations for individual data blocks are calculated, and data communication, wherein these blocks are exchanged between processors. The framework treats both these operations in a uniform and integrated way. We have developed efficient and distributed algorithms for index set computation that do not require any interprocessor communication. To perform data communication in a conflict-free manner, we have developed direct, indirect, and hybrid algorithms. In the direct algorithm, a data block is transferred directly to its destination processor. In an indirect algorithm, data blocks are moved from source to destination processors through intermediate relay processors. The hybrid algorithm is a combination of the direct and indirect algorithms. Our framework is based on a generalized circulant matrix formalism of the redistribution problem and a general purpose distributed memory model of the parallel machine. Our algorithms sustain excellent performance over a wide range of problem and machine parameters. We have implemented our algorithms using MPI, to allow for easy portability across different HPC platforms. Experimental results on the IBM SP-2 and the Cray T3D show superior performance over previous approaches. When the block size of the cyclic data layout changes by a factor of K, the redistribution can be performed in O(log K) communication steps. This is true even when K is a prime number. In contrast, previous approaches take O(K) communication steps for redistribution. Our framework can be used for developing scalable redistribution libraries, for efficiently implementing parallelizing compiler directives, and for developing parallel algorithms for various applications. Redistribution algorithms are especially useful in signal processing applications, where the data access patterns change significantly between computational phases. They are also necessary in linear algebra programs, to perform matrix transpose operations.
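The index set computation step can be sketched for the simplest case, in which the block size grows from x to K·x on the same number of processors. The sketch below is only an illustration under those assumptions and is not the circulant-matrix algorithm from the paper; the function name `send_sets` and its parameters are hypothetical. It shows why this step needs no interprocessor communication: each processor can work out the destination of every block it owns from the global parameters alone.

```python
from collections import defaultdict


def send_sets(rank, n_blocks, nprocs, k):
    """Map each destination processor to the source blocks that `rank` must send to it.

    Blocks are numbered globally in units of the old block size x.
    Under cyclic(x), block b lives on processor b % nprocs.
    Under cyclic(k*x), block b belongs to superblock b // k,
    which lives on processor (b // k) % nprocs.
    """
    sends = defaultdict(list)
    for b in range(rank, n_blocks, nprocs):  # blocks owned by `rank` under cyclic(x)
        dest = (b // k) % nprocs             # owner of the enclosing superblock
        sends[dest].append(b)
    return dict(sends)


if __name__ == "__main__":
    # Each rank computes its own sets independently of the others.
    for rank in range(4):
        print(rank, send_sets(rank, n_blocks=16, nprocs=4, k=2))
```

With 16 blocks, 4 processors, and K = 2, processor 1 owns blocks 1, 5, 9, and 13, and computes that blocks 1 and 9 go to processor 0 while blocks 5 and 13 go to processor 2. Scheduling those transfers without conflicts is the separate data communication step that the direct, indirect, and hybrid algorithms address.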

Articles published on Destination Processors

Training-Based Channel Estimation Algorithms for Dual Hop MIMO OFDM Relay Systems

A multiprocessor-oriented power-conscious scheduling algorithm for periodic tasks

Dynamic routing of data stream tuples among parallel query plan running on multi-core processors

An optimal scheduling algorithm for an agent-based multicast strategy on irregular networks

Optimizing Communications of Dynamic Data Redistribution on Symmetrical Matrices in Parallelizing Compilers

A fault-tolerant message passing algorithm and its hardware implementation

A generalized processor mapping technique for array redistribution

ON MESSAGE PACKAGING IN TASK SCHEDULING FOR DISTRIBUTED MEMORY PARALLEL MACHINES

THE VECTOR MULTIPROCESSOR

Efficient Algorithms for Block-Cyclic Redistribution of Arrays

Operation and performance of an ATM based demonstrator for the sequential option of the ATLAS trigger

Heterogeneous process migration: the Tui system

Application-level load migration and its implementation on top of PVM

Calculating controller area network (CAN) message response times

NIFDY

Calculating controller area network (CAN) message response times

Holistic schedulability analysis for distributed hard real-time systems

FTN topology and protocols

Using the dual path property of omega networks to obtain conflict-free message routing

The representation of multistage interconnection networks in queuing models of parallel systems
