Fat-tree Network Research Articles

The constant advances in IC technologies have introduced new challenges for implementations and design methodologies: higher integration levels allow more complex systems to be implemented, but implementations often face strict constraints on power consumption. These challenges are present in signal processing systems, implying the need to improve design methods and find more efficient algorithm-architecture optimizations. This special issue contains a selection of recent papers on the design and implementation of signal processing systems, ranging from circuit-level architectures to scheduling methods and from application-specific architectures to implementations on many-core systems.

In "Data Center Switch for Load Balanced Fat-Trees", Lai and Chiu demonstrate a fault-tolerant switch IC operating at a maximum rate of 5.8 Gbps per channel. This work employs a load-balanced fat-tree architecture that does not consume all of its bandwidth even under heavy traffic. When links are broken or switches are faulty, even under heavy traffic load, bandwidth remains available in every connection pattern and alternative paths are provided to re-route the traffic. Evaluations of the tolerance to link and switch faults in the fat-tree network support the presented idea, and a 4×4 Banyan-type switch IC is developed as the commodity switch for building fault-tolerant fat-tree data center networks.

Lee and Sung propose a cell-to-cell interference (CCI) cancellation technique for multi-level NAND flash memory in their paper "Least Squares Based Coupling Cancellation for MLC NAND Flash Memory with a Small Number of Voltage Sensing Operations". Their two-step algorithm consists of training followed by interference removal performed during the page read operation. A least-squares adaptive CCI canceller is developed and optimal quantization schemes are studied. Experimental results show a significant BER improvement despite a low number of voltage sensing operations.

In "A Fast Recursive Algorithm and Architecture for Pruned Bit-Reversal Interleavers", Mansour describes an algorithm and architecture for implementing interleavers used in communications applications such as error-correcting codes (turbo codes) and bit-interleaved coded modulation. A mathematical formulation for flexible-length interleavers is developed, along with a study of permutation statistics. Practical examples of parallel interleaver implementations are provided.

In "Highly Parallelable Bidimensional Median Filter for Modern Parallel Programming Models", Sanchez and Rodriguez present efficient parallel implementation methods for median filtering. The authors implement their previous work on the parallel ccdf-based median filter (PCMF) on a GPU (Graphics Processing Unit) and show that the proposed median filtering algorithm is efficient and can outperform other generic median filters for the GPU. The proposed algorithm is implemented in three parallel programming models: SIMD Intel, multi-core Intel with SIMD, and SIMT (CUDA). Additionally, they use the salt-and-pepper noise model to improve the image reconstruction quality with a small performance impact.

J. Takala (*), Department of Pervasive Computing, Tampere University of Technology, Tampere, Finland. E-mail: jarmo.takala@tut.fi
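To make the pruned bit-reversal interleaving mentioned in the abstract above more concrete, here is a minimal Python sketch of the textbook definition: indices are bit-reversed over the smallest power-of-two block that covers the interleaver length, and addresses that fall outside the block are skipped. This is a naive reference that scans every address, not the fast recursive algorithm or the architecture from Mansour's paper; the function names are illustrative assumptions.

```python
def bit_reverse(value: int, width: int) -> int:
    """Reverse the lowest `width` bits of `value`."""
    result = 0
    for _ in range(width):
        result = (result << 1) | (value & 1)
        value >>= 1
    return result


def pruned_bit_reversal_permutation(length: int) -> list[int]:
    """Naive pruned bit-reversal permutation of range(length).

    Walks all 2**width bit-reversed addresses in order and keeps only
    those that are valid positions (< length). Hypothetical helper for
    illustration; cost is O(2**width), which is what fast algorithms avoid.
    """
    if length <= 0:
        return []
    # Smallest bit width whose address space covers the block.
    width = max(1, (length - 1).bit_length())
    reversed_addresses = (bit_reverse(i, width) for i in range(1 << width))
    return [r for r in reversed_addresses if r < length]


if __name__ == "__main__":
    print(pruned_bit_reversal_permutation(5))  # [0, 4, 2, 1, 3]
```

For a length of 5, the mother interleaver has 8 addresses; the bit-reversed sequence 0, 4, 2, 6, 1, 5, 3, 7 is pruned to 0, 4, 2, 1, 3 by dropping the addresses 6, 5, and 7.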

The poor scalability of existing superscalar processors has been of great concern to the computer engineering community. In particular, the critical-path lengths of many components in existing implementations grow as $\Theta(n^2)$, where $n$ is the fetch width, the issue width, or the window size. This paper describes two scalable processor architectures, Ultrascalar I and Ultrascalar II, and compares their VLSI complexities (gate delays, wire-length delays, and area). Both processors are implemented by a large collection of ALUs with controllers (together called execution stations) connected by a network of parallel-prefix tree circuits. A fat-tree network connects an interleaved cache to the execution stations. These networks provide the full functionality of superscalar processors, including renaming, out-of-order execution, and speculative execution. The difference between the processors is in the mechanism used to transmit register values from one execution station to another. Both architectures use a parallel-prefix tree to communicate the register values between the execution stations. Ultrascalar I transmits an entire copy of the register file to each station, and the station chooses which register values it needs based on the instruction. Ultrascalar I uses an H-tree layout. Ultrascalar II uses a mesh-of-trees layout and sends only the register values that will actually be needed by each subtree, which reduces the number of wires required on the chip. The complexity results are stated for a processor whose instruction-set architecture contains $L$ logical registers and that can execute $n$ instructions in parallel. The chip provides enough memory bandwidth to execute up to $M(n)$ memory operations per cycle. ($M$ is assumed to have a certain regularity property.) In all the processors, the VLSI area is the square of the wire delay.
Ultrascalar I has gate delay $O(\log n)$ and wire delay $\tau_{\mathrm{wires}} = \Theta(\sqrt{n}\,L)$ if $M(n)$ is $O(n^{1/2-\varepsilon})$, $\tau_{\mathrm{wires}} = \Theta(\sqrt{n}\,(L+\log n))$ if $M(n)$ is $\Theta(n^{1/2})$, and $\tau_{\mathrm{wires}} = \Theta(\sqrt{n}\,L+M(n))$ if $M(n)$ is $\Omega(n^{1/2+\varepsilon})$, for $\varepsilon > 0$. Ultrascalar II has gate delay $\Theta(\log L+\log n)$. Its wire delay is $\Theta(n)$, which is optimal for $n = O(L)$. Thus, Ultrascalar II dominates Ultrascalar I for $n = O(L^2)$; otherwise Ultrascalar I dominates Ultrascalar II. We introduce a hybrid ultrascalar that uses a two-level layout scheme: clusters of execution stations are laid out using the Ultrascalar II mesh-of-trees layout, and the clusters are then connected using the H-tree layout of Ultrascalar I. For the hybrid (in which $n \ge L$), the wire delay is $\Theta(\sqrt{nL}+M(n))$, which is optimal. For $n \ge L$, the hybrid dominates both Ultrascalar I and Ultrascalar II. We also present an empirical comparison of Ultrascalar I and the hybrid, both laid out using the Magic VLSI editor. For a processor that has 32 32-bit registers and a simple integer ALU, the hybrid requires about 11 times less area than Ultrascalar I.
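The crossover points quoted in the abstract can be checked with a short calculation. The sketch below rests on stated assumptions rather than on the paper's own derivation: the memory-bandwidth term is taken to be negligible, i.e. $M(n) = O(n^{1/2-\varepsilon})$; $L \ge 1$; the hybrid bound is read as a square root over the product $nL$; and constant factors are ignored throughout.

```latex
\begin{align*}
\text{Ultrascalar II vs. I:}\quad
  & n \le \sqrt{n}\,L \iff \sqrt{n} \le L \iff n \le L^2 \\
\text{Hybrid vs. II } (n \ge L):\quad
  & \sqrt{nL} \le \sqrt{n \cdot n} = n \\
\text{Hybrid vs. I } (L \ge 1):\quad
  & \sqrt{nL} = \sqrt{n}\,\sqrt{L} \le \sqrt{n}\,L \\
\text{Area } (= \tau_{\mathrm{wires}}^2):\quad
  & \Theta(nL^2),\ \Theta(n^2),\ \Theta(nL)
    \text{ for Ultrascalar I, II, and the hybrid}
\end{align*}
```

Under these assumptions, Ultrascalar II has the smaller wire delay up to roughly $n = L^2$, and for $n \ge L$ the hybrid is asymptotically no worse than either design.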

Articles published on Fat-tree Network

A parametric-based performance evaluation and design trade-offs for interconnect architectures using FPGAs for networks-on-chip

Mixed-grained CMOS field programmable analog array for smart sensory applications

The importance of switch dimension for energy-efficient datacenter design

Switch sizing for energy-efficient datacenter networks

Evaluating SDN based Rack-to-Rack Multi-path Switching for Data-center Networks

Fast pattern-specific routing for fat tree networks

Guest Editors’ Introduction to Special Issue on Advances in DSP System Design

Data Center Switch for Load Balanced Fat-Trees

Modeling the Performance of Direct Numerical Simulation on Parallel Systems

Efficient and Scalable Hardware-Based Multicast in Fat-Tree Networks

A collision model for randomized routing in fat-tree networks

A Comparison of Asymptotically Scalable Superscalar Processors

Optimal High-Performance Parallel Text Retrieval via Fat-Trees

Design and performance analysis of the Practical Fat Tree Network using a butterfly network

Universal wormhole routing
