Abstract

This paper proposes a distributed and scalable hardware solution for efficient barrier synchronization on many-core Networks-on-Chip (NoCs). It comprises two hardware modules: the Root Distributed and Scalable Barrier Synchronizer (Root DSBS) and the Leaf Distributed and Scalable Barrier Synchronizer (Leaf DSBS). The Root DSBS is located in the central node, where it connects to the processor core and the network interface. It provides a set of globally addressed barrier counters, sets the barrier, counts arriving "barrier acquire" requests, and, once the barrier condition is satisfied, releases the barrier and sends out "barrier release" acknowledgements. The Leaf DSBS is integrated into each router of the on-chip network and is responsible for efficiently transmitting barrier-synchronization packets to the Root DSBS. The Root DSBS in the central node and the Leaf DSBSs in the routers cooperate to accomplish barrier synchronization. Our solution has two salient features. The first, "Unicast Merging", merges "barrier acquire" packets destined for the same barrier into one packet when they pass through the same router simultaneously. Its purpose is to minimize the completion time of barrier acquisition by reducing the number of barrier-synchronization packets. The second, "Broadcasting", broadcasts a "barrier release" packet to all synchronized nodes. Its purpose is to reduce area cost by avoiding the storage of synchronized node numbers, and to minimize the completion time of barrier release by avoiding unicast "barrier release" packets. To evaluate the solution, we investigate its hardware cost and conduct both synthetic and application experiments. Synthesis and experimental results show that our distributed and scalable barrier synchronization has both area and performance advantages over its conventional counterpart.
The Root DSBS and Leaf DSBSs can run at over 2 GHz in TSMC® 65 nm technology with small area overhead. Our solution incurs only a short completion time and generates well-distributed, uniform network traffic. For a 16×16 network, the application performance improvement can reach 24.60%.
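
To make the "Unicast Merging" idea concrete, the following is a minimal software sketch, not the hardware design described in the paper. It assumes a square mesh, dimension-ordered (XY) routing toward the root, one hop per cycle, and no link contention; none of these details are stated in the abstract. The function names `xy_next_hop` and `acquire_phase` are hypothetical. Each non-root node injects one "barrier acquire" packet carrying a count of 1, and packets that sit in the same router in the same cycle are merged into one packet carrying the summed count.

```python
from collections import Counter


def xy_next_hop(node, root):
    """One step of XY routing toward the root (assumed routing scheme)."""
    x, y = node
    rx, ry = root
    if x != rx:
        return (x + (1 if rx > x else -1), y)
    if y != ry:
        return (x, y + (1 if ry > y else -1))
    return node


def acquire_phase(size, root, merge):
    """Idealized acquire phase on a size x size mesh.

    Returns (link_traversals, packets_delivered_to_root, root_counter).
    Assumption: every packet advances one hop per cycle without contention.
    """
    # Every non-root node sends one "barrier acquire" packet with count 1.
    packets = [((x, y), 1)
               for x in range(size) for y in range(size)
               if (x, y) != root]
    hops = 0
    delivered = 0
    root_counter = 0
    while packets:
        if merge:
            # Packets in the same router in the same cycle become one packet.
            merged = Counter()
            for pos, count in packets:
                merged[pos] += count
            packets = list(merged.items())
        moved = []
        for pos, count in packets:
            nxt = xy_next_hop(pos, root)
            hops += 1
            if nxt == root:
                delivered += 1
                root_counter += count  # root accumulates merged counts
            else:
                moved.append((nxt, count))
        packets = moved
    return hops, delivered, root_counter


if __name__ == "__main__":
    for merge in (False, True):
        hops, delivered, counter = acquire_phase(4, (2, 2), merge)
        print(f"merge={merge}: hops={hops}, "
              f"packets at root={delivered}, counter={counter}")
```

In this model the root counter reaches the same value with or without merging, so the barrier condition is unchanged; only the number of packets and link traversals differs. Once the counter equals the number of participating nodes, a single broadcast "barrier release" would replace the per-node unicast acknowledgements.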

