Topology-Aware Multi-cluster Architecture Based on Efficient Index Techniques

All-gather Algorithms Resilient to Imbalanced Process Arrival Patterns

ACM Transactions on Architecture and Code Optimization ◽

10.1145/3460122 ◽

2021 ◽

Vol 18 (4) ◽

pp. 1-22

Author(s):

Jerzy Proficz

Keyword(s):

Experimental Evaluation ◽

Data Exchange ◽

State Of The Art ◽

Monitoring And Evaluation ◽

The Other ◽

Early Data ◽

Cluster Architecture ◽

Novel Algorithms

Two novel algorithms for the all-gather operation resilient to imbalanced process arrival patterns (PAPs) are presented. The first one, Background Disseminated Ring (BDR), is based on the regular parallel ring algorithm often supplied in MPI implementations and exploits an auxiliary background thread for early data exchange from faster processes to accelerate the performed all-gather operation. The other algorithm, Background Sorted Linear synchronized tree with Broadcast (BSLB), is built upon the already existing PAP-aware gather algorithm, that is, Background Sorted Linear Synchronized tree (BSLS), followed by a regular broadcast distributing gathered data to all participating processes. The background of the imbalanced PAP subject is described, along with the PAP monitoring and evaluation topics. An experimental evaluation of the algorithms based on a proposed mini-benchmark is presented. The mini-benchmark was performed over 2,000 times in a typical HPC cluster architecture with homogeneous compute nodes. The obtained results are analyzed according to different PAPs, data sizes, and process numbers, showing that the proposed optimization works well for various configurations, is scalable, and can significantly reduce the all-gather elapsed times, in our case by up to a factor of 1.9, or 47%, in comparison with the best state-of-the-art solution.
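The regular ring all-gather that the abstract names as the basis of BDR can be illustrated with a small single-process simulation. This is only a sketch of the baseline ring pattern, not of BDR itself: there is no background thread and no real MPI communication, and the function name `ring_allgather` and the list-of-lists buffers are assumptions made for illustration.

```python
def ring_allgather(local_blocks):
    """Simulate a ring all-gather over p processes.

    local_blocks[i] is the block initially owned by process i.
    Returns a list whose entry i is the full buffer held by process i.
    """
    p = len(local_blocks)
    # Each process starts with only its own block, stored in slot `rank`.
    buffers = [[None] * p for _ in range(p)]
    for rank, block in enumerate(local_blocks):
        buffers[rank][rank] = block

    # p - 1 steps: in step s, rank r forwards block (r - s) mod p
    # to its right-hand neighbour (r + 1) mod p.
    for step in range(p - 1):
        # Collect all messages first so every send uses the pre-step state,
        # which mimics all processes sending simultaneously.
        messages = []
        for rank in range(p):
            idx = (rank - step) % p
            messages.append(((rank + 1) % p, idx, buffers[rank][idx]))
        for dest, idx, block in messages:
            buffers[dest][idx] = block
    return buffers


if __name__ == "__main__":
    for rank, buf in enumerate(ring_allgather(["a", "b", "c", "d"])):
        print(rank, buf)
```

Every process needs all p - 1 steps to complete, so in this baseline a late-arriving process delays its neighbours; that is the sensitivity to imbalanced arrival patterns the abstract addresses. As a side note on the reported figures, a speed-up factor of 1.9 corresponds to a time reduction of 1 - 1/1.9 ≈ 47%.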

Binary Precision Neural Network Manycore Accelerator

ACM Journal on Emerging Technologies in Computing Systems ◽

10.1145/3423136 ◽

2021 ◽

Vol 17 (2) ◽

pp. 1-27

Author(s):

Morteza Hosseini ◽

Tinoosh Mohsenin

Keyword(s):

Neural Network ◽

Low Power ◽

Image Classification ◽

Case Studies ◽

Average Power ◽

Total Power ◽

Fabrication Technology ◽

Population Count ◽

Cluster Architecture ◽

Domain Specific

This article presents a low-power, programmable, domain-specific manycore accelerator, Binarized neural Network Manycore Accelerator (BiNMAC), which adopts and efficiently executes binary precision weight/activation neural network models. Such networks have compact models in which weights are constrained to only 1 bit, several of which can be packed into one memory entry, minimizing the memory footprint. Packing weights also facilitates executing single instruction, multiple data with simple circuitry, which allows maximizing performance and efficiency. The proposed BiNMAC has light-weight cores that support domain-specific instructions, and a router-based memory access architecture that helps with efficient implementation of layers in binary precision weight/activation neural networks of proper size. With only 3.73% and 1.98% area and average power overhead, respectively, novel instructions such as Combined Population-Count-XNOR, Patch-Select, and Bit-based Accumulation are added to the instruction set architecture of the BiNMAC, each of which executes in 1 clock cycle a frequently used function that would otherwise have taken 54, 4, and 3 clock cycles, respectively. Additionally, customized logic is added to every core to transpose 16×16-bit blocks of memory on a bit-level basis, which expedites reshaping intermediate data to be well-aligned for bitwise operations. A 64-cluster architecture of the BiNMAC is fully placed and routed in 65-nm TSMC CMOS technology, where a single cluster occupies an area of 0.53 mm² with an average power of 232 mW at 1-GHz clock frequency and 1.1 V. The 64-cluster architecture takes 36.5 mm² area and, if fully exploited, consumes a total power of 16.4 W and can perform 1,360 Giga Operations Per Second (GOPS) while providing full programmability.
To demonstrate its scalability, four binarized case studies including ResNet-20 and LeNet-5 for high-performance image classification, as well as a ConvNet and a multilayer perceptron for low-power physiological applications were implemented on BiNMAC. The implementation results indicate that the population-count instruction alone can improve performance by approximately 5×. When other new instructions are added to a RISC machine with existing population-count instruction, the performance is increased by 58% on average. To compare the performance of the BiNMAC with other commercial-off-the-shelf platforms, the case studies with their double-precision floating-point models are also implemented on the NVIDIA Jetson TX2 SoC (CPU+GPU). The results indicate that, within a margin of ∼2.1%–9.5% accuracy loss, BiNMAC on average outperforms the TX2 GPU by approximately 1.9× (or 7.5× with fabrication technology scaled) in energy consumption for image classification applications. In low-power settings and within a margin of ∼3.7%–5.5% accuracy loss compared to ARM Cortex-A57 CPU implementation, BiNMAC is roughly 9.7×–17.2× (or 38.8×–68.8× with fabrication technology scaled) more energy efficient for physiological applications while meeting the application deadline.
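The role of a combined XNOR and population-count operation in binarized networks can be sketched in software. The example below is a generic illustration of the well-known XNOR-popcount dot product for ±1 vectors packed one bit per element; it is not BiNMAC's instruction set, and the helper names `pack_bits` and `binary_dot` as well as the bit encoding (bit 1 for +1, bit 0 for -1) are assumptions.

```python
def pack_bits(values):
    """Pack a sequence of +1/-1 values into an int (bit i is 1 for +1)."""
    word = 0
    for i, v in enumerate(values):
        if v == 1:
            word |= 1 << i
    return word


def binary_dot(a_word, w_word, n):
    """Dot product of two packed +1/-1 vectors of length n.

    XNOR marks the positions where both vectors agree (product +1);
    the population count of that word gives the number of agreements.
    With m agreements and n - m disagreements the dot product is
    m - (n - m) = 2*m - n.
    """
    mask = (1 << n) - 1                   # keep only the n valid bits
    agree = ~(a_word ^ w_word) & mask     # XNOR restricted to n bits
    matches = bin(agree).count("1")       # population count
    return 2 * matches - n


if __name__ == "__main__":
    activations = [1, -1, 1, 1]
    weights = [1, 1, -1, 1]
    print(binary_dot(pack_bits(activations), pack_bits(weights), 4))
```

In this software sketch the XOR, inversion, masking, and bit counting are separate steps; fusing such a sequence into one instruction is the kind of saving the abstract attributes to its domain-specific instructions.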

Energy efficient fixed-cluster architecture for wireless sensor networks

Journal of Intelligent & Fuzzy Systems ◽

10.3233/jifs-192177 ◽

2021 ◽

Vol 40 (5) ◽

pp. 8727-8740

Author(s):

Rajvir Singh ◽

C. Rama Krishna ◽

Rajnish Sharma ◽

Renu Vig

Keyword(s):

Wireless Sensor Networks ◽

Sensor Networks ◽

Data Aggregation ◽

Energy Efficient ◽

Search Algorithm ◽

Cluster Formation ◽

Wireless Sensor ◽

Efficient Operation ◽

Cluster Architecture ◽

Dynamic Cluster

Dynamic and frequent re-clustering of nodes along with data aggregation is used to achieve energy-efficient operation in wireless sensor networks. But dynamic cluster formation supports data aggregation only when clusters can be formed using any set of nodes that lie in close proximity to each other. Frequent re-clustering makes network management difficult and adversely affects the use of energy-efficient TDMA-based scheduling for data collection within the clusters. To circumvent these issues, a centralized Fixed-Cluster Architecture (FCA) is proposed in this paper. The proposed scheme leads to a simplified network implementation for smart spaces, where it makes more sense to aggregate data that belongs to a cluster of sensors located within the confines of a designated area. A comparative study is done with dynamic clusters formed with a distributed Low Energy Adaptive Clustering Hierarchy (LEACH) and a centralized Harmonic Search Algorithm (HSA). Using a uniform cluster size for FCA, the results show that it utilizes the available energy efficiently, providing stability period values that are 56% and 41% higher than those of LEACH and HSA, respectively.

PERFORMANCE ANALYSIS OF THE SUPERCOMPUTER BASED ON RASPBERRY PI NODES

Journal of Military Science and Technology ◽

10.54939/1859-1043.j.mst.72a.2021.76-86 ◽

2021 ◽

pp. 76-86

Author(s):

Hai

Keyword(s):

Performance Analysis ◽

Energy Efficient ◽

Multicore Processors ◽

Electrical Power ◽

Experimental Results ◽

Raspberry Pi ◽

Cooling Towers ◽

Cluster Architecture ◽

Cooling Technology ◽

Almost All

In this paper, a new Raspberry Pi supercomputer cluster architecture is proposed. Generally, to reach petaflops and exaflops speeds, typical modern supercomputers based on 2009-2018 computing technologies must consume between 6 MW and 20 MW of electrical power, almost all of which is converted into heat, requiring costly cooling technology and cooling towers. The management of heat density has remained a key issue for most centralized supercomputers. In our proposed architecture, supercomputers built from highly energy-efficient mobile ARM processors are a new choice, as they make it possible to address performance, power, and cost issues. With ARM’s recent introduction of its energy-efficient 64-bit CPUs targeting servers, Raspberry Pi cluster module-based supercomputing is now within reach. But how well do supercomputers based on mobile multicore processors perform? Experimental results obtained with the proposed approach indicate lower electrical power consumption and higher performance in comparison with previous approaches.

Big Data Cluster Architecture

SQL Server Big Data Clusters ◽

10.1007/978-1-4842-5985-6_2 ◽

2020 ◽

pp. 11-32

Author(s):

Benjamin Weissman ◽

Enrico van de Laar

Keyword(s):

Big Data ◽

Cluster Architecture

Identification of co-located QTLs and genomic regions affecting grapevine cluster architecture

Theoretical and Applied Genetics ◽

10.1007/s00122-018-3269-1 ◽

2018 ◽

Vol 132 (4) ◽

pp. 1159-1177 ◽

Cited By ~ 10

Author(s):

Robert Richter ◽

Doreen Gabriel ◽

Florian Rist ◽

Reinhard Töpfer ◽

Eva Zyprian

Keyword(s):

Cluster Architecture ◽

Genomic Regions

On Construction of a Diskless Cluster Computing Environment in a Computer Classroom

International Journal of Grid and High Performance Computing ◽

10.4018/jghpc.2012100105 ◽

2012 ◽

Vol 4 (4) ◽

pp. 68-88

Author(s):

Chao-Tung Yang ◽

Wen-Feng Hsieh

Keyword(s):

High Performance ◽

Cluster Computing ◽

Relevant Information ◽

Computing Environment ◽

Cluster Architecture ◽

Computer Classroom ◽

Computation Node ◽

Cluster Environment ◽

Computing Performance ◽

Performance Computing

This paper’s objective is to implement and evaluate a high-performance computing environment by clustering idle PCs (personal computers) with diskless slave nodes on campuses to make the most of the available computing capacity. Two cluster platforms, BCCD and DRBL, are used to compare computing performance. The experiment aims to show that DRBL performs better than BCCD. Originally, DRBL was created to facilitate instruction on a Free Software Teaching platform. To that end, DRBL is applied to a computer classroom with 32 PCs so that the PCs can be switched manually or automatically among different OSs (operating systems). The bioinformatics program mpiBLAST also runs smoothly in the cluster architecture. From a management point of view, the state of each computation node in the clusters is monitored by “Ganglia”, an existing open-source tool. The authors gather the relevant CPU, memory, and network load information for each computation node in every network section. By comparing aspects of performance, including swap performance and different network environments, they attempt to find the best cluster environment for a computer classroom at the school. Finally, HPL of HPCC is used to demonstrate cluster performance.