Parallel Processing Letters
Latest Publications

Total documents: 1176 (five years: 70)
H-index: 31 (five years: 2)
Published by: World Scientific
ISSN: 0129-6264, 1793-642X

Author(s): Evgeny Eremin

The conventional form of Amdahl’s law states that the speedup of calculations on a multiprocessor machine is limited by a definite constant value, simply because every algorithm contains some non-parallelizable part. This brief paper considers a further general reason that prevents growth of parallel performance: the processes that implement a distributed task cannot start simultaneously, so every process adds some start-up time, further reducing the gain from parallel processing. The simple formula proposed here to extend Amdahl’s law leads to a less optimistic picture than the classical result: as the number of processing units grows large, the modified speedup does not approach a constant but vanishes. This is the result of competition between two factors: the per-process computation decreases while the total start-up time increases as the number of parallel processes grows. The effect may be mitigated by imposing a specific regularity on the launching of parallel processes.
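
A rough numerical illustration of this behavior, as a sketch only: the additive per-process start-up term and its magnitude below are assumptions chosen for illustration, not the paper's exact formula.

```python
def amdahl(p, f):
    """Classical Amdahl speedup: f is the non-parallelizable fraction."""
    return 1.0 / (f + (1.0 - f) / p)

def amdahl_with_startup(p, f, tau):
    """Assumed extension: each of the p processes adds start-up time tau
    (expressed as a fraction of the serial runtime), so the speedup
    eventually vanishes as p grows instead of approaching 1/f."""
    return 1.0 / (f + (1.0 - f) / p + p * tau)

for p in [1, 4, 16, 64, 256, 1024, 4096]:
    print(f"p={p:5d}  classical={amdahl(p, 0.05):6.2f}  "
          f"with start-up={amdahl_with_startup(p, 0.05, 1e-4):6.2f}")
```

With these illustrative values the classical curve saturates near 1/f = 20, while the start-up-aware curve peaks and then decays toward zero, matching the qualitative claim above.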


Author(s): Sebastian Litzinger, Jörg Keller

Models for energy-efficient static scheduling of parallelizable tasks with deadlines on frequency-scalable parallel machines comprise moldable vs. malleable tasks and continuous vs. discrete frequency levels, plus preemptive vs. non-preemptive task execution with or without task migration. We investigate the tradeoff between scheduling time and energy efficiency when going from continuous to discrete core allocation and frequency levels on a multicore processor, and from preemptive to non-preemptive task execution. To this end, we present a tool to convert a schedule computed for malleable tasks on machines with continuous frequency scaling [Sanders and Speck, Euro-Par (2012)] into one for moldable tasks on a machine with discrete frequency levels. We compare the energy efficiency of the converted schedule to the energy consumed by a schedule produced by the integrated crown scheduler [Melot et al., ACM TACO (2015)] for moldable tasks and a machine with discrete frequency levels. Our experiments with synthetic and application-based task sets indicate that the converted Sanders and Speck schedules, while computed faster, consume more energy on average than crown schedules. Surprisingly, it is not the step from malleable to moldable tasks that is responsible, but the step from continuous to discrete frequency levels. One-time frequency scaling during a task’s execution can compensate for most of the energy overhead caused by frequency discretization.
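
A minimal sketch of why frequency discretization costs energy and why a single frequency switch recovers most of it. The cubic power model, the level set, and the task parameters are illustrative assumptions, not the paper's model.

```python
# Assume dynamic power ~ f^3, so a task of w cycles at frequency f runs for
# w/f time units and consumes f^3 * (w/f) = w * f^2 energy units.

LEVELS = [0.8, 1.2, 1.6, 2.0, 2.4]    # hypothetical discrete levels (GHz)

def energy(w, f):
    return w * f ** 2

w, deadline = 5.0, 3.0                # e.g. 5e9 cycles, 3 s deadline (illustrative)
f_cont = w / deadline                 # slowest continuous frequency meeting the deadline
f_hi = min(f for f in LEVELS if f >= f_cont)   # round up to a discrete level
f_lo = max(f for f in LEVELS if f <= f_cont)

# One-time frequency scaling: run w_lo cycles at f_lo and the rest at f_hi,
# with w_lo chosen so the deadline is met exactly.
w_lo = (deadline - w / f_hi) / (1 / f_lo - 1 / f_hi)
w_hi = w - w_lo

print(f"continuous optimum : {energy(w, f_cont):.2f}")
print(f"discrete, f_hi only: {energy(w, f_hi):.2f}")     # discretization overhead
print(f"two-level split    : {energy(w_lo, f_lo) + energy(w_hi, f_hi):.2f}")
```

Running the sketch shows the single rounded-up level costing noticeably more energy than the continuous optimum, while the two-level split lands close to it, consistent with the compensation effect described above.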


Author(s): Paul Burkhardt

The fastest deterministic algorithms for connected components take logarithmic time and perform superlinear work on a Parallel Random Access Machine (PRAM). These algorithms maintain a spanning forest by merging and compressing trees, which requires pointer-chasing operations that increase memory access latency and are limited to shared-memory systems. Many of these PRAM algorithms are also very complicated to implement. Another popular method is “leader contraction,” where the challenge is to select a constant fraction of leaders that are adjacent to a constant fraction of non-leaders with high probability, but this can require adding more edges than were in the original graph. Instead, we investigate label propagation because it is deterministic, easy to implement, and does not rely on pointer chasing. Label propagation exchanges representative labels within a component using simple graph traversal, but it is inherently difficult to complete in a sublinear number of steps. We overcome these problems with label propagation for graph connectivity. We introduce a surprisingly simple framework for deterministic, undirected graph connectivity using label propagation that is easily adaptable to many computational models. It achieves logarithmic convergence independently of the number of processors and without increasing the edge count. We employ a novel method of propagating directed edges in alternating directions while performing a minimum reduction on vertex labels. We present new algorithms in the PRAM, Stream, and MapReduce models. Given a simple, undirected graph G = (V, E) with n vertices and m edges, our approach takes O(m) work each step, but we can only prove logarithmic convergence on a path graph. Liu and Tarjan (2019) conjectured that it takes O(log n) steps, or possibly O(log^2 n) steps. Our experiments on a range of difficult graphs also suggest logarithmic convergence. We leave the proof of convergence as an open problem.
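
A minimal sequential sketch of the min-reduction label propagation idea. The paper's parallel scheme with directed edges is only approximated here by reversing the sweep direction each round; the structure and names are illustrative, not the paper's algorithm.

```python
def connected_components(n, edges):
    """n vertices 0..n-1; edges as (u, v) pairs. Returns (labels, rounds)."""
    label = list(range(n))               # every vertex starts as its own label
    rounds, changed = 0, True
    while changed:
        changed = False
        # Alternate the direction in which labels flow along each edge.
        sweep = edges if rounds % 2 == 0 else [(v, u) for u, v in edges]
        for u, v in sweep:
            if label[u] < label[v]:      # minimum reduction on vertex labels
                label[v] = label[u]
                changed = True
        rounds += 1
    return label, rounds

labels, rounds = connected_components(6, [(0, 1), (1, 2), (3, 4)])
print(labels, rounds)  # vertices 0-2 share label 0, 3-4 share label 3, 5 alone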


Author(s): Grzegorz Rafał Dec

This paper presents and discusses the implementation of a deep neural network for failure prediction in the cold forging process. The implementation consists of an LSTM layer and a dense layer realized on an FPGA. The network was trained beforehand on a desktop computer using the Keras library for Python, and the resulting weights and biases were embedded into the implementation. The implementation is executed using the DSP blocks available via the Vivado Design Suite, which comply with the IEEE 754 standard. In simulation, the network achieves 100% classification accuracy on the test data and a high calculation speed.
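
A minimal Keras sketch consistent with the description above, an LSTM followed by a dense classification layer. The layer sizes, window length, and feature count are assumptions, not values from the paper.

```python
from tensorflow import keras

TIMESTEPS, FEATURES = 50, 3          # illustrative sensor-signal window

model = keras.Sequential([
    keras.layers.Input(shape=(TIMESTEPS, FEATURES)),
    keras.layers.LSTM(32),                          # recurrent feature extractor
    keras.layers.Dense(1, activation="sigmoid"),    # failure / no-failure output
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

# After training, the weights and biases can be exported for embedding in an
# FPGA implementation:
weights = model.get_weights()        # list of numpy arrays (kernels, biases)
print([w.shape for w in weights])
```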


Author(s): Olfa Hamdi-Larbi, Ichrak Mehrez, Thomas Dufaud

Many applications in scientific computing process very large sparse matrices on parallel architectures. The work presented in this paper is part of a project whose general aim is to develop an auto-tuning system for selecting the best matrix compression format in the context of high-performance computing. The target smart system can automatically select the best compression format for a given sparse matrix, a numerical method processing this matrix, a parallel programming model, and a target architecture. This paper describes the design and implementation of the proposed concept. We consider a case study consisting of a numerical method reduced to the sparse matrix-vector product (SpMV), a set of compression formats, data parallelism as the programming model, and a distributed multi-core platform as the target architecture. This study allows us to extract a set of important novel metrics and parameters relative to the considered programming model. Our metrics are used as input to a machine-learning algorithm that predicts the best matrix compression format. An experimental study targeting a distributed multi-core platform and processing random and real-world matrices shows that our system can improve the accuracy of the machine-learning prediction by up to 7% on average.
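
A hedged sketch of the prediction step: structural features are extracted from a sparse matrix and a trained classifier picks a compression format. The feature set, format list, and model choice are illustrative; the paper's metrics also cover programming-model and platform parameters not shown here, and the training labels below are random stand-ins for benchmark-derived ones.

```python
import numpy as np
import scipy.sparse as sp
from sklearn.tree import DecisionTreeClassifier

FORMATS = ["CSR", "ELL", "COO", "HYB"]       # candidate formats (illustrative)

def features(A):
    """Simple structural features of a sparse matrix."""
    A = A.tocsr()
    row_nnz = np.diff(A.indptr)
    return [
        A.shape[0],                           # number of rows
        A.nnz / (A.shape[0] * A.shape[1]),    # density
        row_nnz.mean(),                       # mean nonzeros per row
        row_nnz.std(),                        # row-length irregularity
        row_nnz.max(),                        # worst-case row
    ]

# Train on a corpus of matrices labeled with their best-measured format,
# then predict for a new matrix.
rng = np.random.default_rng(0)
train = [sp.random(200, 200, density=d, random_state=i)
         for i, d in enumerate(rng.uniform(0.001, 0.05, 40))]
labels = rng.choice(FORMATS, size=40)         # stand-in for benchmark labels

clf = DecisionTreeClassifier().fit([features(A) for A in train], labels)
print(clf.predict([features(sp.random(500, 500, density=0.01))]))
```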


Author(s): Jiafei Liu, Shuming Zhou, Eddie Cheng, Gaolin Chen, Min Li

Multiprocessor systems are commonly deployed for big data analysis owing to the evolution of technologies such as cloud computing, IoT, and social networks. Reliability evaluation is of significant importance for maintaining and improving the fault tolerance of multiprocessor systems, and system-level diagnosis is a primary strategy for identifying the faulty processors in a system. In this paper, we first determine the [Formula: see text]-good-neighbor connectivity of the [Formula: see text]-dimensional Bicube-based multiprocessor system [Formula: see text], a novel variant of the hypercube. We then establish the [Formula: see text]-good-neighbor diagnosability of the Bicube-based multiprocessor system [Formula: see text] under the PMC and MM* models.


Author(s): Othon Michail, Paul G. Spirakis, Michail Theofilatos

We examine the problem of gathering [Formula: see text] agents (multi-agent rendezvous) in dynamic graphs that may change in every round. We consider a variant of the [Formula: see text]-interval connectivity model [9] in which all instances (snapshots) are always connected spanning subgraphs of an underlying graph, not necessarily a clique. The agents are identical, are not equipped with explicit communication capabilities, and are initially arbitrarily positioned on the graph. The problem is for the agents to gather at the same node, not fixed in advance. We first show that the problem becomes impossible to solve if the underlying graph has a cycle. In light of this, we study a relaxed version, called weak gathering, in which the agents are allowed to gather either at the same node or at two adjacent nodes. Our goal is to characterize the class of 1-interval connected graphs and initial configurations in which the problem is solvable, both with and without homebases. On the negative side, we show that when the underlying graph contains a spanning bicyclic subgraph and satisfies an additional connectivity property, weak gathering is unsolvable; we therefore concentrate mainly on unicyclic graphs. As we show, in most instances of initial agent configurations, the agents must meet on the cycle. This adds a difficulty to the problem, as they need to explore the graph and recognize the nodes that form the cycle. We provide a deterministic algorithm for the solvable cases of this problem that runs in [Formula: see text] rounds.


Author(s): Guanlei Xu, Xiaogang Xu, Xiaotong Wang

We discuss the problem of filtering abnormal states out of a large number of quantum states. For this type of problem with [Formula: see text] items to be searched, both traditional search by enumeration and the classical Grover search algorithm have complexity about [Formula: see text]. In this letter, a novel quantum search scheme with exponential speed-up is proposed for abnormal states. First, a new comprehensive quantum operator is designed to extract, with complexity [Formula: see text] and probability 1, the superposition state containing all abnormal states, whose number [Formula: see text] is unknown, via carefully designed parallel phase comparison. Then, every abnormal state is obtained from the [Formula: see text] abnormal states via [Formula: see text] measurements. Finally, a numerical example is given to show the efficiency of the proposed scheme.


Author(s): Usthulamuri Penchalaiah, V. G. Siva Kumar

Digital Signal Processors (DSPs) have a ubiquitous presence in almost all civil and military signal processing applications, including mission-critical environments such as nuclear reactors and process control. The Arithmetic and Logic Unit (ALU), being the heart of any digital signal processor, plays a critical and decisive role in achieving the required parameter benchmarks and the overall efficiency and robustness of the processor. State-of-the-art research has made good progress on critical Multiply-Accumulate (MAC) performance parameters such as reduced power consumption, a small silicon footprint, and reduced delay and design complexity. Judicious placement of the ALU’s building blocks, namely the truncated multiplier and the half-sum carry generation/sum carry generation (HSCG-SCG) adder, together with the choice of adder and multiplier circuits, are the core decisions that determine the overall performance of the ALU. To overcome the drawbacks of existing designs and further improve performance, this work proposes a new architecture for the square-root (SQRT) carry-select adder (CSLA) using half-sum generation (HSG), half-carry generation (HCG), full-sum generation (FSG), and full-carry generation (FCG) blocks. The proposed design is an N-bit architecture, and comparative results are reported for 8-bit, 16-bit, and 32-bit configurations. All designs are implemented in the Xilinx ISE environment, and the results show better area, power, and delay performance than state-of-the-art methods.
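
A behavioral Python sketch of the square-root carry-select principle referenced above: variable-size groups precompute sums for both possible carry-ins, and the true carry selects between them. The paper's HSG/HCG/FSG/FCG block decomposition is abstracted into plain ripple addition inside each group, and the group sizes are illustrative.

```python
def ripple(a_bits, b_bits, cin):
    """Ripple-carry add two equal-length bit lists (LSB first)."""
    out, c = [], cin
    for a, b in zip(a_bits, b_bits):
        out.append(a ^ b ^ c)
        c = (a & b) | (c & (a ^ b))
    return out, c

def sqrt_csla(a, b, width=16, groups=(2, 2, 3, 4, 5)):
    """Add two `width`-bit ints with carry-select groups of increasing size."""
    abits = [(a >> i) & 1 for i in range(width)]
    bbits = [(b >> i) & 1 for i in range(width)]
    result, carry, pos = [], 0, 0
    for g in groups:
        ga, gb = abits[pos:pos + g], bbits[pos:pos + g]
        s0, c0 = ripple(ga, gb, 0)     # precomputed assuming carry-in = 0
        s1, c1 = ripple(ga, gb, 1)     # precomputed assuming carry-in = 1
        result += s1 if carry else s0  # multiplexer: real carry selects
        carry = c1 if carry else c0
        pos += g
    return sum(bit << i for i, bit in enumerate(result)) + (carry << width)

assert sqrt_csla(40000, 30000) == 70000
print(hex(sqrt_csla(0xFFFF, 1)))  # 0x10000: carry out of the top group
```

The square-root naming comes from the group sizes growing roughly so that each group's ripple delay matches the arrival of its select carry, which minimizes overall delay for a given width.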

