linear speedup
Recently Published Documents

TOTAL DOCUMENTS: 59 (five years: 10)
H-INDEX: 9 (five years: 2)

Author(s):  
Moath Alsafasfeh ◽  
Bradley Bazuin ◽  
Ikhlas Abdel-Qader

Real-time inspection of large-scale solar installations can take a long time to reveal hazardous conditions arising from failures during normal panel operation, so early hazard detection is important. Reducing execution time and improving system performance are the ultimate goals of multiprocessing and multicore systems. The embedded system proposed here for solar panel inspection performs real-time processing and analysis of video from two cameras, a thermal camera and a charge-coupled device (CCD) camera, mounted on a drone. The inspection method requires additional time to capture and process the frames and to detect faulty panels, and the system can determine the longitude and latitude of each defect in real time. In this work, we investigate parallel processing of the image processing operations to reduce the processing time of the inspection system. Using the multiprocessing module in Python, we execute the fault detection algorithms on frames streamed from both video cameras. The experimental results show a super-linear speedup for thermal and CCD video processing: execution time is reduced by an average factor of 3.1 with 2 processes and 6.3 with 4 processes, enabling real-time condition monitoring of large-scale solar systems.
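A minimal sketch of the frame-level parallelism described above, using Python's multiprocessing module; the threshold-based detect_faults routine and the synthetic frame source are hypothetical stand-ins, not the authors' fault-detection code:

```python
# Sketch of frame-level parallelism with Python's multiprocessing module.
# detect_faults() and the synthetic frame generator are placeholders for
# the actual thermal/CCD fault-detection pipeline.
import multiprocessing as mp
import numpy as np

def detect_faults(frame):
    """Placeholder hot-spot detector: flag pixels above a fixed threshold."""
    mask = frame > 200          # assumed 8-bit thermal intensity threshold
    return int(mask.sum())      # number of "hot" pixels in the frame

def frame_stream(n_frames, shape=(480, 640)):
    """Stand-in for frames streamed from the drone-mounted cameras."""
    rng = np.random.default_rng(0)
    for _ in range(n_frames):
        yield rng.integers(0, 256, size=shape, dtype=np.uint8)

if __name__ == "__main__":
    with mp.Pool(processes=4) as pool:      # 2 or 4 worker processes in the paper
        results = pool.map(detect_faults, frame_stream(100))
    print("frames with hot pixels:", sum(r > 0 for r in results))
```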


2021 ◽  
Vol 3 ◽  
Author(s):  
Yuepeng Li ◽  
Qiang Chen ◽  
Dave M. Kelly ◽  
Keqi Zhang

In this study, a parallel extension of the Coastal and Estuarine Storm Tide (CEST) model is developed and applied to simulate the storm surge in South Florida induced by Hurricane Irma in 2017. An improvement is also made to the existing advection algorithm in CEST through the introduction of a high-order, monotone semi-Lagrangian advection scheme. Distributed-memory parallelization is implemented with the Message Passing Interface (MPI) library, so the parallel CEST model can run efficiently on machines ranging from multicore laptops to massively parallel High Performance Computing (HPC) systems. The principal advantage of running the CEST model on multiple cores is that relatively low run times become possible for real-world storm surge simulations on high-resolution grids, especially near the locality where the hurricane makes landfall. Computational time is critical for storm surge forecasting: simulations must finish within 30 min so that results are available to users before the arrival of the next advisory. In this study, the simulation of the storm surge induced by Hurricane Irma took approximately 22 min for a 4-day simulation period, and the results were validated against field measurements. Further efficiency analysis reveals that the parallel CEST model achieves linear speedup when the number of processors is not very large.
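For readers unfamiliar with the distributed-memory pattern the abstract refers to, the toy below shows a 1-D domain decomposition with halo exchange using mpi4py; it is an illustrative stand-in, not the CEST solver, and the grid size and update rule are invented.

```python
# Toy 1-D domain decomposition with halo exchange via MPI (mpi4py).
# Not the CEST model: the grid size and the "advection-like" update
# are made up for the example.
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()

N = 1024                      # assumed global number of grid cells
local_n = N // size           # cells owned by this rank (assumes N % size == 0)
u = np.zeros(local_n + 2)     # local field with one ghost cell on each side
u[1:-1] = rank                # arbitrary initial data

left = rank - 1 if rank > 0 else MPI.PROC_NULL
right = rank + 1 if rank < size - 1 else MPI.PROC_NULL

for step in range(100):
    # Exchange halo cells with neighbouring subdomains.
    comm.Sendrecv(u[1:2], dest=left, recvbuf=u[-1:], source=right)
    comm.Sendrecv(u[-2:-1], dest=right, recvbuf=u[0:1], source=left)
    # Toy upwind-style update standing in for the advection solve.
    u[1:-1] = 0.5 * (u[1:-1] + u[0:-2])

local_sum = np.array([u[1:-1].sum()])
total = np.zeros(1)
comm.Reduce(local_sum, total, op=MPI.SUM, root=0)
if rank == 0:
    print("global sum:", total[0])
```

Run with, for example, mpiexec -n 4 python halo_demo.py; the same script scales from a laptop to an HPC node simply by changing the process count.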


Author(s):  
Diogo Marques ◽  
Aleksandar Ilic ◽  
Leonel Sousa

Continuous enhancements and growing diversity in modern multi-core hardware, such as wider and deeper core pipelines and memory subsystems, make it hard to model the upper-bound capabilities of these processors and to identify the main application bottlenecks. Insightful roofline models are widely used for this purpose, but existing approaches overly abstract the micro-architecture complexity, providing unrealistic performance bounds that lead to a misleading characterization of real-world applications. To address this problem, the Mansard Roofline Model (MaRM) proposed in this work uncovers a minimum set of architectural features that must be considered to provide insightful, yet accurate and realistic, modeling of performance upper bounds for modern processors. By encapsulating the retirement constraints imposed by the number of retirement slots and the Reorder-Buffer and Physical Register File sizes, the proposed model accurately captures the capabilities of a real platform (average rRMSE of 5.4%) and characterizes 12 application kernels from standard benchmark suites. By following the proposed MaRM interpretation methodology and guidelines, speedups of up to 5× are obtained when optimizing a real-world bioinformatics application, as well as a super-linear speedup of 18.5× when it is parallelized.
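The classical roofline bound that models such as MaRM refine can be computed in a few lines; the peak values and the extra retirement ceiling below are illustrative assumptions, not measurements from the paper.

```python
# Classical roofline bound plus one additional in-core ceiling, to show the
# kind of refinement MaRM makes. All peak numbers here are invented.
PEAK_FLOPS = 200e9        # assumed peak FP performance [FLOP/s]
PEAK_BW = 50e9            # assumed peak DRAM bandwidth [byte/s]
RETIRE_CAP = 150e9        # assumed ceiling from limited retirement slots [FLOP/s]

def roofline_bound(arithmetic_intensity):
    """Attainable performance for a kernel with the given FLOP/byte ratio."""
    classic = min(PEAK_FLOPS, PEAK_BW * arithmetic_intensity)
    # Models like MaRM add further ceilings for in-core retirement limits.
    return min(classic, RETIRE_CAP)

for ai in (0.25, 1.0, 4.0, 16.0):
    print(f"AI={ai:5.2f} FLOP/byte -> bound {roofline_bound(ai)/1e9:6.1f} GFLOP/s")
```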


2021 ◽  
Vol 17 (2) ◽  
pp. 1-23
Author(s):  
Saman Biookaghazadeh ◽  
Pravin Kumar Ravi ◽  
Ming Zhao

High-throughput and low-latency Convolutional Neural Network (CNN) inference is increasingly important for many cloud- and edge-computing applications. FPGA-based acceleration of CNN inference has demonstrated various benefits compared to other high-performance devices such as GPGPUs. However, current FPGA CNN-acceleration solutions are based on single-FPGA designs, which are limited by the resources available on one FPGA, and they can only accelerate conventional 2D neural networks. To address these limitations, we present a generic multi-FPGA solution, written in OpenCL, which can accelerate more complex CNNs (e.g., the C3D CNN) and achieves near-linear speedup with respect to the available single-FPGA solutions. The design is built upon the Intel Deep Learning Accelerator architecture, with three extensions. First, it includes updates for better area efficiency (up to 25%) and higher performance (up to 24%). Second, it supports 3D convolutions for more challenging applications such as video learning. Third, it supports multi-FPGA communication for higher inference throughput. The results show that utilizing multiple FPGAs can linearly increase the overall bandwidth while maintaining the same end-to-end latency. In addition, the design can outperform other FPGA 2D accelerators by up to 8.4 times and 3D accelerators by up to 1.7 times.
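A back-of-the-envelope sketch of why pipelining a CNN across several accelerators raises throughput while keeping end-to-end latency roughly constant, as reported above; the per-layer times and the naive contiguous partitioning are invented for illustration and do not reflect the paper's design.

```python
# Toy throughput/latency model for pipelining CNN layers across N devices.
# Per-layer latencies below are hypothetical, not measured on any FPGA.
layer_times_ms = [4.0, 3.5, 5.0, 3.5, 4.0, 4.0]   # hypothetical per-layer latencies

def pipeline_stats(layer_times, n_devices):
    """Split layers into contiguous stages, one stage per device."""
    per_stage = len(layer_times) // n_devices
    stages = [sum(layer_times[i * per_stage:(i + 1) * per_stage])
              for i in range(n_devices)]
    latency = sum(stages)                 # a frame still traverses every layer
    throughput = 1000.0 / max(stages)     # frames/s, limited by the slowest stage
    return latency, throughput

for n in (1, 2, 3):
    lat, thr = pipeline_stats(layer_times_ms, n)
    print(f"{n} device(s): latency {lat:.1f} ms, throughput {thr:.1f} fps")
```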


2019 ◽  
Vol 18 (5s) ◽  
pp. 1-23 ◽  
Author(s):  
Weiwen Jiang ◽  
Edwin H.-M. Sha ◽  
Xinyi Zhang ◽  
Lei Yang ◽  
Qingfeng Zhuge ◽  
...  
Keyword(s):  

Processes ◽  
2019 ◽  
Vol 7 (10) ◽  
pp. 640 ◽  
Author(s):  
Jagadish Torlapati ◽  
T. Prabhakar Clement

In this study, we present the details of an optimization method for parameter estimation in one-dimensional groundwater reactive transport problems using a parallel genetic algorithm (PGA). The performance of the PGA was tested with two problems that have published analytical solutions and two problems with published numerical solutions. The optimization model was provided with the published experimental results and reasonable bounds for the unknown kinetic reaction parameters as inputs. Benchmarking results indicate that the PGA estimated parameter values close to the published ones and predicted the observed trends well for all four problems. OpenMP FORTRAN parallel constructs were used to demonstrate the speedup of the code on an Intel quad-core desktop computer, and the parallel code showed a linear speedup with an increasing number of processors. Furthermore, the performance of the underlying optimization algorithm was tested to evaluate its sensitivity to the various genetic algorithm (GA) parameters, including initial population size, number of generations, and parameter bounds. The PGA used in this study is generic and can be easily scaled to higher-order water quality modeling problems involving real-world applications.
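The paper parallelizes its genetic algorithm with OpenMP FORTRAN constructs; the sketch below illustrates the same idea, parallel fitness evaluation of a population, in Python, with a hypothetical quadratic misfit standing in for the kinetic-parameter estimation objective.

```python
# Toy parallel genetic algorithm: fitness evaluation is farmed out to a
# process pool. The misfit function, bounds, and GA settings are invented
# placeholders, not the paper's OpenMP FORTRAN implementation.
import multiprocessing as mp
import random

BOUNDS = [(0.0, 1.0), (0.0, 5.0)]   # assumed bounds for two reaction parameters
TARGET = (0.3, 2.0)                 # hypothetical "true" parameter values

def fitness(individual):
    """Lower is better: squared distance to the target parameter set."""
    return sum((x - t) ** 2 for x, t in zip(individual, TARGET))

def random_individual():
    return tuple(random.uniform(lo, hi) for lo, hi in BOUNDS)

def mutate(ind, sigma=0.05):
    return tuple(min(hi, max(lo, x + random.gauss(0, sigma)))
                 for x, (lo, hi) in zip(ind, BOUNDS))

if __name__ == "__main__":
    pop = [random_individual() for _ in range(64)]
    with mp.Pool() as pool:                      # one worker per available core
        for gen in range(50):
            scores = pool.map(fitness, pop)      # parallel fitness evaluation
            ranked = [ind for _, ind in sorted(zip(scores, pop))]
            parents = ranked[: len(pop) // 4]    # keep the best quarter
            pop = parents + [mutate(random.choice(parents))
                             for _ in range(len(pop) - len(parents))]
    print("estimated parameters:", min(pop, key=fitness))
```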


Author(s):  
Bin Gu ◽  
Wenhan Xian ◽  
Heng Huang

Asynchronous parallel stochastic optimization for non-convex problems has become increasingly important in machine learning, especially due to the popularity of deep learning. The Frank-Wolfe (a.k.a. conditional gradient) algorithm has regained much interest because of its projection-free property and its ability to handle structured constraints. However, our understanding of asynchronous stochastic Frank-Wolfe algorithms is extremely limited, especially in the non-convex setting. To address this challenging problem, in this paper we propose an asynchronous stochastic Frank-Wolfe algorithm (AsySFW) and its variance-reduced version (AsySVFW) for solving constrained non-convex optimization problems. More importantly, we prove fast convergence rates for AsySFW and AsySVFW in the non-convex setting. To the best of our knowledge, AsySFW and AsySVFW are the first asynchronous parallel stochastic algorithms with convergence guarantees for constrained non-convex optimization. The experimental results on real high-dimensional gray-scale images not only confirm the fast convergence of our algorithms, but also show a near-linear speedup on a shared-memory parallel system thanks to the lock-free implementation.
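A serial sketch of the stochastic Frank-Wolfe step that AsySFW builds on (the asynchronous, lock-free execution is omitted here); the least-squares objective and the L1-ball constraint are illustrative choices for which the linear minimization oracle has a closed form, not the authors' experimental setup.

```python
# Serial stochastic Frank-Wolfe on an L1-ball constraint: at each step a
# stochastic gradient is fed to a linear minimization oracle (a signed
# coordinate vertex for the L1 ball) and the iterate moves toward that
# vertex, with no projection required. Problem data is synthetic.
import numpy as np

rng = np.random.default_rng(0)
n, d, radius = 500, 50, 5.0
A = rng.standard_normal((n, d))
b = A @ rng.standard_normal(d) * 0.1

def stochastic_grad(x, batch=32):
    idx = rng.integers(0, n, size=batch)
    Ai, bi = A[idx], b[idx]
    return Ai.T @ (Ai @ x - bi) / batch      # minibatch gradient of 0.5*||Ax-b||^2/n

def lmo_l1(g, radius):
    """argmin over the L1 ball of <g, s>: a signed coordinate vertex."""
    s = np.zeros_like(g)
    i = np.argmax(np.abs(g))
    s[i] = -radius * np.sign(g[i])
    return s

x = np.zeros(d)
for t in range(1, 2001):
    g = stochastic_grad(x)
    s = lmo_l1(g, radius)
    gamma = 2.0 / (t + 2)                    # standard diminishing step size
    x = (1 - gamma) * x + gamma * s          # projection-free update
print("objective:", 0.5 * np.mean((A @ x - b) ** 2))
```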


Author(s):  
Amitava Datta ◽  
Amardeep Kaur ◽  
Tobias Lauer ◽  
Sami Chabbouh

Finding clusters in high-dimensional data is a challenging research problem. Subspace clustering algorithms aim to find clusters in all possible subspaces of the dataset, where a subspace is a subset of the dimensions of the data. However, the exponential increase in the number of subspaces with the dimensionality of the data renders most of these algorithms inefficient as well as ineffective. Moreover, these algorithms have data dependencies ingrained in the clustering process, which makes parallelization difficult and inefficient. SUBSCALE is a recent subspace clustering algorithm that scales with the number of dimensions and contains independent processing steps that can be exploited through parallelism. In this paper, we aim to leverage the computational power of widely available multi-core processors to improve the runtime performance of the SUBSCALE algorithm. The experimental evaluation shows linear speedup. Moreover, we develop an approach using graphics processing units (GPUs) for fine-grained data parallelism to accelerate the computation further. First tests of the GPU implementation show very promising results.
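SUBSCALE's parallelism comes from independent per-dimension work; the sketch below illustrates that pattern by farming a simple 1-D density check out to a process pool. It is not the SUBSCALE algorithm itself, and the neighbourhood size and density threshold are invented.

```python
# Illustration of per-dimension parallelism on a multi-core CPU: each
# dimension's dense-point search is an independent task. This is a toy
# density check, not SUBSCALE; EPSILON and MIN_PTS are assumed values.
import multiprocessing as mp
import numpy as np

EPSILON, MIN_PTS = 0.05, 5      # assumed neighbourhood size and density threshold

def dense_points_in_dimension(column):
    """Return indices of points with at least MIN_PTS neighbours within EPSILON."""
    order = np.argsort(column)
    sorted_vals = column[order]
    counts = (np.searchsorted(sorted_vals, sorted_vals + EPSILON, side="right")
              - np.searchsorted(sorted_vals, sorted_vals - EPSILON, side="left"))
    return order[counts >= MIN_PTS]

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    data = rng.random((10_000, 32))                 # 10k points, 32 dimensions
    columns = [data[:, j] for j in range(data.shape[1])]
    with mp.Pool() as pool:                         # one dimension per task
        dense = pool.map(dense_points_in_dimension, columns)
    print("dense points per dimension:", [len(d) for d in dense])
```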

