linear speedup
Recently Published Documents

TOTAL DOCUMENTS: 59 (five years: 10)
H-INDEX: 9 (five years: 2)

Author(s):  
Moath Alsafasfeh ◽  
Bradley Bazuin ◽  
Ikhlas Abdel-Qader

Real-time inspection of large-scale solar installations can take a long time to reveal hazardous conditions arising from failures during normal panel operation, so early hazard detection is important. Reducing execution time and improving system performance are the ultimate goals of multiprocessing and multicore systems. The embedded system proposed here for solar panel inspection performs real-time processing and analysis of video from two cameras, a thermal camera and a charge-coupled device (CCD) camera, mounted on a drone. The inspection method requires additional time to capture and process the frames and to detect faulty panels, and the system can determine the longitude and latitude of each defect in real time. In this work, we investigate parallel processing of the image processing operations to reduce the processing time of the inspection system. Using the multiprocessing module in Python, we execute the fault detection algorithms on frames streamed from both video cameras. The experimental results show a super-linear speedup for thermal and CCD video processing: execution time is reduced by an average factor of 3.1 with 2 processes and 6.3 with 4 processes, enabling real-time condition monitoring of large-scale solar systems.
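A minimal sketch of the frame-level parallelism described above, using Python's multiprocessing module; the threshold-based detect_faults routine and the synthetic frame source are hypothetical stand-ins, not the authors' fault-detection code:

```python
# Sketch of frame-level parallelism with Python's multiprocessing module.
# detect_faults() and the synthetic frame generator are placeholders for
# the actual thermal/CCD fault-detection pipeline.
import multiprocessing as mp
import numpy as np

def detect_faults(frame):
    """Placeholder hot-spot detector: flag pixels above a fixed threshold."""
    mask = frame > 200          # assumed 8-bit thermal intensity threshold
    return int(mask.sum())      # number of "hot" pixels in the frame

def frame_stream(n_frames, shape=(480, 640)):
    """Stand-in for frames streamed from the drone-mounted cameras."""
    rng = np.random.default_rng(0)
    for _ in range(n_frames):
        yield rng.integers(0, 256, size=shape, dtype=np.uint8)

if __name__ == "__main__":
    with mp.Pool(processes=4) as pool:      # 2 or 4 worker processes in the paper
        results = pool.map(detect_faults, frame_stream(100))
    print("frames with hot pixels:", sum(r > 0 for r in results))
```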


2021 ◽  
Vol 3 ◽  
Author(s):  
Yuepeng Li ◽  
Qiang Chen ◽  
Dave M. Kelly ◽  
Keqi Zhang

In this study, a parallel extension of the Coastal and Estuarine Storm Tide (CEST) model is developed and applied to simulate the storm surge in South Florida induced by Hurricane Irma in 2017. An improvement is also made to the existing advection algorithm in CEST through the introduction of a high-order, monotone semi-Lagrangian advection scheme. Distributed-memory parallelization is implemented with the Message Passing Interface (MPI) library, so the parallel CEST model can run efficiently on machines ranging from multicore laptops to massively parallel High Performance Computing (HPC) systems. The principal advantage of running the CEST model on multiple cores is that relatively low run times become possible for real-world storm surge simulations on high-resolution grids, especially near the locality where the hurricane makes landfall. Computational time is critical for storm surge forecasting: simulations must finish within 30 min so that results are available to users before the arrival of the next advisory. In this study, the simulation of the storm surge induced by Hurricane Irma took approximately 22 min for a 4-day simulation period, and the results were validated against field measurements. Further efficiency analysis reveals that the parallel CEST model achieves linear speedup when the number of processors is not very large.
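For readers unfamiliar with the distributed-memory pattern the abstract refers to, the toy below shows a 1-D domain decomposition with halo exchange using mpi4py; it is an illustrative stand-in, not the CEST solver, and the grid size and update rule are invented.

```python
# Toy 1-D domain decomposition with halo exchange via MPI (mpi4py).
# Not the CEST model: the grid size and the "advection-like" update
# are made up for the example.
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()

N = 1024                      # assumed global number of grid cells
local_n = N // size           # cells owned by this rank (assumes N % size == 0)
u = np.zeros(local_n + 2)     # local field with one ghost cell on each side
u[1:-1] = rank                # arbitrary initial data

left = rank - 1 if rank > 0 else MPI.PROC_NULL
right = rank + 1 if rank < size - 1 else MPI.PROC_NULL

for step in range(100):
    # Exchange halo cells with neighbouring subdomains.
    comm.Sendrecv(u[1:2], dest=left, recvbuf=u[-1:], source=right)
    comm.Sendrecv(u[-2:-1], dest=right, recvbuf=u[0:1], source=left)
    # Toy upwind-style update standing in for the advection solve.
    u[1:-1] = 0.5 * (u[1:-1] + u[0:-2])

local_sum = np.array([u[1:-1].sum()])
total = np.zeros(1)
comm.Reduce(local_sum, total, op=MPI.SUM, root=0)
if rank == 0:
    print("global sum:", total[0])
```

Run with, for example, mpiexec -n 4 python halo_demo.py; the same script scales from a laptop to an HPC node simply by changing the process count.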


Author(s):  
Diogo Marques ◽  
Aleksandar Ilic ◽  
Leonel Sousa

Continuous enhancements and growing diversity in modern multi-core hardware, such as wider and deeper core pipelines and memory subsystems, make it hard to model the upper-bound capabilities of these processors and to identify the main application bottlenecks. Insightful roofline models are widely used for this purpose, but existing approaches overly abstract the micro-architecture complexity, providing unrealistic performance bounds that lead to a misleading characterization of real-world applications. To address this problem, the Mansard Roofline Model (MaRM) proposed in this work uncovers a minimum set of architectural features that must be considered to provide insightful, yet accurate and realistic, modeling of performance upper bounds for modern processors. By encapsulating the retirement constraints imposed by the number of retirement slots and the Reorder-Buffer and Physical Register File sizes, the proposed model accurately captures the capabilities of a real platform (average rRMSE of 5.4%) and characterizes 12 application kernels from standard benchmark suites. By following the proposed MaRM interpretation methodology and guidelines, speedups of up to 5× are obtained when optimizing a real-world bioinformatics application, as well as a super-linear speedup of 18.5× when it is parallelized.
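The classical roofline bound that models such as MaRM refine can be computed in a few lines; the peak values and the extra retirement ceiling below are illustrative assumptions, not measurements from the paper.

```python
# Classical roofline bound plus one additional in-core ceiling, to show the
# kind of refinement MaRM makes. All peak numbers here are invented.
PEAK_FLOPS = 200e9        # assumed peak FP performance [FLOP/s]
PEAK_BW = 50e9            # assumed peak DRAM bandwidth [byte/s]
RETIRE_CAP = 150e9        # assumed ceiling from limited retirement slots [FLOP/s]

def roofline_bound(arithmetic_intensity):
    """Attainable performance for a kernel with the given FLOP/byte ratio."""
    classic = min(PEAK_FLOPS, PEAK_BW * arithmetic_intensity)
    # Models like MaRM add further ceilings for in-core retirement limits.
    return min(classic, RETIRE_CAP)

for ai in (0.25, 1.0, 4.0, 16.0):
    print(f"AI={ai:5.2f} FLOP/byte -> bound {roofline_bound(ai)/1e9:6.1f} GFLOP/s")
```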


2021 ◽  
Vol 17 (2) ◽  
pp. 1-23
Author(s):  
Saman Biookaghazadeh ◽  
Pravin Kumar Ravi ◽  
Ming Zhao

High-throughput and low-latency Convolutional Neural Network (CNN) inference is increasingly important for many cloud- and edge-computing applications. FPGA-based acceleration of CNN inference has demonstrated various benefits compared to other high-performance devices such as GPGPUs. However, current FPGA CNN-acceleration solutions are based on single-FPGA designs, which are limited by the resources available on one FPGA, and they can only accelerate conventional 2D neural networks. To address these limitations, we present a generic multi-FPGA solution, written in OpenCL, which can accelerate more complex CNNs (e.g., the C3D CNN) and achieves near-linear speedup with respect to the available single-FPGA solutions. The design is built upon the Intel Deep Learning Accelerator architecture, with three extensions. First, it includes updates for better area efficiency (up to 25%) and higher performance (up to 24%). Second, it supports 3D convolutions for more challenging applications such as video learning. Third, it supports multi-FPGA communication for higher inference throughput. The results show that utilizing multiple FPGAs can linearly increase the overall bandwidth while maintaining the same end-to-end latency. In addition, the design can outperform other FPGA 2D accelerators by up to 8.4 times and 3D accelerators by up to 1.7 times.
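A back-of-the-envelope sketch of why pipelining a CNN across several accelerators raises throughput while keeping end-to-end latency roughly constant, as reported above; the per-layer times and the naive contiguous partitioning are invented for illustration and do not reflect the paper's design.

```python
# Toy throughput/latency model for pipelining CNN layers across N devices.
# Per-layer latencies below are hypothetical, not measured on any FPGA.
layer_times_ms = [4.0, 3.5, 5.0, 3.5, 4.0, 4.0]   # hypothetical per-layer latencies

def pipeline_stats(layer_times, n_devices):
    """Split layers into contiguous stages, one stage per device."""
    per_stage = len(layer_times) // n_devices
    stages = [sum(layer_times[i * per_stage:(i + 1) * per_stage])
              for i in range(n_devices)]
    latency = sum(stages)                 # a frame still traverses every layer
    throughput = 1000.0 / max(stages)     # frames/s, limited by the slowest stage
    return latency, throughput

for n in (1, 2, 3):
    lat, thr = pipeline_stats(layer_times_ms, n)
    print(f"{n} device(s): latency {lat:.1f} ms, throughput {thr:.1f} fps")
```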


2019 ◽  
Vol 18 (5s) ◽  
pp. 1-23 ◽  
Author(s):  
Weiwen Jiang ◽  
Edwin H.-M. Sha ◽  
Xinyi Zhang ◽  
Lei Yang ◽  
Qingfeng Zhuge ◽  
...  
Keyword(s):  

Processes ◽  
2019 ◽  
Vol 7 (10) ◽  
pp. 640 ◽  
Author(s):  
Jagadish Torlapati ◽  
T. Prabhakar Clement

In this study, we present the details of an optimization method for parameter estimation in one-dimensional groundwater reactive transport problems using a parallel genetic algorithm (PGA). The performance of the PGA was tested with two problems that have published analytical solutions and two problems with published numerical solutions. The optimization model was provided with the published experimental results and reasonable bounds for the unknown kinetic reaction parameters as inputs. Benchmarking results indicate that the PGA estimated parameter values close to the published ones and predicted the observed trends well for all four problems. OpenMP FORTRAN parallel constructs were used to demonstrate the speedup of the code on an Intel quad-core desktop computer, and the parallel code showed a linear speedup with an increasing number of processors. Furthermore, the performance of the underlying optimization algorithm was tested to evaluate its sensitivity to the various genetic algorithm (GA) parameters, including initial population size, number of generations, and parameter bounds. The PGA used in this study is generic and can be easily scaled to higher-order water quality modeling problems involving real-world applications.
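The paper parallelizes its genetic algorithm with OpenMP FORTRAN constructs; the sketch below illustrates the same idea, parallel fitness evaluation of a population, in Python, with a hypothetical quadratic misfit standing in for the kinetic-parameter estimation objective.

```python
# Toy parallel genetic algorithm: fitness evaluation is farmed out to a
# process pool. The misfit function, bounds, and GA settings are invented
# placeholders, not the paper's OpenMP FORTRAN implementation.
import multiprocessing as mp
import random

BOUNDS = [(0.0, 1.0), (0.0, 5.0)]   # assumed bounds for two reaction parameters
TARGET = (0.3, 2.0)                 # hypothetical "true" parameter values

def fitness(individual):
    """Lower is better: squared distance to the target parameter set."""
    return sum((x - t) ** 2 for x, t in zip(individual, TARGET))

def random_individual():
    return tuple(random.uniform(lo, hi) for lo, hi in BOUNDS)

def mutate(ind, sigma=0.05):
    return tuple(min(hi, max(lo, x + random.gauss(0, sigma)))
                 for x, (lo, hi) in zip(ind, BOUNDS))

if __name__ == "__main__":
    pop = [random_individual() for _ in range(64)]
    with mp.Pool() as pool:                      # one worker per available core
        for gen in range(50):
            scores = pool.map(fitness, pop)      # parallel fitness evaluation
            ranked = [ind for _, ind in sorted(zip(scores, pop))]
            parents = ranked[: len(pop) // 4]    # keep the best quarter
            pop = parents + [mutate(random.choice(parents))
                             for _ in range(len(pop) - len(parents))]
    print("estimated parameters:", min(pop, key=fitness))
```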


Author(s):  
Bin Gu ◽  
Wenhan Xian ◽  
Heng Huang

Asynchronous parallel stochastic optimization for non-convex problems has become increasingly important in machine learning, especially due to the popularity of deep learning. The Frank-Wolfe (a.k.a. conditional gradient) algorithm has regained much interest because of its projection-free property and its ability to handle structured constraints. However, our understanding of asynchronous stochastic Frank-Wolfe algorithms is extremely limited, especially in the non-convex setting. To address this challenging problem, in this paper we propose an asynchronous stochastic Frank-Wolfe algorithm (AsySFW) and its variance-reduced version (AsySVFW) for solving constrained non-convex optimization problems. More importantly, we prove fast convergence rates for AsySFW and AsySVFW in the non-convex setting. To the best of our knowledge, AsySFW and AsySVFW are the first asynchronous parallel stochastic algorithms with convergence guarantees for constrained non-convex optimization. The experimental results on real high-dimensional gray-scale images not only confirm the fast convergence of our algorithms, but also show a near-linear speedup on a shared-memory parallel system thanks to the lock-free implementation.
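A serial sketch of the stochastic Frank-Wolfe step that AsySFW builds on (the asynchronous, lock-free execution is omitted here); the least-squares objective and the L1-ball constraint are illustrative choices for which the linear minimization oracle has a closed form, not the authors' experimental setup.

```python
# Serial stochastic Frank-Wolfe on an L1-ball constraint: at each step a
# stochastic gradient is fed to a linear minimization oracle (a signed
# coordinate vertex for the L1 ball) and the iterate moves toward that
# vertex, with no projection required. Problem data is synthetic.
import numpy as np

rng = np.random.default_rng(0)
n, d, radius = 500, 50, 5.0
A = rng.standard_normal((n, d))
b = A @ rng.standard_normal(d) * 0.1

def stochastic_grad(x, batch=32):
    idx = rng.integers(0, n, size=batch)
    Ai, bi = A[idx], b[idx]
    return Ai.T @ (Ai @ x - bi) / batch      # minibatch gradient of 0.5*||Ax-b||^2/n

def lmo_l1(g, radius):
    """argmin over the L1 ball of <g, s>: a signed coordinate vertex."""
    s = np.zeros_like(g)
    i = np.argmax(np.abs(g))
    s[i] = -radius * np.sign(g[i])
    return s

x = np.zeros(d)
for t in range(1, 2001):
    g = stochastic_grad(x)
    s = lmo_l1(g, radius)
    gamma = 2.0 / (t + 2)                    # standard diminishing step size
    x = (1 - gamma) * x + gamma * s          # projection-free update
print("objective:", 0.5 * np.mean((A @ x - b) ** 2))
```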


Author(s):  
Amitava Datta ◽  
Amardeep Kaur ◽  
Tobias Lauer ◽  
Sami Chabbouh

Finding clusters in high-dimensional data is a challenging research problem. Subspace clustering algorithms aim to find clusters in all possible subspaces of the dataset, where a subspace is a subset of the dimensions of the data. However, the exponential increase in the number of subspaces with the dimensionality of the data renders most of these algorithms inefficient as well as ineffective. Moreover, these algorithms have data dependencies ingrained in the clustering process, which makes parallelization difficult and inefficient. SUBSCALE is a recent subspace clustering algorithm that scales with the number of dimensions and contains independent processing steps that can be exploited through parallelism. In this paper, we aim to leverage the computational power of widely available multi-core processors to improve the runtime performance of the SUBSCALE algorithm. The experimental evaluation shows linear speedup. Moreover, we develop an approach using graphics processing units (GPUs) for fine-grained data parallelism to accelerate the computation further. First tests of the GPU implementation show very promising results.
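SUBSCALE's parallelism comes from independent per-dimension work; the sketch below illustrates that pattern by farming a simple 1-D density check out to a process pool. It is not the SUBSCALE algorithm itself, and the neighbourhood size and density threshold are invented.

```python
# Illustration of per-dimension parallelism on a multi-core CPU: each
# dimension's dense-point search is an independent task. This is a toy
# density check, not SUBSCALE; EPSILON and MIN_PTS are assumed values.
import multiprocessing as mp
import numpy as np

EPSILON, MIN_PTS = 0.05, 5      # assumed neighbourhood size and density threshold

def dense_points_in_dimension(column):
    """Return indices of points with at least MIN_PTS neighbours within EPSILON."""
    order = np.argsort(column)
    sorted_vals = column[order]
    counts = (np.searchsorted(sorted_vals, sorted_vals + EPSILON, side="right")
              - np.searchsorted(sorted_vals, sorted_vals - EPSILON, side="left"))
    return order[counts >= MIN_PTS]

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    data = rng.random((10_000, 32))                 # 10k points, 32 dimensions
    columns = [data[:, j] for j in range(data.shape[1])]
    with mp.Pool() as pool:                         # one dimension per task
        dense = pool.map(dense_points_in_dimension, columns)
    print("dense points per dimension:", [len(d) for d in dense])
```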

