Exploiting storage redundancy to speed up randomized shared memory simulations


2019 ◽  
Vol 13 ◽  
Author(s):  
Shraddha D. Oza ◽  
Kalyani R. Joshi

Background: Magnetic resonance (MR) imaging plays a significant role in computer-aided diagnostic systems for remote healthcare. In such systems, the soft textures and tissues within the denoised MR image are classified by the segmentation stage using machine learning algorithms such as the Hidden Markov Model. The quality of the MR image is therefore of extreme importance and is decisive for the accuracy of classification and diagnosis. Objective: To provide real-time medical diagnostics in remote intelligent healthcare setups, this work proposes a CUDA GPU-accelerated bilateral filter for fast denoising of 2D high-resolution knee MR images. Method: To achieve optimized GPU performance with better speed-up, the work implements an improved technique that uses on-chip shared memory in combination with the constant cache. Results: A speed-up of 382x is achieved with the proposed optimization technique, 2.7x that obtained with the shared-memory-only approach. The superior speed-up comes with a 90.6% occupancy index, indicating effective parallelization. The work also justifies the appropriateness of the bilateral filter over other filters for denoising magnetic resonance images. All patents related to GPU-based image denoising are reviewed, and the uniqueness of the proposed technique is confirmed. Conclusion: The results indicate that even for a 64-Mpixel image, the execution time of the proposed implementation is only 334.91 ms, making the performance almost real time. This will contribute to the real-time computer-aided diagnostics requirement under remote critical conditions.
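As a CPU-side reference for what the GPU kernel parallelises, the following is a minimal NumPy sketch of the brute-force 2D bilateral filter (each output pixel is a weighted average whose weights combine a spatial Gaussian and an intensity/range Gaussian). Parameter names and the border handling are illustrative, not taken from the paper.

```python
import numpy as np

def bilateral_filter(img, radius=2, sigma_s=1.0, sigma_r=0.1):
    """Brute-force 2D bilateral filter (CPU reference, not the CUDA kernel).

    Weights combine spatial closeness and intensity similarity, so
    smoothing is suppressed across strong edges.
    """
    h, w = img.shape
    pad = np.pad(img, radius, mode='edge')
    out = np.zeros_like(img, dtype=np.float64)
    # Precompute the spatial Gaussian over the filter window.
    ys, xs = np.mgrid[-radius:radius + 1, -radius:radius + 1]
    spatial = np.exp(-(xs ** 2 + ys ** 2) / (2.0 * sigma_s ** 2))
    for i in range(h):
        for j in range(w):
            window = pad[i:i + 2 * radius + 1, j:j + 2 * radius + 1]
            # Range kernel: penalise intensity differences (edge-preserving).
            rng = np.exp(-(window - img[i, j]) ** 2 / (2.0 * sigma_r ** 2))
            wgt = spatial * rng
            out[i, j] = np.sum(wgt * window) / np.sum(wgt)
    return out
```

On the GPU, the per-pixel window reads are exactly what the paper caches in on-chip shared memory, with the precomputed spatial kernel being a natural fit for constant memory.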


1996 ◽  
Vol 162 (2) ◽  
pp. 245-281 ◽  
Author(s):  
Friedhelm Meyer auf der Heide ◽  
Christian Scheideler ◽  
Volker Stemann

2018 ◽  
Vol 18 (5-6) ◽  
pp. 725-758 ◽  
Author(s):  
IAN P. GENT ◽  
IAN MIGUEL ◽  
PETER NIGHTINGALE ◽  
CIARAN MCCREESH ◽  
PATRICK PROSSER ◽  
...  

As multi-core computing is now standard, it seems irresponsible for constraints researchers to ignore its implications. Researchers need to address a number of issues to exploit parallelism, such as: which constraint algorithms are amenable to parallelisation; whether to use shared-memory or distributed computation; whether to use static or dynamic decomposition; and how best to exploit portfolios and cooperating search. We review the literature and see that we can sometimes do quite well, some of the time, on some instances, but we are far from a general solution. Yet there seems to be little overall guidance on how best to exploit multi-core computers to speed up constraint solving. We hope at least that this survey will provide useful pointers to future researchers wishing to correct this situation.
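The portfolio idea mentioned in the survey, i.e. racing several solver configurations on the same instance and taking the first answer, can be sketched in a few lines. The solver callables here are placeholders, not any real constraint solver's API.

```python
from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait

def run_portfolio(solvers, problem):
    """Run several solver configurations on the same problem in parallel
    and return the first answer produced (a portfolio, not a decomposition:
    the solvers race rather than share work)."""
    with ThreadPoolExecutor(max_workers=len(solvers)) as pool:
        futures = [pool.submit(solver, problem) for solver in solvers]
        # Block only until the first configuration finishes.
        done, _ = wait(futures, return_when=FIRST_COMPLETED)
        return next(iter(done)).result()
```

A production portfolio would also cancel the losers; here they simply run to completion in the background, which keeps the sketch short.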


2021 ◽  
Vol 35 (2) ◽  
pp. 16-22
Author(s):  
Su-Gyeong Min ◽  
Sung-Chan Kim

This study evaluates the computational efficiency of the FDS model as a function of its parallel processing mode and domain decomposition method, with the aim of enhancing the computational performance of fire simulation. A single compartment of dimensions 12.0 m × 3.8 m × 3.0 m is considered, along with a rectangular fire source (0.4 m × 0.4 m) fueled by n-heptane. The computational domain was divided into 136,000 cells with a grid size of 0.1 m, and the computational efficiency of each calculation was evaluated by the wall-clock time for a simulation time of 300 s on a computational framework with 24 cores of a single CPU and 256 GB of shared memory. The MPI and hybrid modes of FDS offer greater speed-up capability than the OpenMP mode, and the domain decomposition method used greatly affects the computational efficiency. The maximum speed-up with the OpenMP mode was less than 1.5 for a single computational domain, which indicates that there is an optimal condition for thread assignment and domain decomposition in the OpenMP mode. The present study is expected to contribute toward obtaining effective fire simulation results with limited computing power and time in fire protection engineering.
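The speed-up and efficiency figures quoted above follow from wall-clock times in the usual way. A small sketch, with Amdahl's law included as the idealised upper bound (the numbers in the test are illustrative, not the paper's measured data):

```python
def speedup(t_serial, t_parallel):
    """Speed-up from wall-clock times: S = T1 / Tp."""
    return t_serial / t_parallel

def efficiency(t_serial, t_parallel, n_workers):
    """Parallel efficiency: E = S / p, where p is the number of cores
    (or MPI processes / OpenMP threads) used."""
    return speedup(t_serial, t_parallel) / n_workers

def amdahl(parallel_fraction, n_workers):
    """Amdahl's law: the ideal upper bound on speed-up when only
    parallel_fraction of the runtime parallelises over n_workers."""
    return 1.0 / ((1.0 - parallel_fraction) + parallel_fraction / n_workers)
```

An OpenMP speed-up capped below 1.5 on 24 cores corresponds, by Amdahl's law, to only a small fraction of the runtime actually parallelising, which is consistent with the thread-assignment bottleneck the study identifies.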


2017 ◽  
Vol 4 (9) ◽  
pp. 170436 ◽  
Author(s):  
Gang Mei ◽  
Liangliang Xu ◽  
Nengxiong Xu

This paper focuses on designing and implementing parallel adaptive inverse distance weighting (AIDW) interpolation algorithms on the graphics processing unit (GPU). AIDW is an improved version of standard IDW; it adaptively determines the power parameter according to the spatial distribution pattern of the data points and achieves more accurate predictions than IDW. We first present two versions of the GPU-accelerated AIDW: a naive version that does not use shared memory and a tiled version that takes advantage of it. We implement both versions using two data layouts, structure of arrays and array of aligned structures, in both single and double precision. We then evaluate the performance of parallel AIDW by comparing it with the corresponding serial algorithm on three machines equipped with GT730M, M5000 and K40c GPUs. The experimental results indicate that: (i) there is no significant difference in computational efficiency between the two data layouts; (ii) the tiled version is always slightly faster than the naive version; and (iii) in single precision the achieved speed-up can be up to 763 (on the M5000), while in double precision the highest speed-up obtained is 197 (on the K40c). To benefit the community, all source code and testing data related to the presented parallel AIDW algorithm are publicly available.
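A serial sketch of the idea: plain IDW plus an adaptive choice of the power parameter from the local point density around the query. The adaptation rule below is a deliberate simplification for illustration, not the paper's exact formula.

```python
import numpy as np

def idw(xy, values, query, power):
    """Plain inverse distance weighting at a single query point."""
    d = np.sqrt(((xy - query) ** 2).sum(axis=1))
    if np.any(d == 0):                      # query coincides with a sample
        return float(values[np.argmin(d)])
    w = 1.0 / d ** power
    return float((w * values).sum() / w.sum())

def aidw(xy, values, query, k=4, p_min=1.0, p_max=4.0):
    """Adaptive IDW sketch: raise the power parameter when the k nearest
    samples cluster tightly around the query, so near points dominate;
    a sparse neighbourhood falls back toward a low power (smooth blend).
    The density measure here is a simplification, not the paper's rule."""
    d_all = np.sqrt(((xy - query) ** 2).sum(axis=1))
    d_near = np.sort(d_all)[:k]
    density = d_near.mean() / d_all.mean()  # in (0, 1]: small = dense
    power = p_min + (p_max - p_min) * (1.0 - density)
    return idw(xy, values, query, power)
```

In the GPU versions, the inner distance loop over all data points is what the tiled kernel stages through shared memory, one tile of data points at a time.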


Kybernetes ◽  
1992 ◽  
Vol 21 (7) ◽  
pp. 29-47 ◽  
Author(s):  
K.R. Tout ◽  
D.J. Evans

Applies a parallel backward‐chaining technique to a rule‐based expert system on a shared‐memory multiprocessor system. The condition for a processor to split up its search tree (task‐node) and generate new OR nodes is based on the level in the goal tree at which the task‐node is found. The results indicate satisfactory speed‐up performance for a small number of processors (< 10) and a reasonably large number of rules.
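The OR nodes being split across processors are the alternative rules that could prove a goal; a minimal sequential backward chainer makes that branching explicit. This is an illustrative sketch of the technique, not the authors' implementation, and it assumes an acyclic rule base apart from the guard shown.

```python
def backward_chain(goal, rules, facts, _seen=frozenset()):
    """Depth-first backward chaining: a goal holds if it is a known fact,
    or if some rule concludes it and all of that rule's premises hold.
    Each matching rule is an OR branch; a parallel version would hand
    these branches to different processors."""
    if goal in facts:
        return True
    if goal in _seen:                       # guard against cyclic rules
        return False
    for premises, conclusion in rules:      # OR nodes: alternative rules
        if conclusion == goal and all(
                backward_chain(p, rules, facts, _seen | {goal})
                for p in premises):         # AND nodes: rule premises
            return True
    return False
```

The splitting condition described in the abstract (spawn new tasks only above a certain depth in the goal tree) would go where the loop iterates over matching rules, so that deep, cheap subgoals stay on one processor.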


1999 ◽  
Vol 09 (04) ◽  
pp. 475-485 ◽  
Author(s):  
YOSHIHIDE IGARASHI ◽  
YASUAKI NISHITANI

We propose two modifications of Peterson's n-process mutual exclusion algorithm for the asynchronous multi-writer/reader shared memory model. Either modification speeds up the original n-process algorithm. The running times for the trying regions of the first and second modified algorithms are (2n - 3)c + O(n^3 l) and (n - 1)c + O(n^3 l), respectively, where n is the number of processes, l is an upper bound on the time between two steps, and c is an upper bound on the time that any user spends in the critical region. These running times improve on the running time O(n^2 c + n^4 l) of the original n-process algorithm for the same asynchronous shared memory model.
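For context, the shared-memory model in question can be illustrated with the textbook filter lock, an n-process generalisation of Peterson's 2-process algorithm using only multi-writer/reader registers. This is background for the model, not either of the paper's modified algorithms.

```python
import threading

class FilterLock:
    """Textbook n-process filter lock: a process climbs n-1 levels,
    waiting at each level while it is the victim there and some other
    process is at the same level or higher. At most one process at a
    time passes the top level."""
    def __init__(self, n):
        self.n = n
        self.level = [0] * n        # highest level each process reached
        self.victim = [0] * n       # last process to arrive at each level

    def lock(self, i):
        for lvl in range(1, self.n):
            self.level[i] = lvl
            self.victim[lvl] = i
            # Spin while another process is at this level or above
            # and we are still the victim at this level.
            while (any(self.level[k] >= lvl
                       for k in range(self.n) if k != i)
                   and self.victim[lvl] == i):
                pass

    def unlock(self, i):
        self.level[i] = 0
```

The running-time bounds in the abstract count exactly this kind of busy-waiting in the trying region, in terms of the step-time bound l and the critical-region bound c.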


Author(s):  
Brian Cross

A relatively new entry in the field of microscopy is the Scanning X-Ray Fluorescence Microscope (SXRFM). Using this type of instrument (e.g. the Kevex Omicron X-ray Microprobe), one can obtain multiple elemental x-ray images from the analysis of heterogeneous materials. The SXRFM obtains images by collimating an x-ray beam (e.g. 100 μm diameter) and then scanning the sample with a high-speed x-y stage. To speed up image acquisition, data is acquired "on-the-fly" by slew-scanning the stage along the x-axis, like a TV or SEM scan. To reduce the overhead from "fly-back," the images can be acquired by bi-directional scanning of the x-axis, which leaves very little overhead for re-positioning of the sample stage. The image acquisition rate is dominated by the x-ray acquisition rate; therefore the total x-ray image acquisition rate of the SXRFM is very comparable to an SEM's. Although the x-ray spatial resolution of the SXRFM is worse than an SEM's (say 100 vs. 2 μm), there are several other advantages.
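The bi-directional ("fly-back"-free) scan order is easy to state precisely; a small sketch of the coordinate sequence (the stage-control details themselves are, of course, instrument-specific and not shown):

```python
def serpentine_scan(n_rows, n_cols):
    """Yield (row, col) coordinates in bi-directional (boustrophedon)
    raster order: even rows left-to-right, odd rows right-to-left,
    so the stage never has to fly back to the start of a line."""
    for r in range(n_rows):
        cols = range(n_cols) if r % 2 == 0 else range(n_cols - 1, -1, -1)
        for c in cols:
            yield (r, c)
```

Consecutive coordinates always differ by one step, which is why the re-positioning overhead described above nearly vanishes.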


Author(s):  
A. G. Jackson ◽  
M. Rowe

Diffraction intensities from intermetallic compounds are, in the kinematic approximation, proportional to the scattering amplitude of the element doing the scattering. More detailed calculations have shown that site symmetry and occupation by various atom species also affect the intensity of a diffracted beam [1]. Hence, by measuring the intensities of beams, or their ratios, the occupancy can be estimated. Measurement of the intensity values also allows structure calculations to be made to determine the spatial distribution of the potentials doing the scattering. Thermal effects are also present as a background contribution. Inelastic effects such as loss or absorption/excitation complicate the intensity behavior, and dynamical theory is required to estimate the intensity value.

The dynamic range of currents in diffracted beams can be 10^4 or 10^5:1. Hence, detection of such information requires a means for collecting the intensity over a signal-to-noise range beyond that obtainable with a single film plate, which has a S/N of about 10^3:1. Although such a collection system is not available currently, a simple system consisting of instrumentation on an existing STEM can be used as a proof of concept; it has a S/N of about 255:1, limited by the 8-bit pixel attributes used in the electronics. Use of 24-bit pixel attributes would easily allow the desired noise range to be attained in the processing instrumentation. The S/N of the scintillator used by the photoelectron sensor is about 10^6 to 1, well beyond the S/N goal. The trade-off that must be made is the time for acquiring the signal: the pattern can be obtained in seconds using film plates, compared to 10 to 20 minutes using the digital scan. Parallel acquisition would, of course, speed up this process immensely.
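The bit-depth arithmetic above is just powers of two: 8-bit pixels give 2^8 = 256 quantisation levels (about 255:1), while 24 bits comfortably cover a 10^6:1 range. A one-line helper makes the relation explicit:

```python
import math

def bits_for_dynamic_range(ratio):
    """Smallest number of bits whose 2**b quantisation levels cover a
    given intensity ratio (e.g. a 10**4:1 beam-current range)."""
    return math.ceil(math.log2(ratio))
```

By this count, the 10^4-10^5:1 beam-current range needs 14-17 bits, which is why 8-bit electronics cap the proof-of-concept system well below the goal and 24-bit attributes clear it with room to spare.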

