GPU Parallelization of a Hybrid Pseudospectral Geophysical Turbulence Framework Using CUDA

An existing hybrid MPI-OpenMP scheme is augmented with a CUDA-based fine-grained parallelization approach for multidimensional distributed Fourier transforms in a well-characterized pseudospectral fluid turbulence code. The basics of the hybrid scheme are reviewed, and heuristics are provided to show the potential benefit of the CUDA implementation. The method draws heavily on the CUDA runtime library to handle memory management and on the cuFFT library to compute local FFTs. The construction of the interfaces to these libraries, and the use of ISO bindings to facilitate platform portability, are discussed. CUDA streams are implemented to overlap data transfer with cuFFT computation. Testing with a baseline solver demonstrated significant aggregate speed-up over the hybrid MPI-OpenMP solver by offloading to GPUs on an NVLink-based test system. While the batch streamed approach provided little benefit with NVLink, we saw a performance gain of 30% on a PCIe-based system when tuned for the optimal number of streams. Strong GPU scaling was found to be nearly ideal in all cases. Profiling of the CUDA kernels shows that the transform computation achieves 15% of the attainable peak FLOP rate based on a roofline model for the system. In addition to speed-up measurements for the fiducial solver, we also considered several other solvers with different numbers of transform operations and found that aggregate speed-ups are nearly constant across all solvers.
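The roofline comparison mentioned in the abstract can be illustrated with a minimal Python sketch. It assumes the standard roofline definition, in which the attainable rate is the smaller of the peak compute rate and the memory bandwidth multiplied by the arithmetic intensity. The function names and all numeric values are hypothetical placeholders chosen so that the result is 15%; they are not measurements from the paper.

```python
def attainable_flops(peak_flops, mem_bandwidth, arithmetic_intensity):
    """Roofline ceiling in FLOP/s.

    peak_flops: peak compute rate (FLOP/s)
    mem_bandwidth: memory bandwidth (bytes/s)
    arithmetic_intensity: FLOPs performed per byte moved (FLOP/byte)
    """
    return min(peak_flops, mem_bandwidth * arithmetic_intensity)


def fraction_of_attainable(measured_flops, peak_flops, mem_bandwidth,
                           arithmetic_intensity):
    """Measured rate as a fraction of the roofline ceiling."""
    ceiling = attainable_flops(peak_flops, mem_bandwidth, arithmetic_intensity)
    return measured_flops / ceiling


if __name__ == "__main__":
    # Hypothetical device and kernel numbers, for illustration only.
    peak = 7.0e12       # FLOP/s
    bandwidth = 9.0e11  # bytes/s
    intensity = 2.0     # FLOP/byte (memory-bound: 1.8e12 < 7.0e12)
    measured = 2.7e11   # FLOP/s

    frac = fraction_of_attainable(measured, peak, bandwidth, intensity)
    print(f"fraction of attainable peak: {frac:.2f}")
```

With a low arithmetic intensity the ceiling is set by memory bandwidth rather than by peak compute, which is the regime this sketch assumes.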


Automatic Calibration of a Two-Axis Rotary Table for 3D Scanning Purposes

Sensors ◽ 10.3390/s20247107 ◽ 2020 ◽ Vol 20 (24) ◽ pp. 7107

Author(s): Livio Bisogni ◽ Ramtin Mollaiyan ◽ Matteo Pettinari ◽ Paolo Neri ◽ Marco Gabiccini

Keyword(s): Point Clouds ◽ 3D Scanning ◽ Optimal Number ◽ Acquisition Time ◽ Complex Geometries ◽ Automatic Calibration ◽ Camera System ◽ Reprojection Error ◽ Speed Up ◽ Manual Registration

Rotary tables are often used to reduce the acquisition time during the 3D scanning of complex geometries. To avoid manual registration of the point clouds acquired at different orientations, automatic algorithms that compensate for the rotation have been developed. Alternatively, a proper calibration of the rotary axis with respect to the camera system is needed. Several methods are available in the literature, but they only consider single-axis calibration. In this paper, a method for the simultaneous calibration of both axes of the table is proposed. A checkerboard is attached to the table, and several images with different poses are acquired. An optimization algorithm is then set up to determine the orientations and locations of the two axes. A metric to assess the calibration quality is also defined by computing the average mean reprojection error. This metric is used to investigate the optimal number and distribution of the calibration poses, demonstrating that the best calibration results are achieved when a wider dispersion of the calibration poses is adopted.
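The quality metric described above can be sketched in Python. The abstract does not give the exact definition, so this is one plausible reading under stated assumptions: an ideal pinhole camera without lens distortion, checkerboard corners already expressed in the camera frame (that is, after applying the estimated axis transforms), and an "average mean" taken as the average over poses of the per-pose mean pixel error. All function names and parameters are hypothetical.

```python
import math


def project(point, fx, fy, cx, cy):
    """Project a 3D point in the camera frame to pixel coordinates (pinhole model)."""
    x, y, z = point
    return (fx * x / z + cx, fy * y / z + cy)


def mean_reprojection_error(points_cam, observed_px, fx, fy, cx, cy):
    """Mean pixel distance between projected and observed corners for one pose."""
    errors = []
    for point, (u_obs, v_obs) in zip(points_cam, observed_px):
        u, v = project(point, fx, fy, cx, cy)
        errors.append(math.hypot(u - u_obs, v - v_obs))
    return sum(errors) / len(errors)


def average_mean_reprojection_error(poses, fx, fy, cx, cy):
    """Average of the per-pose mean errors; poses is a list of (points, pixels) pairs."""
    per_pose = [mean_reprojection_error(pts, px, fx, fy, cx, cy)
                for pts, px in poses]
    return sum(per_pose) / len(per_pose)


if __name__ == "__main__":
    # Hypothetical intrinsics and two tiny poses, for illustration only.
    fx, fy, cx, cy = 100.0, 100.0, 50.0, 40.0
    pose_a = ([(0.0, 0.0, 1.0), (0.5, 0.25, 2.0)], [(53.0, 44.0), (75.0, 52.5)])
    pose_b = ([(0.0, 0.0, 1.0)], [(50.0, 40.0)])
    print(average_mean_reprojection_error([pose_a, pose_b], fx, fy, cx, cy))
```

A metric of this kind lets different sets of calibration poses be compared with a single number: a lower value indicates that the estimated axes reproduce the observed corner positions more closely.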


HAMR: A dataflow-based real-time in-memory cluster computing engine

The International Journal of High Performance Computing Applications ◽ 10.1177/1094342016672080 ◽ 2016 ◽ Vol 31 (5) ◽ pp. 361-374 ◽ Cited By ~ 3

Author(s): Yao Wu ◽ Long Zheng ◽ Brian Heilig ◽ Guang R Gao

Keyword(s): Big Data ◽ Memory Management ◽ High Performance ◽ Cluster Computing ◽ Programming Model ◽ Distributed Processing ◽ Large Data ◽ Computing System ◽ Fine Grain ◽ Execution Model

As attention to big data grows, cluster computing systems for distributed processing of large data sets have become a mainstream and critical requirement in high performance distributed system research. One of the most successful systems is Hadoop, which uses MapReduce as its programming/execution model and uses disks as the intermediate storage medium to process huge volumes of data. Spark, as an in-memory computing engine, can solve iterative and interactive problems more efficiently. However, the current consensus is that neither is the final solution for big data, owing to the MapReduce-like programming model, the synchronous execution model, the restriction to batch processing, and so on. A new solution, indeed a fundamental evolution, is needed to bring big data solutions into a new era. In this paper, we introduce a new cluster computing system called HAMR, which supports both batch and streaming processing. To achieve better performance, HAMR integrates high performance computing approaches, i.e. dataflow fundamentals, into a big data solution. More specifically, HAMR is designed entirely around in-memory computing to reduce unnecessary disk access overhead; task scheduling and memory management are fine-grained to expose more parallelism; and asynchronous execution improves the efficiency of computation resource usage and the workload balance across the whole cluster. The experimental results show that HAMR can outperform Hadoop MapReduce and Spark by up to 19x and 7x, respectively, in the same cluster environment. Furthermore, HAMR can handle data sizes well beyond the capabilities of Spark.


Optimal Number and Location of TCSC and Loadability Enhancement in Deregulated Electricity Markets Using MINLP

International Journal of Emerging Electric Power Systems ◽ 10.2202/1553-779x.1117 ◽ 2006 ◽ Vol 5 (1) ◽ Cited By ~ 19

Author(s): Ashwani Kumar Sharma

Keyword(s): Linear Programming ◽ Electricity Markets ◽ Test System ◽ Optimal Number ◽ Market Model ◽ Mixed Integer ◽ Programming Approach ◽ Reliability Test ◽ Crucial Issue ◽ Non Linear Programming

This paper proposes a new method for determining the optimal number and location of TCSC devices in deregulated electricity markets using a mixed-integer non-linear programming (MINLP) approach. An optimal number and location of TCSC controllers can effectively enhance system loadability, and their placement is a crucial issue because of their high cost. In the competitive electricity environment, more and more transactions are negotiated, which can compromise system security. It has therefore become essential to determine the secure transactions occurring in the new environment for better planning and management. The system loadability has been determined in a hybrid market model utilizing the secure transaction matrix. The proposed technique has been tested on the IEEE 24-bus reliability test system (RTS).


Employment of Telemedicine in Emergency Medicine

Methods of Information in Medicine ◽ 10.3414/me13-01-0022 ◽ 2014 ◽ Vol 53 (02) ◽ pp. 99-107 ◽ Cited By ~ 27

Author(s): S. Bergrath ◽ R. Rossaint ◽ S. Thelen ◽ T. Brodziak ◽ B. Valentin ◽ ...

Keyword(s): Data Transfer ◽ Field Tests ◽ Test System ◽ Third Generation ◽ Improve Quality ◽ Network Availability ◽ Start Up ◽ Universal Mobile Telecommunications System ◽ Audio Communication ◽ System Properties

Summary. Objectives: Demographic change, rising comorbidity and an increasing number of emergencies are the main challenges that emergency medical services (EMS) in several countries worldwide are facing. In order to improve quality in EMS, highly trained personnel and well-equipped ambulances are essential. However, several studies have shown a deficiency in qualified EMS physicians. Telemedicine emerges as a complementary system in EMS that may provide expertise and improve quality of medical treatment on the scene. Hence our aim is to develop and test a specific teleconsultation system.

Methods: During the development process several use cases were defined and technically specified by medical experts and engineers in the areas of: system administration, start-up of EMS assistance systems, audio communication, data transfer, routine tele-EMS physician activities and research capabilities. Upon completion, technical field tests were performed under realistic conditions to test system properties such as robustness, feasibility and usability, providing end-to-end measurements.

Results: Six ambulances were equipped with telemedical facilities based on the results of the requirement analysis and 55 scenarios were tested under realistic conditions in one month. The results indicate that the developed system performed well in terms of usability and robustness. The major challenges were, as expected, mobile communication and data network availability. Third generation networks were only available in 76.4% of the cases. Although 3G (third generation), such as Universal Mobile Telecommunications System (UMTS), provides beneficial conditions for higher bandwidth, system performance for most features was also acceptable under adequate 2G (second generation) test conditions.

Conclusions: An innovative concept for the use of telemedicine for medical consultations in EMS was developed. Organisational and technical aspects were considered and practical requirements specified. Since technical feasibility was demonstrated in these technical field tests, the next step would be to prove medical usefulness and technical robustness under real conditions in a clinical trial.


Novel type of PXI bus‐based airborne data transfer equipment test system

COMPEL The International Journal for Computation and Mathematics in Electrical and Electronic Engineering ◽ 10.1108/03321640910992065 ◽ 2009 ◽ Vol 28 (6) ◽ pp. 1532-1545

Author(s): Haibin Duan ◽ Haixia Zhang

Keyword(s): Data Transfer ◽ Test System


Roll Casting of Al-SiCp Strip

Materials Science Forum ◽ 10.4028/www.scientific.net/msf.675-677.811 ◽ 2011 ◽ Vol 675-677 ◽ pp. 811-814 ◽ Cited By ~ 1

Author(s): Toshio Haga ◽ Teppei Nakamura ◽ S. Kumai ◽ H. Watari

Keyword(s): Hot Rolling ◽ High Speed ◽ Strip Casting ◽ Cast Strip ◽ Fine Grain ◽ Roll Casting ◽ Speed Up ◽ Cold Rolled ◽ Twin Roll Caster ◽ Twin Roll

Strip casting of Al-SiCp alloy was performed using a high-speed twin roll caster. The SiCp content was 20 vol% and 30 vol%. Both Al-20vol%SiCp and Al-30vol%SiCp strips could be cast continuously at speeds up to 90 m/min. The SiCp particles were distributed uniformly, an effect attributed to the fine grain of the strip. The as-cast Al-20vol%SiCp strip could be cold rolled after homogenization. The as-cast Al-30vol%SiCp strip could be cold rolled after a single hot rolling and annealing. The as-cast Al-20vol%SiCp strip could be coiled at a diameter of 460 mm.


The Design for High-Power Maglev Blower Controller

Advanced Materials Research ◽ 10.4028/www.scientific.net/amr.712-715.2747 ◽ 2013 ◽ Vol 712-715 ◽ pp. 2747-2752

Author(s): Wen Tao Yu ◽ Hong Wei Li ◽ Shu Qin Liu ◽ Yun Peng Zhang

Keyword(s): High Power ◽ Data Transfer ◽ Sewage Treatment ◽ Magnetic Bearing ◽ Plain Bearing ◽ Current Signal ◽ Speed Up ◽ Bandstop Filter ◽ Dual Core

The maglev blower used in sewage treatment has a power of 115 kW and a speed of up to 20,000 rpm, which is 6-8 times the speed of a plain-bearing blower; its volume is only 1/4 that of a plain-bearing blower, and its noise is below 80 dB. The controller is built as a dual-core design combining a DSP and an FPGA. The DSP processes the rotor suspension signal with a digital bandstop filter program and an improved PID program, while the FPGA performs the current signal amplification. The two chips exchange data via a custom protocol. The controller has been applied successfully in the magnetic bearing blower.


Development of a PXI Express Peripheral Module and Data Transfer Platform

10.26686/wgtn.17004682.v1 ◽ 2021

Author(s): Mathew David Bourne

Keyword(s): Data Transfer ◽ Direct Memory Access ◽ Test System ◽ Memory Access ◽ Fpga Design ◽ Design Work ◽ Pci Express ◽ High Data ◽ A Company

Magritek, a company that specialises in NMR and MRI devices, required a new backplane communication solution for data transmission. Possible options were evaluated, and it was decided to move to the PXI Express instrumentation standard. As a first step in moving to this system, an FPGA-based PXI Express Peripheral Module was designed and constructed. To produce this device, details on PXI Express boards and the signals required were researched, and schematics were produced. These were then passed on to the board designer, who incorporated the design with other design work at Magritek to produce a PXI Express Peripheral Module for use as an NMR transceiver board. With the board designed, the FPGA was configured to provide PXI Express functionality. This was designed to allow PCI Express transfers at high data speeds using Direct Memory Access (DMA). The PXI Express Peripheral board was then tested and found to function correctly, providing Memory Write speeds of 228 MB/s and Memory Read speeds of 162 MB/s. In addition, to provide a test system for this physical and FPGA design, backplanes were designed to test communication between PXI Express modules.


Performance of CUDA Unified Memory in CMS Heterogeneous Pixel Reconstruction

EPJ Web of Conferences ◽ 10.1051/epjconf/202125103035 ◽ 2021 ◽ Vol 251 ◽ pp. 03035

Author(s): Matti J. Kortelainen ◽ Martin Kwok

Keyword(s): Explicit Memory ◽ Memory Management ◽ Data Transfer ◽ Programming Model ◽ Performance Impact ◽ Automatic Data ◽ Cms Experiment ◽ Additional Burden ◽ Reconstruction Software ◽ Memory Accesses

The management of separate memory spaces of CPUs and GPUs brings an additional burden to the development of software for GPUs. To help with this, CUDA unified memory provides a single address space that can be accessed from both CPU and GPU. The automatic data transfer mechanism is based on page faults generated by the memory accesses. This mechanism has a performance cost that can be reduced with explicit memory prefetch requests. Various hints on the intended usage of the memory regions can also be given to further improve performance. The overall effect of unified memory compared to explicit memory management can depend heavily on the application. In this paper we evaluate the performance impact of CUDA unified memory using the heterogeneous pixel reconstruction code from the CMS experiment as a realistic use case of GPU-targeting HEP reconstruction software. We also compare the programming model using CUDA unified memory to the explicit management of separate CPU and GPU memory spaces.


On-line reconstruction of electron holograms

Proceedings, annual meeting, Electron Microscopy Society of America ◽ 10.1017/s0424820100151155 ◽ 1993 ◽ Vol 51 ◽ pp. 1064-1065

Author(s): W.D. Ran

Keyword(s): Fourier Transforms ◽ Numerical Data ◽ Complete Information ◽ Objective Lens ◽ Reconstruction Procedure ◽ Exit Surface ◽ Promising Tool ◽ Speed Up ◽ On Line ◽ Electron Image

Off-axis electron holography has proven to be a most promising tool for collecting the complete information about the amplitude and phase modulation of the complex electron image wave in a single micrograph. The amplitude and phase can then be reconstructed numerically, offering almost any wave-optical possibility for evaluating the electron object wave at the exit surface, including the elimination of the influence of coherent aberrations of the objective lens on the image wave. The bottleneck of the whole two-step procedure (registration and reconstruction) has long been the need for highly accurate conversion of the holograms to numerical data, as well as the available computational power, since the necessary Fast Fourier Transforms are time-consuming numerical operations. Modern slow-scan CCD detectors and steadily increasing computing power speed up this process and, combined with state-of-the-art FEG microscopy, supply the reconstructed amplitude and phase of the image wave to the microscopist on-line.
