A Web-Lab Environment for the Study of the Job Shop Problem

This work proposes a Web-Based laboratory where researchers share the facilities of a simulation environment for parallel algorithms which solves scheduling problems known as Job Shop Problem (JSP). The environment supports multi-language platforms and uses a low cost, high performance Graphics Processing Unit (GPU) connected to a Java application server to help design more efficient solutions for JSP. Within a single web environment one can analyze and compare different methods and meta-heuristics. Each newly developed method is stored in an environment library and made available to all other users of the environment. This amassment of openly accessible solution methods will allow for the rapid convergence towards optimal solutions for JSP. The algorithm uses the parallel architecture of the system to handle threads. Each thread represents a job operation and the number of threads scales with the problem’s size. The threads exchange information in order to find the best solution. This cooperation decreases response times by one or two orders of magnitude.

Download Full-text

Parallel Architectures for MEDLINE Search

Encyclopedia of Healthcare Information Systems ◽

10.4018/978-1-59904-889-5.ch130 ◽

2008 ◽

pp. 1048-1055

Author(s):

Rajendra V. Boppana ◽

Suresh Chalasani ◽

Bob Badgett ◽

Jacqueline A. Pugh

Keyword(s):

High Performance Computing ◽

High Performance ◽

Response Times ◽

Low Cost ◽

Parallel Architecture ◽

Fast Response ◽

Parallel Architectures ◽

Medline Search ◽

Medline Database ◽

Performance Computing

In this article, we describe a parallel architecture for MEDLINE database integrated with search refinement tools to facilitate accurate and fast response to search requests by users. The proposed architecture, to be developed by the authors, will use low-cost, high-performance computing clusters consisting of Linux based personal computers and workstations (i) to provide subsecond response times for individual searches and (ii) to support several concurrent queries from search refinement programs such as SUMSearch.

Download Full-text

A TensorFlow Extension Framework for Optimized Generation of Hardware CNN Inference Engines

Technologies ◽

10.3390/technologies8010006 ◽

2020 ◽

Vol 8 (1) ◽

pp. 6 ◽

Cited By ~ 1

Author(s):

Vasileios Leon ◽

Spyridon Mouselinos ◽

Konstantina Koliogeorgi ◽

Sotirios Xydis ◽

Dimitrios Soudris ◽

...

Keyword(s):

Integrated Circuit ◽

High Performance ◽

Graphics Processing Unit ◽

Inference Engine ◽

Efficient Solutions ◽

Processing Unit ◽

Central Processing ◽

Inference Engines ◽

Field Programmable ◽

Hardware Description

The workloads of Convolutional Neural Networks (CNNs) exhibit a streaming nature that makes them attractive for reconfigurable architectures such as the Field-Programmable Gate Arrays (FPGAs), while their increased need for low-power and speed has established Application-Specific Integrated Circuit (ASIC)-based accelerators as alternative efficient solutions. During the last five years, the development of Hardware Description Language (HDL)-based CNN accelerators, either for FPGA or ASIC, has seen huge academic interest due to their high-performance and room for optimizations. Towards this direction, we propose a library-based framework, which extends TensorFlow, the well-established machine learning framework, and automatically generates high-throughput CNN inference engines for FPGAs and ASICs. The framework allows software developers to exploit the benefits of FPGA/ASIC acceleration without requiring any expertise on HDL development and low-level design. Moreover, it provides a set of optimization knobs concerning the model architecture and the inference engine generation, allowing the developer to tune the accelerator according to the requirements of the respective use case. Our framework is evaluated by optimizing the LeNet CNN model on the MNIST dataset, and implementing FPGA- and ASIC-based accelerators using the generated inference engine. The optimal FPGA-based accelerator on Zynq-7000 delivers 93% less memory footprint and 54% less Look-Up Table (LUT) utilization, and up to 10× speedup on the inference execution vs. different Graphics Processing Unit (GPU) and Central Processing Unit (CPU) implementations of the same model, in exchange for a negligible accuracy loss, i.e., 0.89%. For the same accuracy drop, the 45 nm standard-cell-based ASIC accelerator provides an implementation which operates at 520 MHz and occupies an area of 0.059 mm 2 , while the power consumption is ∼7.5 mW.

Download Full-text

Real-time Visualisation and Analysis of Tera-scale Datasets

Proceedings of the International Astronomical Union ◽

10.1017/s1743921314012873 ◽

2012 ◽

Vol 10 (H16) ◽

pp. 679-680

Author(s):

Christopher J. Fluke

Keyword(s):

Real Time ◽

High Performance ◽

Graphics Processing Unit ◽

Low Cost ◽

Processing Unit ◽

Computing Environments ◽

Graphics Processing ◽

Interactive Visualisation ◽

Performance Computing ◽

Scale Data

AbstractAs we move ever closer to the Square Kilometre Array era, support for real-time, interactive visualisation and analysis of tera-scale (and beyond) data cubes will be crucial for on-going knowledge discovery. However, the data-on-the-desktop approach to analysis and visualisation that most astronomers are comfortable with will no longer be feasible: tera-scale data volumes exceed the memory and processing capabilities of standard desktop computing environments. Instead, there will be an increasing need for astronomers to utilise remote high performance computing (HPC) resources. In recent years, the graphics processing unit (GPU) has emerged as a credible, low cost option for HPC. A growing number of supercomputing centres are now investing heavily in GPU technologies to provide O(100) Teraflop/s processing. I describe how a GPU-powered computing cluster allows us to overcome the analysis and visualisation challenges of tera-scale data. With a GPU-based architecture, we have moved the bottleneck from processing-limited to bandwidth-limited, achieving exceptional real-time performance for common visualisation and data analysis tasks.

Download Full-text

A lightweight approach to performance portability with targetDP

The International Journal of High Performance Computing Applications ◽

10.1177/1094342016682071 ◽

2016 ◽

Vol 32 (2) ◽

pp. 288-301

Author(s):

Alan Gray ◽

Kevin Stratford

Keyword(s):

Particle Physics ◽

Message Passing ◽

Graphics Processing Units ◽

High Performance ◽

Large Scale ◽

Message Passing Interface ◽

Graphics Processing Unit ◽

Processing Unit ◽

Performance Portability ◽

Graphics Processing

Leading high performance computing systems achieve their status through use of highly parallel devices such as NVIDIA graphics processing units or Intel Xeon Phi many-core CPUs. The concept of performance portability across such architectures, as well as traditional CPUs, is vital for the application programmer. In this paper we describe targetDP, a lightweight abstraction layer which allows grid-based applications to target data parallel hardware in a platform agnostic manner. We demonstrate the effectiveness of our pragmatic approach by presenting performance results for a complex fluid application (with which the model was co-designed), plus separate lattice quantum chromodynamics particle physics code. For each application, a single source code base is seen to achieve portable performance, as assessed within the context of the Roofline model. TargetDP can be combined with Message Passing Interface (MPI) to allow use on systems containing multiple nodes: we demonstrate this through provision of scaling results on traditional and graphics processing unit-accelerated large scale supercomputers.

Download Full-text

Embedded GPU Implementation for High-Performance Ultrasound Imaging

Electronics ◽

10.3390/electronics10080884 ◽

2021 ◽

Vol 10 (8) ◽

pp. 884

Author(s):

Stefano Rossi ◽

Enrico Boni

Keyword(s):

High Performance ◽

Graphics Processing Unit ◽

Digital Signal ◽

Processing Unit ◽

Embedded Computing ◽

Field Programmable ◽

Peripheral Component Interconnect ◽

Programmable Gate Arrays ◽

Graphics Processing ◽

Signal Processors

Methods of increasing complexity are currently being proposed for ultrasound (US) echographic signal processing. Graphics Processing Unit (GPU) resources allowing massive exploitation of parallel computing are ideal candidates for these tasks. Many high-performance US instruments, including open scanners like ULA-OP 256, have an architecture based only on Field-Programmable Gate Arrays (FPGAs) and/or Digital Signal Processors (DSPs). This paper proposes the implementation of the embedded NVIDIA Jetson Xavier AGX module on board ULA-OP 256. The system architecture was revised to allow the introduction of a new Peripheral Component Interconnect Express (PCIe) communication channel, while maintaining backward compatibility with all other embedded computing resources already on board. Moreover, the Input/Output (I/O) peripherals of the module make the ultrasound system independent, freeing the user from the need to use an external controlling PC.

Download Full-text

Prediction of Residual Stresses in a Multipass Pipe Weld by a Novel 3D Finite Element Approach

Volume 6B: Materials and Fabrication ◽

10.1115/pvp2018-85044 ◽

2018 ◽

Cited By ~ 1

Author(s):

Hui Huang ◽

Jian Chen ◽

Blair Carlson ◽

Hui-Ping Wang ◽

Paul Crooker ◽

...

Keyword(s):

Finite Element ◽

Residual Stresses ◽

High Performance ◽

Large Scale ◽

Graphics Processing Unit ◽

Computational Cost ◽

Three Dimensional ◽

Processing Unit ◽

Girth Welds ◽

Welding Processes

Due to enormous computation cost, current residual stress simulation of multipass girth welds are mostly performed using two-dimensional (2D) axisymmetric models. The 2D model can only provide limited estimation on the residual stresses by assuming its axisymmetric distribution. In this study, a highly efficient thermal-mechanical finite element code for three dimensional (3D) model has been developed based on high performance Graphics Processing Unit (GPU) computers. Our code is further accelerated by considering the unique physics associated with welding processes that are characterized by steep temperature gradient and a moving arc heat source. It is capable of modeling large-scale welding problems that cannot be easily handled by the existing commercial simulation tools. To demonstrate the accuracy and efficiency, our code was compared with a commercial software by simulating a 3D multi-pass girth weld model with over 1 million elements. Our code achieved comparable solution accuracy with respect to the commercial one but with over 100 times saving on computational cost. Moreover, the three-dimensional analysis demonstrated more realistic stress distribution that is not axisymmetric in hoop direction.

Download Full-text

Fast X-Ray Diffraction (XRD) Tomography for Enhanced Identification of Materials

10.36227/techrxiv.17125448.v1 ◽

2021 ◽

Author(s):

Airidas Korolkovas ◽

Alexander Katsevich ◽

Michael Frenkel ◽

William Thompson ◽

Edward Morton

Keyword(s):

Graphics Processing Unit ◽

Low Cost ◽

Photon Counting ◽

Finite Size ◽

Processing Unit ◽

X Ray Diffraction ◽

X Ray ◽

Specific Material ◽

Xrd Patterns ◽

Graphics Processing

X-ray computed tomography (CT) can provide 3D images of density, and possibly the atomic number, for large objects like passenger luggage. This information, while generally very useful, is often insufficient to identify threats like explosives and narcotics, which can have a similar average composition as benign everyday materials such as plastics, glass, light metals, etc. A much more specific material signature can be measured with X-ray diffraction (XRD). Unfortunately, XRD signal is very faint compared to the transmitted one, and also challenging to reconstruct for objects larger than a small laboratory sample. In this article we analyze a novel low-cost scanner design which captures CT and XRD signals simultaneously, and uses the least possible collimation to maximize the flux. To simulate a realistic instrument, we derive a formula for the resolution of any diffraction pathway, taking into account the polychromatic spectrum, and the finite size of the source, detector, and each voxel. We then show how to reconstruct XRD patterns from a large phantom with multiple diffracting objects. Our approach includes a reasonable amount of photon counting noise (Poisson statistics), as well as measurement bias, in particular incoherent Compton scattering. The resolution of our reconstruction is sufficient to provide significantly more information than standard CT, thus increasing the accuracy of threat detection. Our theoretical model is implemented in GPU (Graphics Processing Unit) accelerated software which can be used to assess and further optimize scanner designs for specific applications in security, healthcare, and manufacturing quality control.

Download Full-text

High-Performance, Graphics Processing Unit-Accelerated Fock Build Algorithm

Journal of Chemical Theory and Computation ◽

10.1021/acs.jctc.0c00768 ◽

2020 ◽

Vol 16 (12) ◽

pp. 7232-7238

Author(s):

Giuseppe M. J. Barca ◽

Jorge L. Galvez-Vallejo ◽

David L. Poole ◽

Alistair P. Rendell ◽

Mark S. Gordon

Keyword(s):

High Performance ◽

Graphics Processing Unit ◽

Processing Unit ◽

Graphics Processing

Download Full-text

Ballooning Graphics Memory Space in Full GPU Virtualization Environments

Scientific Programming ◽

10.1155/2019/5240956 ◽

2019 ◽

Vol 2019 ◽

pp. 1-11

Author(s):

Younghun Park ◽

Minwoo Gu ◽

Sungyong Park

Keyword(s):

High Performance ◽

Virtual Machines ◽

Graphics Processing Unit ◽

Performance Degradation ◽

Processing Unit ◽

Memory Space ◽

Memory Size ◽

Memory Sharing ◽

Gpu Virtualization ◽

Graphics Processing

Advances in virtualization technology have enabled multiple virtual machines (VMs) to share resources in a physical machine (PM). With the widespread use of graphics-intensive applications, such as two-dimensional (2D) or 3D rendering, many graphics processing unit (GPU) virtualization solutions have been proposed to provide high-performance GPU services in a virtualized environment. Although elasticity is one of the major benefits in this environment, the allocation of GPU memory is still static in the sense that after the GPU memory is allocated to a VM, it is not possible to change the memory size at runtime. This causes underutilization of GPU memory or performance degradation of a GPU application due to the lack of GPU memory when an application requires a large amount of GPU memory. In this paper, we propose a GPU memory ballooning solution called gBalloon that dynamically adjusts the GPU memory size at runtime according to the GPU memory requirement of each VM and the GPU memory sharing overhead. The gBalloon extends the GPU memory size of a VM by detecting performance degradation due to the lack of GPU memory. The gBalloon also reduces the GPU memory size when the overcommitted or underutilized GPU memory of a VM creates additional overhead for the GPU context switch or the CPU load due to GPU memory sharing among the VMs. We implemented the gBalloon by modifying the gVirt, a full GPU virtualization solution for Intel’s integrated GPUs. Benchmarking results show that the gBalloon dynamically adjusts the GPU memory size at runtime, which improves the performance by up to 8% against the gVirt with 384 MB of high global graphics memory and 32% against the gVirt with 1024 MB of high global graphics memory.

Download Full-text

GPU empowered pipelines for calculating genome-wide kinship matrices with ultra-high dimensional genetic variants and facilitating 1D and 2D GWAS

NAR Genomics and Bioinformatics ◽

10.1093/nargab/lqz009 ◽

2019 ◽

Vol 2 (1) ◽

Author(s):

Wenchao Zhang ◽

Xinbin Dai ◽

Shizhong Xu ◽

Patrick X Zhao

Keyword(s):

Genetic Variants ◽

High Performance ◽

Genome Wide Association Study ◽

Graphics Processing Unit ◽

High Dimensional ◽

Processing Unit ◽

Kinship Matrix ◽

Matrix Operations ◽

Genome Wide ◽

Matrix Calculation

Abstract Genome-wide association study (GWAS) is a powerful approach that has revolutionized the field of quantitative genetics. Two-dimensional GWAS that accounts for epistatic genetic effects needs to consider the effects of marker pairs, thus quadratic genetic variants, compared to one-dimensional GWAS that accounts for individual genetic variants. Calculating genome-wide kinship matrices in GWAS that account for relationships among individuals represented by ultra-high dimensional genetic variants is computationally challenging. Fortunately, kinship matrix calculation involves pure matrix operations and the algorithms can be parallelized, particular on graphics processing unit (GPU)-empowered high-performance computing (HPC) architectures. We have devised a new method and two pipelines: KMC1D and KMC2D for kinship matrix calculation with high-dimensional genetic variants, respectively, facilitating 1D and 2D GWAS analyses. We first divide the ultra-high-dimensional markers and marker pairs into successive blocks. We then calculate the kinship matrix for each block and merge together the block-wise kinship matrices to form the genome-wide kinship matrix. All the matrix operations have been parallelized using GPU kernels on our NVIDIA GPU-accelerated server platform. The performance analyses show that the calculation speed of KMC1D and KMC2D can be accelerated by 100–400 times over the conventional CPU-based computing.

Download Full-text