graphic processing units
Recently Published Documents


TOTAL DOCUMENTS: 178 (five years: 35)

H-INDEX: 15 (five years: 3)

2021 · Vol 4 · pp. 10-15
Author(s): Gennadii Malaschonok, Serhii Sukharskyi

With the development of Big Data and of the fields of study related to artificial intelligence, the need for fast and efficient computing has become one of the most important tasks of our time. That is why, over the recent decade, graphics processing unit (GPU) computing has been actively developing, giving scientists and developers the ability to use the thousands of cores a GPU provides for intensive computations. The goal of this research is to implement the orthogonal decomposition of a matrix by applying a series of Householder transformations in Java, using the JCuda library, and to study the benefits of this approach. Several related papers were examined. Malaschonok and Savchenko introduced an improved version of the QR algorithm for this purpose [4] and achieved better results; however, according to another team of researchers, Lahabar and Narayanan [6], the Householder algorithm is more promising for GPUs. They used single-precision (float) numbers, whereas we use double precision and are, in addition, developing a new BigDecimal type for CUDA. Moreover, there is still no solution for handling huge matrices, where errors in calculations might occur. The algorithm of orthogonal matrix decomposition, which forms the first part of the SVD algorithm, is investigated and implemented in this work. We present an implementation of matrix bidiagonalization and of the calculation of orthogonal factors by the Householder method in the JCuda environment on a graphics processor, together with an implementation of the same algorithm for the central processor for comparison. We experimentally measured the acceleration obtained by using the graphics processor relative to the CPU implementation, showing a speedup of up to 53 times on a large matrix (size 2048) and even better results on more advanced GPUs. At the same time, we still observe larger errors in the calculations on graphics processing units due to synchronization problems. We also compared execution on different platforms (Windows 10 and Arch Linux) and found that computation speeds are almost identical. The results show that better performance can be achieved on the GPU, although this approach brings more implementation difficulties.
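For readers unfamiliar with the method, the sketch below shows a plain-Java, CPU-only Householder QR step of the kind the paper accelerates. The class and variable names are illustrative assumptions and are not taken from the authors' JCuda implementation.

```java
// Minimal CPU sketch of QR decomposition via Householder reflections on a
// dense row-major matrix; illustrative only, not the paper's GPU code.
public final class HouseholderQR {

    /** Reduces a (rows x cols) matrix to upper-triangular R in place,
     *  applying one Householder reflector per column. */
    public static void decompose(double[][] a) {
        int rows = a.length, cols = a[0].length;
        for (int k = 0; k < Math.min(rows - 1, cols); k++) {
            // Build the reflector v for column k (zeroes everything below a[k][k]).
            double norm = 0.0;
            for (int i = k; i < rows; i++) norm += a[i][k] * a[i][k];
            norm = Math.sqrt(norm);
            if (norm == 0.0) continue;
            double alpha = a[k][k] > 0 ? -norm : norm; // sign choice avoids cancellation
            double[] v = new double[rows];
            v[k] = a[k][k] - alpha;
            for (int i = k + 1; i < rows; i++) v[i] = a[i][k];
            double vtv = 0.0;
            for (int i = k; i < rows; i++) vtv += v[i] * v[i];
            if (vtv == 0.0) continue;
            // Apply H = I - 2 v v^T / (v^T v) to the trailing columns.
            for (int j = k; j < cols; j++) {
                double dot = 0.0;
                for (int i = k; i < rows; i++) dot += v[i] * a[i][j];
                double scale = 2.0 * dot / vtv;
                for (int i = k; i < rows; i++) a[i][j] -= scale * v[i];
            }
        }
    }
}
```

On a GPU, the inner column updates are the natural target for parallelization, since each trailing column can be transformed independently.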


2021
Author(s): Charles Mackin, Malte Rasch, An Chen, Jonathan Timcheck, Robert Bruce, ...

Abstract: Analogue memory-based Deep Neural Networks (DNNs) provide energy-efficiency and per-area throughput gains relative to state-of-the-art digital counterparts such as graphic processing units (GPUs). Recent advances focus largely on hardware-aware algorithmic training and on improvements in circuits, architectures, and memory device characteristics. Optimal translation of software-trained weights into analogue hardware weights, given the plethora of complex memory non-idealities, represents an equally important goal in realizing the full potential of this technology. We report a generalized computational framework that automates the process of crafting complex weight programming strategies for analogue memory-based DNNs in order to minimize accuracy degradation during inference, particularly over time. The framework is agnostic to DNN structure and is shown to generalize well across Long Short-Term Memory (LSTM) networks, Convolutional Neural Networks (CNNs), and Transformer networks. Being a highly flexible numerical heuristic, our approach can accommodate arbitrary device-level complexity and is thus broadly applicable to a variety of analogue memories and their continually evolving device characteristics. Interestingly, this computational technique is capable of optimizing inference accuracy without the need to run inference simulations or to evaluate large training, validation, or test datasets. Lastly, by quantifying the limit of achievable inference accuracy given imperfections in analogue memory, weight programming optimization represents a unique and foundational tool for enabling analogue memory-based DNN accelerators to reach their full inference potential.
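As a concrete illustration of what "weight programming" can mean here, the toy sketch below picks a pair of conductance levels whose difference best matches a target weight under an assumed drift model. It is a hedged stand-in, not the authors' framework; the drift exponent, level set, and all names are invented for the example.

```java
// Toy weight-programming sketch (assumptions throughout, not the paper's
// method): choose (gPlus, gMinus) from a discrete level set so that the
// drifted difference stays close to the target weight.
import java.util.Arrays;

public final class WeightProgrammingSketch {

    /** Hypothetical drift model: conductances decay over time (nu assumed). */
    static double drifted(double g, double hoursElapsed) {
        double nu = 0.05; // assumed drift exponent, for illustration only
        return g * Math.pow(Math.max(hoursElapsed, 1.0), -nu);
    }

    /** Exhaustively searches the level set for the best differential pair. */
    static double[] program(double target, double[] levels, double evalHours) {
        double bestErr = Double.POSITIVE_INFINITY;
        double[] best = {0.0, 0.0};
        for (double gp : levels) {
            for (double gm : levels) {
                double w = drifted(gp, evalHours) - drifted(gm, evalHours);
                double err = Math.abs(w - target);
                if (err < bestErr) { bestErr = err; best = new double[]{gp, gm}; }
            }
        }
        return best;
    }

    public static void main(String[] args) {
        double[] levels = {0.0, 0.2, 0.4, 0.6, 0.8, 1.0}; // assumed level set
        System.out.println(Arrays.toString(program(0.35, levels, 24.0)));
    }
}
```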


Electronics · 2021 · Vol 10 (21) · pp. 2630
Author(s): Enrico M. Vitucci, Jonathan S. Lu, Scot Gordon, Jian Jet Zhu, Vittorio Degli-Esposti

In this work, the Discrete Environment-Driven Ray Launching (DED-RL) algorithm, which makes use of parallelization on Graphic Processing Units and was fully described in a previous paper, is validated against a large set of measurements to evaluate its performance in terms of both computational efficiency and accuracy. Three major urban areas are considered, including a very challenging scenario in central San Francisco that was used as a benchmark to test an image-based ray tracing algorithm in a previous work. The results show that DED-RL is as accurate as ray tracing despite a much lower computation time, reduced by more than three orders of magnitude with respect to ray tracing. Moreover, the accuracy depends only marginally on the discretization pixel size, at least within the considered range of pixel sizes. The unprecedented computational efficiency of DED-RL opens the way to numerous applications, ranging from RF coverage optimization of drone-aided cellular networks to efficient fingerprinting localization, as briefly discussed in the paper.


Hydrology · 2021 · Vol 8 (4) · pp. 146
Author(s): Javier Fernández-Pato, Pilar García-Navarro

Numerical simulation of flows that consider the interaction between overland flow and drainage networks has become a practical tool to prevent and mitigate flooding in urban environments, especially during intense storm events, when the limited capacity of the sewer system can trigger floods. Additionally, in order to prevent pollutant dispersion through the drainage network, it is desirable to monitor or control the quality of the water flowing in both domains. In this sense, adding a pollutant transport component to both the surface and the sewer hydraulic models benefits the global analysis of the combined water flow. On the other hand, when considering a realistic large domain with complex topography or street structure, a fine spatial discretization is mandatory. The number of grid cells is then usually very large, and parallelization techniques become necessary; the use of Graphic Processing Units (GPUs) is among the most efficient options, since it leverages thousands of processors within a single device. In this work, an efficient GPU-based 2D shallow water flow solver (RiverFlow2D-GPU) is fully coupled with EPA's Storm Water Management Model (SWMM). Both models are able to perform a transient water quality analysis that takes several pollutants into account. The coupled model, referred to as RiverFlow2D-GPU UD (Urban Drainage), is applied to three real-world cases covering the most common hydraulic situations in urban hydrology and hydraulics. A UK Environment Agency test case is used for model validation, showing good agreement between RiverFlow2D-GPU UD and the other numerical models considered. The efficiency of the model is proven in two more complex domains, leading to simulations more than 100x faster than the traditional CPU computation.
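The key mechanism in such a coupling, the two-way exchange of discharges between the 2D surface model and the 1D sewer model at inlets, can be sketched as below. The interfaces are hypothetical stand-ins written for illustration; they are not the RiverFlow2D-GPU or SWMM APIs.

```java
// Hedged sketch of a surface/sewer coupling step; interface and method
// names are invented stand-ins, not the actual solver APIs.
public final class CoupledStepSketch {

    interface SurfaceModel {                        // e.g. a 2D shallow-water solver
        void advance(double dt);
        double dischargeAtInlet(int inletId);       // m^3/s entering the sewer
        void addSourceAtInlet(int inletId, double q); // surcharge returned to surface
    }

    interface SewerModel {                          // e.g. a 1D drainage-network solver
        void advance(double dt);
        void setInflow(int nodeId, double q);
        double surcharge(int nodeId);               // flow returned when pipes are full
    }

    /** One coupled step: exchange inlet/surcharge discharges, then advance both. */
    static void coupledStep(SurfaceModel surface, SewerModel sewer,
                            int[] inlets, double dt) {
        for (int id : inlets) sewer.setInflow(id, surface.dischargeAtInlet(id));
        sewer.advance(dt);
        for (int id : inlets) surface.addSourceAtInlet(id, sewer.surcharge(id));
        surface.advance(dt);
    }
}
```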


2021 · Vol 8 (1)
Author(s): Pisit Makpaisit, Chantana Chantrapornchai

Abstract: Resource Description Framework (RDF) is commonly used as a standard for data interchange on the web. A collection of RDF data sets can form a large graph that is time-consuming to query. It is known that modern Graphic Processing Units (GPUs) can execute parallel programs to speed up the running time. In this paper, we propose a novel RDF data representation, along with a query processing algorithm, that is suitable for GPU processing. The main challenges of the GPU architecture are the limited memory size, the memory transfer latency, and the vast number of GPU cores; our system is therefore designed to make full use of the GPU cores while reducing the effect of memory transfers. We propose a representation consisting of indices and column-based RDF ID data that reduces the GPU memory requirement. Indexing and pre-upload filtering techniques are then applied to reduce the data transfer between host and GPU memory. We add an index swapping process to facilitate sorting and joining data on a given variable, and a pre-upload step to reduce the size of the result storage and the data transfer time. The experimental results show that our representation is about 35% smaller than the traditional NT format and 40% smaller than that of gStore. Query processing achieves speedups ranging from 1.95 to 397.03 compared with RDF-3X and gStore on the WatDiv test suite, and speedups of 578.57 and 62.97 on the LUBM benchmark compared with RDF-3X and gStore, respectively. The analysis shows which query cases can benefit from our approach.
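The general technique the representation builds on, dictionary-encoding RDF terms into integer IDs stored in per-role columns, can be sketched as follows; field and method names are assumptions, and the growth and index structures of the actual system are omitted.

```java
// Illustrative sketch of a dictionary-encoded, column-based triple layout;
// not the paper's actual data structures.
import java.util.HashMap;
import java.util.Map;

public final class ColumnTripleStore {
    private final Map<String, Integer> dictionary = new HashMap<>();
    private final int[] subjects, predicates, objects; // one column per triple role
    private int size;

    ColumnTripleStore(int capacity) {                  // capacity assumed sufficient
        subjects = new int[capacity];
        predicates = new int[capacity];
        objects = new int[capacity];
    }

    /** Maps an RDF term to a compact integer ID: far smaller than NT strings,
     *  and directly transferable to GPU memory as a plain int array. */
    private int id(String term) {
        return dictionary.computeIfAbsent(term, t -> dictionary.size());
    }

    void add(String s, String p, String o) {
        subjects[size] = id(s);
        predicates[size] = id(p);
        objects[size] = id(o);
        size++;
    }
}
```

Because each column is a contiguous int array, a pre-upload filter can select only the rows a query pattern needs before copying them to the device, which is the transfer-reduction idea the abstract describes.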


Author(s): Ossian O’Reilly, Te-Yang Yeh, Kim B. Olsen, Zhifeng Hu, Alex Breuer, ...

ABSTRACT: We developed a 3D elastic wave propagation solver that supports topography using staggered curvilinear grids. Our method achieves accuracy comparable to the classical fourth-order staggered-grid velocity–stress finite-difference method on a Cartesian grid. We show that the method is provably stable using summation-by-parts operators and boundary conditions imposed weakly via penalty terms. The maximum stable timestep obeys a relationship that depends on the topography-induced grid stretching along the vertical axis. The solutions from the approach are in excellent agreement with verified results for a Gaussian-shaped hill and for a complex topographic model. Compared with a Cartesian grid, the curvilinear grid adds negligible memory requirements but requires longer simulation times due to the smaller timesteps needed for complex topography. The code shows 94% weak scaling efficiency up to 1014 graphic processing units.
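The timestep remark is a CFL-type restriction; a hedged sketch of its generic form, assuming the stretching enters through the local vertical grid spacing (the exact constant and norm are solver-specific and not given in the abstract):

```latex
% Generic CFL-type bound; C, \Delta s_{i,j,k} and v_{p,\max} are assumed
% stand-ins for the solver-specific quantities.
\Delta t \;\le\; C \, \frac{\min_{i,j,k} \Delta s_{i,j,k}}{v_{p,\max}}
```

Here \(\Delta s_{i,j,k}\) is the local, topography-stretched grid spacing and \(v_{p,\max}\) the maximum P-wave speed; steep topography shrinks the minimum spacing and hence the stable timestep, which explains the longer simulation times reported above.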


Universe · 2021 · Vol 7 (7) · pp. 218
Author(s): Iuri La Rosa, Pia Astone, Sabrina D’Antonio, Sergio Frasca, Paola Leaci, ...

We present a new approach to searching for continuous gravitational waves (CWs) emitted by isolated rotating neutron stars, which exploits the highly parallel computing efficiency and computational power of modern Graphic Processing Units (GPUs). Specifically, this paper describes the porting of one of the algorithms used to search for CW signals, the so-called FrequencyHough transform, to the TensorFlow framework. The new code has been fully tested, and its performance on GPUs has been compared to that on a multicore CPU system of the same class, showing a factor-of-10 speed-up. This demonstrates that GPU programming with the general-purpose libraries of a high-level programming language (those of the TensorFlow framework) can significantly improve data analysis performance, opening new perspectives for wide-parameter searches for CWs.
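The idea behind a Hough-style search is that each detected time-frequency peak votes for every signal hypothesis consistent with it. The sketch below shows a generic accumulation into a frequency/spin-down plane, assuming a linear frequency model; it is an illustration of the idea, not the FrequencyHough implementation described in the paper.

```java
// Generic Hough-style accumulation from time-frequency peaks into a
// (f0, fdot) plane; illustrative, not the paper's FrequencyHough code.
public final class HoughSketch {

    /** Each peak (t, f) votes for all (f0, fdot) with f = f0 + fdot * t. */
    static int[][] accumulate(double[] peakTimes, double[] peakFreqs,
                              double f0Min, double df, int nF,
                              double fdotMin, double dFdot, int nFdot) {
        int[][] count = new int[nFdot][nF];
        for (int p = 0; p < peakTimes.length; p++) {
            for (int j = 0; j < nFdot; j++) {
                double fdot = fdotMin + j * dFdot;
                double f0 = peakFreqs[p] - fdot * peakTimes[p]; // consistent f0
                int i = (int) Math.round((f0 - f0Min) / df);
                if (i >= 0 && i < nF) count[j][i]++;            // one vote
            }
        }
        return count;                                           // peaks pile up at the true signal
    }
}
```

The two nested loops are independent across peaks and spin-down bins, which is what makes the transform map so naturally onto GPU tensor operations.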


Materials · 2021 · Vol 14 (12) · pp. 3291
Author(s): Fulu Zheng, Lipeng Chen, Jianbo Gao, Yang Zhao

It has long been a challenge to simulate exciton–phonon dynamics in mesoscale photosynthetic systems accurately and efficiently with a fully quantum mechanical treatment, due to the extensive computational resources required. In this work, we tackle this seemingly intractable problem by combining the Dirac–Frenkel time-dependent variational method with Davydov trial states and implementing the algorithm on graphic processing units. The phonons are treated on the same footing as the exciton. Tested with toy models, namely nanoarrays of the B850 pigments from the light-harvesting 2 complexes of purple bacteria, the methodology is adopted to describe exciton diffusion in huge systems containing more than 1600 molecules. The superradiance enhancement factor extracted from the simulations indicates exciton delocalization over two to three pigments, in agreement with measurements of fluorescence quantum yield and lifetime in B850 systems. A fractal analysis of the exciton dynamics shows that exciton transfer in B850 nanoarrays exhibits a superdiffusive component for about 500 fs. Treating the B850 ring as an aggregate and modeling the inter-ring exciton transfer as incoherent hopping, we also apply the method of classical master equations to estimate exciton diffusion properties in one-dimensional (1D) and two-dimensional (2D) B850 nanoarrays, using derived analytical expressions for the time-dependent excitation probabilities. For both coherent and incoherent propagation, faster energy transfer is found in 2D nanoarrays than in 1D chains, owing to the availability of more propagation channels in the 2D arrangement.
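The incoherent-hopping picture invoked in the last step can be illustrated with a classical master equation on a 1D chain. The sketch below uses forward-Euler integration and an invented nearest-neighbour rate; it is a generic illustration of the technique, not the authors' derived analytical expressions.

```java
// Hedged sketch of the classical-master-equation picture for inter-ring
// hopping: forward-Euler integration of dP/dt = K P on a 1D chain with a
// nearest-neighbour rate k (value and step size are assumptions; dt must
// satisfy 2*k*dt < 1 for this explicit scheme to stay stable).
public final class MasterEquationSketch {

    /** Propagates site populations P under nearest-neighbour hopping rate k. */
    static double[] propagate(int nSites, int startSite, double k,
                              double dt, int nSteps) {
        double[] p = new double[nSites];
        p[startSite] = 1.0;                        // excitation starts on one ring
        double[] next = new double[nSites];
        for (int step = 0; step < nSteps; step++) {
            for (int i = 0; i < nSites; i++) {
                double in = 0.0, out = 0.0;
                if (i > 0)          { in += k * p[i - 1]; out += k; }
                if (i < nSites - 1) { in += k * p[i + 1]; out += k; }
                next[i] = p[i] + dt * (in - out * p[i]); // gain minus loss
            }
            System.arraycopy(next, 0, p, 0, nSites);
        }
        return p;                                  // time-dependent excitation probabilities
    }
}
```

The 2D case adds two more neighbours per site, i.e. more propagation channels, which is the mechanism the abstract cites for the faster transfer in 2D nanoarrays.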

