parallel acceleration
Recently Published Documents

TOTAL DOCUMENTS: 95 (FIVE YEARS: 36)
H-INDEX: 10 (FIVE YEARS: 3)

2021 ◽  
Vol 2021 ◽  
pp. 1-11
Author(s):  
WenYu Feng ◽  
YuanFan Zhu ◽  
JunTai Zheng ◽  
Han Wang

YOLO-Tiny is a lightweight object detection model derived from the original “You only look once” (YOLO) model; it simplifies the network structure and reduces the number of parameters, which makes it suitable for real-time applications. Although the YOLO-Tiny series, which includes YOLOv3-Tiny and YOLOv4-Tiny, can achieve real-time performance on a powerful GPU, it remains challenging to leverage this approach for real-time object detection on embedded computing devices, such as those in small intelligent trajectory cars. To obtain real-time, high-accuracy performance on these embedded devices, a novel lightweight object detection network called embedded YOLO is proposed in this paper. First, a new backbone network structure, the ASU-SPP network, is proposed to enhance the effectiveness of low-level features. Then, we design a simplified neck network module, PANet-Tiny, that reduces computational complexity. Finally, in the detection head module, we use depthwise separable convolution to reduce the number of convolution stacks. In addition, the number of channels is reduced to 96 so that the module can exploit the parallel acceleration offered by most inference frameworks. With its lightweight design, the proposed embedded YOLO model has only 3.53 M parameters, and its average processing speed reaches 155.1 frames per second, as verified on the Baidu smart-car target detection task. At the same time, its detection accuracy is 6% higher than that of YOLOv3-Tiny and YOLOv4-Tiny.
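To make the head design concrete, the sketch below shows the depthwise separable convolution pattern on 96 channels mentioned in the abstract. It is a generic PyTorch illustration, not the authors' released code; the layer layout, activation choice, and 52×52 feature-map size are assumptions.

```python
import torch
import torch.nn as nn

class DepthwiseSeparableConv(nn.Module):
    """Depthwise 3x3 convolution followed by a pointwise 1x1 convolution.
    The 96-channel width follows the abstract; everything else is illustrative."""
    def __init__(self, channels=96):
        super().__init__()
        # groups == in_channels makes the 3x3 convolution depthwise
        self.depthwise = nn.Conv2d(channels, channels, kernel_size=3,
                                   padding=1, groups=channels, bias=False)
        # 1x1 convolution mixes information across channels
        self.pointwise = nn.Conv2d(channels, channels, kernel_size=1, bias=False)
        self.bn = nn.BatchNorm2d(channels)
        self.act = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.act(self.bn(self.pointwise(self.depthwise(x))))

# A standard 3x3 convolution on 96 channels needs 96*96*3*3 = 82,944 weights;
# the depthwise separable version needs 96*3*3 + 96*96 = 10,080.
x = torch.randn(1, 96, 52, 52)
print(DepthwiseSeparableConv()(x).shape)   # torch.Size([1, 96, 52, 52])
```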


Author(s):  
Zhuo Ren ◽  
Yu Gu ◽  
Chuanwen Li ◽  
FangFang Li ◽  
Ge Yu

Abstract. Hyperspace hashing, which is often applied to NoSQL databases, builds indexes by mapping objects with multiple attributes to a multidimensional space. It can accelerate queries on secondary attributes, not just primary keys. In recent years, the rich computing resources of GPUs have provided an opportunity to implement high-performance hyperspace hashing. In this study, we construct GHSH, a fully concurrent, dynamic hyperspace hash table for the GPU. By using atomic operations instead of locking, we make our approach highly parallel and lock-free. We propose a special concurrency control strategy that ensures wait-free read operations. Our data structure is designed with GPU-specific hardware characteristics in mind. We also propose a warp-level pre-combination data-sharing strategy to obtain high parallel acceleration. Experiments on an NVIDIA RTX 2080 Ti GPU suggest that GHSH performs about 20-100x faster than its CPU counterpart. Specifically, GHSH performs updates at up to 396 M updates/s and processes search queries at up to 995 M queries/s. Compared to other GPU hash tables, which cannot conduct queries on non-key attributes, GHSH demonstrates comparable building and retrieval performance.
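As context for how hyperspace hashing serves secondary-attribute queries, here is a minimal single-threaded Python sketch of the idea: each attribute is hashed independently, and the tuple of per-attribute hashes is the bucket coordinate in a multidimensional grid. It is a CPU toy under assumed attribute names and bucket counts, not the lock-free GPU structure described above.

```python
from itertools import product

class HyperspaceHash:
    def __init__(self, attributes, buckets_per_dim=8):
        self.attributes = attributes      # e.g. ("key", "city", "age")
        self.n = buckets_per_dim
        self.table = {}                   # bucket coordinate -> list of objects

    def _coord(self, obj):
        # One hash per attribute; the tuple is the bucket's grid coordinate.
        return tuple(hash(obj[a]) % self.n for a in self.attributes)

    def insert(self, obj):
        self.table.setdefault(self._coord(obj), []).append(obj)

    def search(self, **known):
        # Fix the coordinate where an attribute value is known; enumerate all
        # positions where it is not. The buckets to scan form a hyperplane of
        # the grid rather than the whole table.
        dims = [
            [hash(known[a]) % self.n] if a in known else range(self.n)
            for a in self.attributes
        ]
        for coord in product(*dims):
            for obj in self.table.get(coord, []):
                if all(obj[a] == v for a, v in known.items()):
                    yield obj

# With 3 attributes and 8 buckets per dimension, a query that fixes one
# secondary attribute scans 64 of the 512 buckets instead of the full table.
h = HyperspaceHash(("key", "city", "age"))
h.insert({"key": 1, "city": "Oslo", "age": 30})
h.insert({"key": 2, "city": "Lund", "age": 41})
print(list(h.search(city="Lund")))
```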


2021 ◽  
Author(s):  
Tsubasa Kotani ◽  
Masatoshi Yamauchi ◽  
Hans Nilsson ◽  
Gabriella Stenberg-Wieser ◽  
Martin Wieser ◽  
...  

The ESA/Rosetta spacecraft studied comet 67P/Churyumov-Gerasimenko for two years. The Rosetta Plasma Consortium's Ion Composition Analyser (RPC/ICA) detected water ions of cometary origin that are accelerated to > 100 eV. The majority of them are interpreted as undergoing ordinary pick-up acceleration by the solar wind electric field, perpendicular to the magnetic field, during low comet activity [1,2]. As the comet approaches the Sun, a cometary magnetosphere forms, into which the solar wind cannot intrude.

However, some water ions are accelerated to > 1 keV even inside the magnetosphere [3]. Using two years of RPC/ICA data [4], we investigate acceleration events > 1 keV during which no solar wind is observed, and classify dispersion events with respect to the directions of the Sun, the comet, and the magnetic field. The majority of these water ions show reversed energy-angle dispersion. The investigation also shows that these ions flow along the (enhanced) magnetic field, indicating that parallel acceleration occurs in the magnetosphere.

In this meeting, we present a statistical analysis and discuss a possible acceleration mechanism.

References

[1] H. Nilsson et al., MNRAS 469, 252 (2017), doi:10.1093/mnras/stx1491
[2] G. Nicolaou et al., MNRAS 469, 339 (2017), doi:10.1093/mnras/stx1621
[3] T. Kotani et al., EPSC, EPSC2020-576 (2020), doi:10.5194/epsc2020-576
[4] H. Nilsson et al., Space Sci. Rev. 128, 671 (2007), doi:10.1007/s11214-006-9031-z
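For a sense of scale of the energy thresholds quoted above, the short calculation below converts 100 eV and 1 keV into speeds for a water-group ion (mass ≈ 18 u). It is a back-of-the-envelope illustration added here, not part of the authors' analysis.

```python
import math

U_KG = 1.66054e-27           # atomic mass unit in kg
EV_J = 1.60218e-19           # electron volt in joules
m_h2o = 18.0 * U_KG          # mass of a water-group ion (electron mass neglected)

for energy_ev in (100.0, 1000.0):
    v = math.sqrt(2.0 * energy_ev * EV_J / m_h2o)   # E = (1/2) m v^2
    print(f"{energy_ev:6.0f} eV  ->  {v / 1e3:5.1f} km/s")

# ~32.7 km/s at 100 eV and ~103.5 km/s at 1 keV, i.e. far above the roughly
# km/s outflow speed of the neutral gas from which the ions are born.
```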


2021 ◽  
Vol 14 (2) ◽  
pp. 843-857
Author(s):  
Pavel Perezhogin ◽  
Ilya Chernov ◽  
Nikolay Iakovlev

Abstract. In this paper, we present a parallel version of the finite-element model of the Arctic Ocean (FEMAO) configured for the White Sea and based on MPI technology. This model consists of two main parts: an ocean dynamics model and a surface ice dynamics model. These parts differ greatly in the number of computations because the complexity of the ocean part depends on the bottom depth, while that of the sea-ice component does not. In the first step, we decided to locate both submodels on the same CPU cores with a common horizontal partition of the computational domain. The model domain is divided into small blocks, which are distributed over the CPU cores using Hilbert-curve balancing. Partitioning of the model domain is static (i.e., computed during the initialization stage). There are three baseline options: a single block per core, balancing of 2D computations, and balancing of 3D computations. After showing parallel acceleration for particular ocean and ice procedures, we construct the common partition, which minimizes the joint imbalance of both submodels. Our novelty is the use of arrays shared by all blocks that belong to a CPU core, instead of allocating separate arrays for each block as is usually done. Computations on a CPU core are restricted by the masks of non-land grid nodes and the block–core correspondence. This approach allows us to implement parallel computations in the model that are as simple as with the usual decomposition into squares, but with the added benefit of load balancing. We demonstrate parallel acceleration on up to 996 cores for the model with a resolution of 500×500×39 in the ocean component and 43 sea-ice scalars, and we carry out a detailed analysis of the effect of different partitions on the model runtime.
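To illustrate the partitioning strategy, here is a minimal Python sketch of Hilbert-curve load balancing under stated assumptions: blocks on a 2^k × 2^k grid are ordered along the Hilbert curve and then cut into contiguous chunks of roughly equal total weight (e.g., the number of wet grid points per block). It shows the general technique only, not the FEMAO source code; the grid size, weights, and core count are made up.

```python
import random

def hilbert_index(n, x, y):
    """Position of cell (x, y) along the Hilbert curve on an n x n grid (n a power of 2)."""
    d, s = 0, n // 2
    while s > 0:
        rx = 1 if (x & s) > 0 else 0
        ry = 1 if (y & s) > 0 else 0
        d += s * s * ((3 * rx) ^ ry)
        # rotate/flip the quadrant so lower-order bits are traversed consistently
        if ry == 0:
            if rx == 1:
                x, y = n - 1 - x, n - 1 - y
            x, y = y, x
        s //= 2
    return d

def hilbert_partition(blocks, weights, n_cores, n):
    """Sort blocks along the Hilbert curve, then cut the sequence into contiguous
    chunks of roughly equal total weight; returns a block -> core mapping."""
    ordered = sorted(blocks, key=lambda b: hilbert_index(n, *b))
    total = sum(weights[b] for b in ordered)
    owner, core, acc = {}, 0, 0.0
    for b in ordered:
        owner[b] = core
        acc += weights[b]
        # move to the next core once its share of the total weight is reached
        if core < n_cores - 1 and acc >= total * (core + 1) / n_cores:
            core += 1
    return owner

# Toy example: a 16 x 16 grid of blocks, weight = number of wet levels (assumed data).
random.seed(1)
n = 16
blocks = [(i, j) for i in range(n) for j in range(n)]
weights = {b: random.randint(0, 39) for b in blocks}   # 0 = land-only block
owner = hilbert_partition(blocks, weights, n_cores=8, n=n)
loads = [sum(weights[b] for b in blocks if owner[b] == c) for c in range(8)]
print(loads)   # per-core load, roughly balanced
```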


Author(s):  
Mitchell Nelson ◽  
Zachary Sorenson ◽  
Joseph M. Myre ◽  
Jason Sawin ◽  
David Chiu
