Theoretical Foundation and GPU Implementation of Face Recognition

Enabling a machine to detect and recognize faces requires significant computational power. The face recognition system described here uses the OpenCV computer vision library and leverages graphics processing units (GPUs) to accelerate processing toward real-time rates. The processing and recognition algorithms are best divided into three distinct steps: detection, projection, and search. Each step has its own computational characteristics and performance requirements. In particular, detection and projection can be accelerated significantly on GPUs because of the data types and arithmetic they involve, such as matrix manipulation. This chapter surveys the three steps and how each contributes to the overall recognition process.
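The projection and search steps named above can be illustrated with a small sketch. It assumes an eigenface-style linear projection followed by a nearest-neighbour search, which the abstract does not confirm; the function names, the toy 4-pixel "images", and the two-vector basis are all hypothetical, and the code does not use OpenCV or a GPU.

```python
import math

def project(face, mean, basis):
    """Project a flattened image onto basis vectors (one dot product per vector)."""
    centered = [p - m for p, m in zip(face, mean)]
    return [sum(b_i * c_i for b_i, c_i in zip(b, centered)) for b in basis]

def search(query, gallery):
    """Return (label, distance) of the gallery entry nearest to the query."""
    best_label, best_dist = None, math.inf
    for label, coords in gallery.items():
        dist = math.dist(query, coords)  # Euclidean distance
        if dist < best_dist:
            best_label, best_dist = label, dist
    return best_label, best_dist

# Hypothetical toy data: 4-pixel images and a 2-vector orthonormal basis.
mean = [0.5, 0.5, 0.5, 0.5]
basis = [[0.5, 0.5, -0.5, -0.5], [0.5, -0.5, 0.5, -0.5]]
gallery = {
    "alice": project([1.0, 1.0, 0.0, 0.0], mean, basis),
    "bob": project([1.0, 0.0, 1.0, 0.0], mean, basis),
}

probe = project([0.9, 0.8, 0.1, 0.0], mean, basis)
label, dist = search(probe, gallery)
print(label)
```

The projection is a matrix-vector product, which is the kind of arithmetic that maps well onto a GPU; the search here is a plain linear scan and would need a different structure for a large gallery.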


Multiple-Precision BLAS Library for Graphics Processing Units

10.36227/techrxiv.12580301.v1 ◽ 2020

Author(s): Konstantin Isupov ◽ Vladimir Knyazkov

Keyword(s): Graphics Processing Units ◽ Arithmetic Operation ◽ Number System ◽ Residue Number System ◽ Floating Point ◽ Data Types ◽ Rounding Errors ◽ Multiple Precision ◽ Graphics Processing ◽ Point Arithmetic

The binary32 and binary64 floating-point formats provide good performance on current hardware, but also introduce a rounding error in almost every arithmetic operation. Consequently, the accumulation of rounding errors in large computations can cause accuracy issues. One way to prevent these issues is to use multiple-precision floating-point arithmetic. This preprint, submitted to Russian Supercomputing Days 2020, presents a new library of basic linear algebra operations with multiple precision for graphics processing units. The library is written in CUDA C/C++ and uses the residue number system to represent multiple-precision significands of floating-point numbers. The supported data types, memory layout, and main features of the library are considered. Experimental results are presented showing the performance of the library.
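The residue number system mentioned above can be illustrated on plain integers. The sketch below is not the library's API or data layout: the moduli and function names are assumptions chosen for illustration, and it only shows why the representation suits parallel hardware, since addition and multiplication act on each residue independently with no carries between them.

```python
from math import prod

# Hypothetical pairwise-coprime moduli; representable range is 0 .. 15014.
MODULI = (7, 11, 13, 15)

def to_rns(x):
    """Represent an integer by its residues modulo each modulus."""
    return tuple(x % m for m in MODULI)

def rns_add(a, b):
    """Add two numbers residue by residue."""
    return tuple((x + y) % m for x, y, m in zip(a, b, MODULI))

def rns_mul(a, b):
    """Multiply two numbers residue by residue."""
    return tuple((x * y) % m for x, y, m in zip(a, b, MODULI))

def from_rns(residues):
    """Recover the integer with the Chinese remainder theorem."""
    big_m = prod(MODULI)
    total = 0
    for r, m in zip(residues, MODULI):
        m_i = big_m // m
        total += r * m_i * pow(m_i, -1, m)  # modular inverse, Python 3.8+
    return total % big_m

a, b = to_rns(123), to_rns(45)
print(from_rns(rns_add(a, b)))  # 168
print(from_rns(rns_mul(a, b)))  # 5535
```

Results are only correct while they stay below the product of the moduli (15015 here), so a practical multiple-precision format would use many more or much larger moduli.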


Face Recognition-Based Attendance System Using Real-Time Computer Vision Algorithms

Advances in Intelligent Systems and Computing - Advanced Machine Learning Technologies and Applications ◽ 10.1007/978-981-15-3383-9_4 ◽ 2020 ◽ pp. 39-49

Author(s): Darshankumar Dalwadi ◽ Yagnik Mehta ◽ Neel Macwan

Keyword(s): Computer Vision ◽ Face Recognition ◽ Real Time


XVA PRINCIPLES, NESTED MONTE CARLO STRATEGIES, AND GPU OPTIMIZATIONS

International Journal of Theoretical and Applied Finance ◽ 10.1142/s0219024918500309 ◽ 2018 ◽ Vol 21 (06) ◽ pp. 1850030 ◽ Cited By ~ 3

Author(s): LOKMAN A. ABBAS-TURKI ◽ STÉPHANE CRÉPEY ◽ BABACAR DIALLO

Keyword(s): Monte Carlo ◽ Interest Rate ◽ Outer Layer ◽ Graphics Processing Units ◽ Lower Layer ◽ Credit Derivatives ◽ Square Root ◽ Root Number ◽ Graphics Processing ◽ Gpu Implementation

We present a nested Monte Carlo (NMC) approach implemented on graphics processing units (GPUs) to X-valuation adjustments (XVAs), where X ranges over C for credit, F for funding, M for margin, and K for capital. The overall XVA suite involves five compound layers of dependence. Higher layers are launched first and trigger nested simulations on the fly whenever required in order to compute an item from a lower layer. If the user is only interested in some of the XVA components, then only the sub-tree corresponding to the outermost XVA needs to be processed computationally. Inner layers only need a number of simulations equal to the square root of the number used for the outermost layer. Some of the layers exhibit a smaller variance. As a result, with GPUs at least, error-controlled NMC XVA computations are doable. However, although NMC is naively suited to parallelization, a GPU implementation of NMC XVA computations requires various optimizations. This is illustrated on XVA computations involving equities, interest rates, and credit derivatives, for both bilateral and central clearing XVA metrics.
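The square-root rule for inner simulations can be shown on a toy two-layer problem. This is only a sketch under stated assumptions, not the paper's XVA computation: the outer variable X is standard normal, the inner variable Y given X is normal with mean X, and the target is the expectation of max(E[Y | X], 0), whose exact value is 1/sqrt(2π). The function name and parameters are hypothetical, and the loop runs serially on the CPU.

```python
import math
import random

def nested_mc(n_outer, seed=0):
    """Two-layer nested Monte Carlo with sqrt(n_outer) inner paths per outer path."""
    rng = random.Random(seed)
    n_inner = max(1, math.isqrt(n_outer))  # square-root rule for the inner layer
    total = 0.0
    for _ in range(n_outer):
        x = rng.gauss(0.0, 1.0)  # outer scenario
        # Inner simulation: estimate E[Y | X = x] from n_inner draws.
        inner_mean = sum(rng.gauss(x, 1.0) for _ in range(n_inner)) / n_inner
        total += max(inner_mean, 0.0)  # nonlinear function of the inner estimate
    return total / n_outer, n_inner

estimate, n_inner = nested_mc(4900)
exact = 1.0 / math.sqrt(2.0 * math.pi)
print(n_inner, round(estimate, 3), round(exact, 3))
```

Each outer scenario is independent of the others, which is why the method parallelizes easily in principle; the nonlinearity applied to a noisy inner estimate introduces a bias that shrinks as the inner sample grows.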


Accelerated FDPS: Algorithms to use accelerators with FDPS

Publications of the Astronomical Society of Japan ◽ 10.1093/pasj/psz133 ◽ 2020 ◽ Vol 72 (1) ◽ Cited By ~ 2

Author(s): Masaki Iwasawa ◽ Daisuke Namekata ◽ Keigo Nitadori ◽ Kentaro Nomura ◽ Long Wang ◽ ...

Keyword(s): Graphics Processing Units ◽ High Performance ◽ General Purpose ◽ Performance Model ◽ Performance Tuning ◽ Data Types ◽ Interaction Function ◽ Current Implementation ◽ And Performance ◽ Graphics Processing

We describe algorithms implemented in FDPS (Framework for Developing Particle Simulators) to make efficient use of accelerator hardware such as GPGPUs (general-purpose computing on graphics processing units). We have developed FDPS to make it possible for researchers to develop their own high-performance parallel particle-based simulation programs without spending large amounts of time on parallelization and performance tuning. FDPS provides a high-performance implementation of parallel algorithms for particle-based simulations in a “generic” form, so that researchers can define their own particle data structure and interparticle interaction functions. FDPS compiled with user-supplied data types and interaction functions provides all the necessary functions for parallelization, and researchers can thus write their programs as though they are writing simple non-parallel code. It has previously been possible to use accelerators with FDPS by writing an interaction function that uses the accelerator. However, the efficiency was limited by the latency and bandwidth of communication between the CPU and the accelerator, and also by the mismatch between the degree of parallelism available in the interaction function and that of the hardware. We have modified the interface of the user-provided interaction functions so that accelerators are used more efficiently. We also implemented new techniques which reduce the amount of work on the CPU side and the amount of communication between the CPU and accelerators. We have measured the performance of N-body simulations on a system with an NVIDIA Volta GPGPU using FDPS, and the achieved performance is around 27% of the theoretical peak limit. We have constructed a detailed performance model, and found that the current implementation can achieve good performance on systems with much smaller memory and communication bandwidth. Thus, our implementation will be applicable to future generations of accelerator systems.
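The division of labour described above, where the user supplies a particle data type and an interaction function and the framework supplies the driver, can be sketched as follows. This is not the FDPS interface (FDPS is a C++ framework): `Particle`, `gravity`, and `apply_interaction` are hypothetical names, the driver is a serial all-pairs loop, and it makes no use of an accelerator.

```python
from dataclasses import dataclass

@dataclass
class Particle:
    """User-defined particle type (hypothetical)."""
    mass: float
    pos: tuple
    acc: tuple = (0.0, 0.0, 0.0)

def gravity(target, sources, eps2=1e-6):
    """User-defined interaction: softened gravitational acceleration (G = 1)."""
    ax = ay = az = 0.0
    for s in sources:
        dx, dy, dz = (s.pos[k] - target.pos[k] for k in range(3))
        r2 = dx * dx + dy * dy + dz * dz + eps2  # eps2 is a softening term
        inv_r3 = s.mass / (r2 * r2 ** 0.5)
        ax += dx * inv_r3
        ay += dy * inv_r3
        az += dz * inv_r3
    return (ax, ay, az)

def apply_interaction(particles, interaction):
    """Generic driver: evaluate the supplied interaction for every particle."""
    for p in particles:
        others = [q for q in particles if q is not p]
        p.acc = interaction(p, others)

bodies = [Particle(1.0, (0.0, 0.0, 0.0)), Particle(1.0, (1.0, 0.0, 0.0))]
apply_interaction(bodies, gravity)
print(bodies[0].acc, bodies[1].acc)
```

In this sketch the interaction function is called once per particle. Handing an accelerator larger batches of work per call is one plausible way to address the latency and parallelism mismatch the abstract describes, though the abstract does not spell out the mechanism.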
