Generalized Load Sharing for Homogeneous Networks of Distributed Environment

We propose job migration policies that consider effective use of global memory in addition to CPU load sharing in distributed systems. When a node is found to lack sufficient memory space to serve its jobs, one or more of its jobs are migrated to remote nodes with low memory allocations. If the memory space is sufficiently large, the jobs are scheduled by a CPU-based load sharing policy. Following the principle of sharing both CPU and memory resources, we present several load sharing alternatives. Our objective is to reduce the number of page faults caused by unbalanced memory allocations for jobs among distributed nodes, so that the overall performance of a distributed system can be significantly improved. We have conducted trace-driven simulations to compare CPU-based load sharing policies with our policies. We show that our load sharing policies not only improve the performance of memory-bound jobs, but also maintain the same load sharing quality as the CPU-based policies for CPU-bound jobs. Regarding remote execution and preemptive migration strategies, our experiments indicate that strategy selection in load sharing depends on the memory demand of jobs: remote execution is more effective for memory-bound jobs, and preemptive migration is more effective for CPU-bound jobs. Our CPU-memory-based policy, using either the high-performance or the high-throughput approach together with the remote execution strategy, performs best for both CPU-bound and memory-bound jobs in homogeneous networks of a distributed environment.
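The placement rule described in this abstract (check memory first, fall back to CPU-based load sharing when memory is sufficient) can be illustrated with a minimal Python sketch. It is not the authors' implementation: the `Node` class, the `cpu_threshold` field, the megabyte units, and the tie-breaking choices are all assumptions made for illustration, and only the remote execution strategy (placing a job when it arrives) is modeled.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical node model; all fields are illustrative assumptions."""
    name: str
    memory_capacity: int                       # user memory in MB (assumed unit)
    cpu_threshold: int                         # job count at which the CPU counts as overloaded
    jobs: list = field(default_factory=list)   # memory demand (MB) of each resident job

    def free_memory(self):
        return self.memory_capacity - sum(self.jobs)


def place_job(demand, home, nodes):
    """Pick the node that runs a newly arrived job and record the job there."""
    others = [n for n in nodes if n is not home]
    if home.free_memory() < demand:
        # Memory shortage at home: send the job to the remote node with the
        # most free memory that can still hold it.
        fits = [n for n in others if n.free_memory() >= demand]
        target = max(fits, key=lambda n: n.free_memory()) if fits else home
    elif len(home.jobs) >= home.cpu_threshold:
        # Memory is sufficient but the CPU is overloaded: CPU-based load
        # sharing, restricted to nodes that also have room for the job.
        fits = [n for n in others
                if n.free_memory() >= demand and len(n.jobs) < n.cpu_threshold]
        target = min(fits, key=lambda n: len(n.jobs)) if fits else home
    else:
        target = home
    target.jobs.append(demand)
    return target
```

In this sketch a job stays on its home node whenever no remote node qualifies; a real policy would also have to weigh migration cost, which the abstract does not quantify.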

CAVLCU: an efficient GPU-based implementation of CAVLC

The Journal of Supercomputing ◽

10.1007/s11227-021-04183-8 ◽

2021 ◽

Author(s):

Antonio Fuentes-Alventosa ◽

Juan Gómez-Luna ◽

José Maria González-Linares ◽

Nicolás Guil ◽

R. Medina-Carnicer

Keyword(s):

High Performance ◽

Data Encryption ◽

Entropy Method ◽

Global Memory ◽

Instruction Level Parallelism ◽

Thread Block ◽

Memory Space ◽

Synchronization Mechanism ◽

Memory Accesses ◽

The One

CAVLC (Context-Adaptive Variable Length Coding) is a high-performance entropy method for video and image compression. It is the most commonly used entropy method in the video standard H.264. In recent years, several hardware accelerators for CAVLC have been designed. In contrast, high-performance software implementations of CAVLC (e.g., GPU-based) are scarce. A high-performance GPU-based implementation of CAVLC is desirable in several scenarios. On the one hand, it can be exploited as the entropy component in GPU-based H.264 encoders, which are a very suitable solution when GPU built-in H.264 hardware encoders lack certain necessary functionality, such as data encryption and information hiding. On the other hand, a GPU-based implementation of CAVLC can be reused in a wide variety of GPU-based compression systems for encoding images and videos in formats other than H.264, such as medical images. This is not possible with hardware implementations of CAVLC, as they are non-separable components of hardware H.264 encoders. In this paper, we present CAVLCU, an efficient implementation of CAVLC on GPU, which is based on four key ideas. First, we use only one kernel to avoid the long-latency global memory accesses required to transmit intermediate results among different kernels, and the costly launches and terminations of additional kernels. Second, we apply an efficient synchronization mechanism for thread-blocks that process adjacent frame regions (in horizontal and vertical dimensions) to share results in global memory space. (In this paper, to prevent confusion, a block of pixels of a frame is referred to simply as a block, and a GPU thread block as a thread-block.) Third, we fully exploit the available global memory bandwidth by using vectorized loads to move the quantized transform coefficients directly to registers. Fourth, we use register tiling to implement the zigzag sorting, thus obtaining high instruction-level parallelism. An exhaustive experimental evaluation showed that our approach is between 2.5× and 5.4× faster than the only state-of-the-art GPU-based implementation of CAVLC.
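To make the zigzag sorting step concrete, the following is a small CPU-side Python sketch of the zigzag reordering of a 4×4 block of quantized coefficients, using the standard H.264 frame-scan order. It only illustrates what the reordering computes; it does not reproduce the register-tiling GPU technique of CAVLCU, and the names `ZIGZAG_4X4` and `zigzag_scan` are chosen here for illustration.

```python
# Raster indices of a 4x4 block listed in H.264 zigzag (frame) scan order.
ZIGZAG_4X4 = (0, 1, 4, 8, 5, 2, 3, 6, 9, 12, 13, 10, 7, 11, 14, 15)


def zigzag_scan(block):
    """Return the 16 coefficients of a 4x4 block (list of rows) in zigzag order."""
    flat = [value for row in block for value in row]
    if len(flat) != 16:
        raise ValueError("expected a 4x4 block")
    return [flat[index] for index in ZIGZAG_4X4]
```

Because the scan is a fixed permutation, it can be expressed as a lookup table; the low-frequency coefficients in the top-left corner end up at the front of the output, which is what makes the trailing run of zeros cheap to encode.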

Multimodal biometric system using deep learning based on face and finger vein fusion

Journal of Intelligent & Fuzzy Systems ◽

10.3233/jifs-189762 ◽

2021 ◽

pp. 1-13

Author(s):

Shikhar Tyagi ◽

Bhavya Chawla ◽

Rupav Jain ◽

Smriti Srivastava

Keyword(s):

High Performance ◽

Recognition Accuracy ◽

Error Rates ◽

Facial Features ◽

Biometric System ◽

Deep Convolutional Neural Networks ◽

Finger Vein ◽

Biometric Systems ◽

Overall Performance ◽

Recognition Systems

Single biometric modalities such as facial features and vein patterns, despite being reliable characteristics, show limitations that restrict them from offering high performance and robustness. Multimodal biometric systems have gained interest due to their ability to overcome the inherent limitations of the underlying single biometric modalities, and have generally been shown to improve overall performance for identification and recognition. This paper proposes highly accurate and robust multimodal biometric identification and recognition systems based on the fusion of face and finger vein modalities. Feature extraction for both face and finger vein is carried out with deep convolutional neural networks. The fusion process combines the relevant features extracted from the two modalities at score level. Experimental results over all considered public databases show a significant improvement in identification and recognition accuracy as well as in equal error rates.
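Score-level fusion can be sketched in a few lines of Python. The abstract does not state which fusion rule the authors use, so the min-max normalization and weighted sum below are assumptions, as are the function names and the `face_weight` parameter; the per-identity match scores are taken as given (in the paper they would come from the two CNN-based matchers).

```python
def min_max_normalize(scores):
    """Rescale scores to [0, 1]; a constant list maps to all zeros."""
    low, high = min(scores), max(scores)
    if high == low:
        return [0.0 for _ in scores]
    return [(s - low) / (high - low) for s in scores]


def fuse_scores(face_scores, vein_scores, face_weight=0.5):
    """Weighted-sum fusion of two per-identity score lists (assumed rule)."""
    if len(face_scores) != len(vein_scores):
        raise ValueError("both modalities must score the same identities")
    face = min_max_normalize(face_scores)
    vein = min_max_normalize(vein_scores)
    return [face_weight * f + (1.0 - face_weight) * v
            for f, v in zip(face, vein)]


def identify(face_scores, vein_scores, face_weight=0.5):
    """Return the index of the enrolled identity with the highest fused score."""
    fused = fuse_scores(face_scores, vein_scores, face_weight)
    return max(range(len(fused)), key=fused.__getitem__)
```

Normalizing before fusing matters because the two matchers may produce scores on different scales; without it, the modality with the larger numeric range would dominate the sum.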

Bejan’s Constructal Theory Analysis of Gas-Liquid Cooled Finned Modules

Journal of Heat Transfer ◽

10.1115/1.4003556 ◽

2011 ◽

Vol 133 (7) ◽

Cited By ~ 14

Author(s):

Giulio Lorenzini ◽

Simone Moretti

Keyword(s):

Thermal Behavior ◽

Heat Exchangers ◽

High Performance ◽

Numerical Study ◽

Constructal Theory ◽

Electronic Components ◽

Theory Analysis ◽

Pressure Losses ◽

Different Types ◽

Overall Performance

High-performance heat exchangers are now key to continuing the miniaturization of electronic components demanded by industry. This numerical study, based on Bejan’s Constructal theory, analyzes the thermal behavior of heat-removing fin modules and compares their performance when operating with different types of fluids. In particular, the simulations involve air and water (as representatives of gases and liquids), to understand the actual benefits of employing a less heat-conductive fluid that involves smaller pressure losses, or vice versa. The analysis parameters typical of a Constructal description (such as conductance or the Overall Performance Coefficient) show that significantly improved performance may be achieved when using water, even though an unavoidable increase in pressure losses affects the liquid-cooled case. Considering the overall performance: if the parameter called Relevance tends to 0, air prevails; if it tends to 1, water prevails; if its value is about 0.5, water prevails in most of the case studies.

High performance SAR processing on an heterogeneous distributed environment

High-Performance Computing and Networking - Lecture Notes in Computer Science ◽

10.1007/bfb0037262 ◽

1998 ◽

pp. 1018-1020 ◽

Cited By ~ 1

Author(s):

F. P. Lovergine ◽

N. Veneziani

Keyword(s):

High Performance ◽

Distributed Environment ◽

SAR Processing

A Chip for a Routing Table Based on a Novel Modified Trie Algorithm

VLSI Design ◽

10.1155/2000/81057 ◽

2000 ◽

Vol 11 (4) ◽

pp. 405-415

Author(s):

D. Torres ◽

A. Larios ◽

M. Guzmán

Keyword(s):

Data Structure ◽

High Performance ◽

Object Oriented ◽

PCI Bus ◽

Memory Space ◽

General Behavior ◽

Routing Table ◽

Starting Point ◽

Associated Data

The design of a routing table circuit for Ethernet, IP, and ATM applications is presented. The starting point for the design was an object-oriented description of the general behavior of the routing table. The data structure selected for the routing table is a modification of the structure known as a trie, which saves one search level and memory space. The architecture for searching and sorting data, implemented in hardware, is explained. This modified trie stores 64 K addresses and their associated data while achieving high performance. The circuit, which can support a flow of 500,000 frames/s, is connected to the PCI bus. The implementation used a FLEX10K100 device from Altera.
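For readers unfamiliar with the base data structure, here is a minimal Python sketch of a plain multi-level trie that maps fixed-width addresses to associated data. It is the textbook structure, not the authors' modified trie (the abstract does not describe the modification that saves a search level), and the 16-bit address width, 4-bit stride, and class names are assumptions picked only so that the table spans 64 K addresses.

```python
class TrieNode:
    """One trie level: child pointers keyed by address chunk, plus stored data."""
    __slots__ = ("children", "data")

    def __init__(self):
        self.children = {}
        self.data = None


class AddressTrie:
    """Exact-match trie over fixed-width addresses (illustrative parameters)."""

    def __init__(self, stride=4, address_bits=16):
        if address_bits % stride:
            raise ValueError("address_bits must be a multiple of stride")
        self.stride = stride
        self.levels = address_bits // stride
        self.root = TrieNode()

    def _chunks(self, address):
        # Split the address into stride-bit chunks, most significant first.
        mask = (1 << self.stride) - 1
        for level in reversed(range(self.levels)):
            yield (address >> (level * self.stride)) & mask

    def insert(self, address, data):
        node = self.root
        for chunk in self._chunks(address):
            node = node.children.setdefault(chunk, TrieNode())
        node.data = data

    def lookup(self, address):
        node = self.root
        for chunk in self._chunks(address):
            node = node.children.get(chunk)
            if node is None:
                return None
        return node.data
```

With these parameters every lookup walks four levels; removing one level, as the abstract reports for the modified trie, would shorten that walk, but how the authors achieve it is not stated here.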

VLSI ARCHITECTURE OF PARALLEL MULTIPLIER– ACCUMULATOR BASED ON RADIX-2 MODIFIED BOOTH ALGORITHM

International Journal of Electronics and Electical Engineering ◽

10.47893/ijeee.2012.1009 ◽

2012 ◽

pp. 40-46

Author(s):

Mr.M.V. Sathish ◽

Mrs. Sailaja

Keyword(s):

Signal Processing ◽

High Speed ◽

High Performance ◽

VLSI Architecture ◽

Clock Frequency ◽

Parallel Multiplier ◽

Hybrid Type ◽

Standard Design ◽

Overall Performance ◽

And Performance

This paper proposes a new architecture of multiplier-and-accumulator (MAC) for high-speed arithmetic. By combining multiplication with accumulation and devising a hybrid type of carry save adder (CSA), the performance was improved. Since the accumulator, which has the largest delay in the MAC, was merged into the CSA, the overall performance was elevated. The proposed CSA tree uses the 1’s-complement-based radix-2 modified Booth’s algorithm (MBA) and has a modified array for sign extension in order to increase the bit density of the operands. The proposed MAC showed superior properties to the standard design in many ways, with performance twice that of previous research at a similar clock frequency. We expect that the proposed MAC can be adapted to various fields requiring high performance, such as signal processing.
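As background for the Booth-based multiplier, the following Python sketch shows textbook radix-2 Booth recoding and a multiply-accumulate built on it. It models the arithmetic only: the 1’s-complement variant, the hybrid CSA tree, and the sign-extension array of the proposed design are not represented, and the function names and the default 8-bit width are assumptions.

```python
def booth_digits(multiplier, bits):
    """Recode a two's-complement multiplier into radix-2 Booth digits (-1, 0, +1).

    Digit i equals y[i-1] - y[i], with y[-1] taken as 0, least significant first.
    """
    if not -(1 << (bits - 1)) <= multiplier < (1 << (bits - 1)):
        raise ValueError("multiplier does not fit in the given width")
    pattern = multiplier & ((1 << bits) - 1)   # two's-complement bit pattern
    digits = []
    previous = 0
    for i in range(bits):
        current = (pattern >> i) & 1
        digits.append(previous - current)
        previous = current
    return digits


def booth_mac(accumulator, multiplicand, multiplier, bits=8):
    """Return accumulator + multiplicand * multiplier using Booth partial products."""
    for i, digit in enumerate(booth_digits(multiplier, bits)):
        # Each nonzero digit adds or subtracts a shifted copy of the multiplicand.
        accumulator += digit * (multiplicand << i)
    return accumulator
```

Folding the accumulator into the same summation loop mirrors, at a purely arithmetic level, the idea of merging accumulation into the partial-product addition rather than performing it as a separate final step.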

Push-based Prefetching in Remote Memory Sharing System

Cloud, Grid and High Performance Computing ◽

10.4018/978-1-60960-603-9.ch017 ◽

2011 ◽

pp. 269-283

Author(s):

Rui Chu ◽

Nong Xiao ◽

Xicheng Lu

Keyword(s):

High Speed ◽

Pattern Mining ◽

Memory Capacity ◽

Communication Cost ◽

Memory Access ◽

Remote Memory ◽

Extra Space ◽

Memory Sharing ◽

Overall Performance ◽

Memory Resources

Remote memory sharing systems aim to improve overall performance by using distributed computing nodes with surplus memory capacity. By exploiting memory resources connected through a high-speed network, user nodes that are short of memory can obtain extra space. The performance of remote memory sharing is constrained by the expensive cost of network communication. To hide the latency of remote memory access and improve performance, we propose push-based prefetching, which enables the memory providers to push potentially useful pages to the user nodes. Each provider employs sequential pattern mining techniques, adapted to the characteristics of memory page access sequences, to locate useful memory pages for prefetching. We have verified the effectiveness of the proposed method through trace-driven simulations.
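The provider-side idea can be sketched in Python as follows. This is a deliberately simplified stand-in: it counts first-order successor frequencies rather than running the sequential pattern mining the authors describe, and the `PushPrefetcher` class, its `fanout` parameter, and its method names are hypothetical.

```python
from collections import Counter, defaultdict


class PushPrefetcher:
    """Provider-side predictor: learns which pages tend to follow each page."""

    def __init__(self, fanout=2):
        self.fanout = fanout                     # max pages pushed per request
        self.successors = defaultdict(Counter)   # page -> counts of next pages
        self.last_page = None

    def record_access(self, page):
        """Update the successor statistics with one observed page request."""
        if self.last_page is not None:
            self.successors[self.last_page][page] += 1
        self.last_page = page

    def pages_to_push(self, page):
        """Return the pages most often requested right after `page`."""
        counts = self.successors.get(page)
        if not counts:
            return []
        return [next_page for next_page, _ in counts.most_common(self.fanout)]
```

The point of pushing from the provider is that the provider already sees the request stream, so the user node does not need to spend a round trip asking for pages it is likely to need next.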

Ballooning Graphics Memory Space in Full GPU Virtualization Environments

Scientific Programming ◽

10.1155/2019/5240956 ◽

2019 ◽

Vol 2019 ◽

pp. 1-11

Author(s):

Younghun Park ◽

Minwoo Gu ◽

Sungyong Park

Keyword(s):

High Performance ◽

Virtual Machines ◽

Graphics Processing Unit ◽

Performance Degradation ◽

Processing Unit ◽

Memory Space ◽

Memory Size ◽

Memory Sharing ◽

GPU Virtualization ◽

Graphics Processing

Advances in virtualization technology have enabled multiple virtual machines (VMs) to share resources in a physical machine (PM). With the widespread use of graphics-intensive applications, such as two-dimensional (2D) or 3D rendering, many graphics processing unit (GPU) virtualization solutions have been proposed to provide high-performance GPU services in a virtualized environment. Although elasticity is one of the major benefits in this environment, the allocation of GPU memory is still static in the sense that after the GPU memory is allocated to a VM, it is not possible to change the memory size at runtime. This causes underutilization of GPU memory or performance degradation of a GPU application due to the lack of GPU memory when an application requires a large amount of GPU memory. In this paper, we propose a GPU memory ballooning solution called gBalloon that dynamically adjusts the GPU memory size at runtime according to the GPU memory requirement of each VM and the GPU memory sharing overhead. The gBalloon extends the GPU memory size of a VM by detecting performance degradation due to the lack of GPU memory. The gBalloon also reduces the GPU memory size when the overcommitted or underutilized GPU memory of a VM creates additional overhead for the GPU context switch or the CPU load due to GPU memory sharing among the VMs. We implemented the gBalloon by modifying the gVirt, a full GPU virtualization solution for Intel’s integrated GPUs. Benchmarking results show that the gBalloon dynamically adjusts the GPU memory size at runtime, which improves the performance by up to 8% against the gVirt with 384 MB of high global graphics memory and 32% against the gVirt with 1024 MB of high global graphics memory.

Adaptive Context-Aware and Structural Correlation Filter for Visual Tracking

Applied Sciences ◽

10.3390/app9071338 ◽

2019 ◽

Vol 9 (7) ◽

pp. 1338 ◽

Cited By ~ 1

Author(s):

Bin Zhou ◽

Tuo Wang

Keyword(s):

Visual Tracking ◽

High Performance ◽

State Of The Art ◽

Correlation Filter ◽

Context Aware ◽

Partial Occlusion ◽

Structural Correlation ◽

Background Clutter ◽

Overall Performance ◽

Fast Motion

Accurate visual tracking is a challenging issue in computer vision. Correlation filter (CF) based methods are popular in visual tracking because of their efficiency and high performance. Nonetheless, traditional CF-based trackers have insufficient context information and drift easily in scenes with fast motion or background clutter. Moreover, CF-based trackers are sensitive to partial occlusion, which may reduce their overall performance and even lead to tracking failure. In this paper, we present an adaptive context-aware (CA) and structural correlation filter for tracking. First, we propose a novel context selection strategy to obtain negative samples. Second, to gain robustness against partial occlusion, we construct a structural correlation filter by learning both holistic and local models. Finally, we introduce an adaptive updating scheme that uses a fluctuation parameter. Extensive experiments on the object tracking benchmark (OTB)-100 dataset demonstrate that our proposed tracker performs favorably against several state-of-the-art trackers.

Trap State and Charge Recombination in Nanocrystalline Passivized Conductive and Photoelectrode Interface of Dye-Sensitized Solar Cell

Coatings ◽

10.3390/coatings10030284 ◽

2020 ◽

Vol 10 (3) ◽

pp. 284 ◽

Cited By ~ 2

Author(s):

Siti Nur Azella Zaine ◽

Norani Muti Mohamed ◽

Mehboob Khatani ◽

Adel Eskandar Samsudin ◽

Muhammad Umair Shahid

Keyword(s):

High Performance ◽

Short Circuit ◽

Dye Sensitized Solar Cells ◽

Open Circuit ◽

Passivation Layer ◽

Electron Recombination ◽

Nanocrystalline TiO2 ◽

Photovoltaic Properties ◽

Dye Sensitized ◽

Overall Performance

The dynamic competition between electron generation and recombination was found to be a bottleneck restricting the development of high-performance dye-sensitized solar cells (DSSCs). Introducing a passivation layer on the surface of the TiO2 photoelectrode material plays a crucial role in separating the charge by preventing the recombination of photogenerated electrons with the oxidized species. This study aims to understand in detail the kinetics of the electron recombination process of a DSSC fabricated with a conductive substrate and photoelectrode film, both passivized with a layer of nanocrystalline TiO2. Interestingly, the coating, which acted as a passivation layer, suppressed the back-electron transfer and improved the overall performance of the integrated DSSC. The passivation layer reduced the exposed site of the fluorine-doped tin oxide (FTO)–electrolyte interface, thereby reducing the dark current phenomenon. In addition, the presence of the passivation layer reduced the rate of electron recombination related to the surface state recombination, as well as the trapping/de-trapping phenomenon. The photovoltaic properties of the nanocrystalline-coated DSSC, such as short-circuit current, open-circuit voltage, and fill factor, showed significant improvement compared to the un-coated photoelectrode film. The overall performance efficiency improved by about 22% compared to the un-coated photoelectrode-based DSSC.
