Enhancing the Scalability of Multi-FPGA Stencil Computations via Highly Optimized HDL Components

2021 ◽  
Vol 14 (3) ◽  
pp. 1-33
Author(s):  
Enrico Reggiani ◽  
Emanuele DEL Sozzo ◽  
Davide Conficconi ◽  
Giuseppe Natale ◽  
Carlo Moroni ◽  
...  

Stencil-based algorithms are a relevant class of computational kernels in high-performance systems, as they appear in a plethora of fields, from image processing to seismic simulations, from numerical methods to physical modeling. Among the various incarnations of stencil-based computations, Iterative Stencil Loops (ISLs) and Convolutional Neural Networks (CNNs) represent two well-known examples of kernels belonging to the stencil class. Indeed, ISLs apply the same stencil several times until convergence, while CNN layers leverage stencils to extract features from an image. The computationally intensive nature of ISLs, CNNs, and stencil-based workloads in general requires solutions able to produce efficient implementations in terms of throughput and power efficiency. In this context, FPGAs are ideal candidates for such workloads, as they allow designing architectures tailored to the stencil's regular computational pattern. Moreover, the ever-growing need for performance leads FPGA-based architectures to scale to multiple devices to benefit from distributed acceleration. For this reason, we propose a library of HDL components to efficiently compute ISLs and CNN inference on FPGAs, along with a scalable multi-FPGA architecture based on custom PCB interconnects. Our solution eases the design flow and guarantees both scalability and performance competitive with state-of-the-art works.
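
The iterative character of ISLs is easiest to see in code. Below is a minimal Python/NumPy sketch of a 4-point Jacobi stencil applied repeatedly until convergence; it illustrates the computational pattern only, not the authors' HDL library, and the function name and tolerance are hypothetical. An FPGA design would pipeline this same regular access pattern in hardware.

```python
import numpy as np

def jacobi_isl(grid, tol=1e-6, max_iters=10000):
    """Minimal Iterative Stencil Loop: apply a 4-point Jacobi stencil
    to the interior of `grid` until the update falls below `tol`."""
    cur = grid.astype(np.float64).copy()
    for _ in range(max_iters):
        nxt = cur.copy()
        # The same stencil is applied at every interior point -- the
        # regular access pattern that FPGA dataflow designs exploit.
        nxt[1:-1, 1:-1] = 0.25 * (cur[:-2, 1:-1] + cur[2:, 1:-1] +
                                  cur[1:-1, :-2] + cur[1:-1, 2:])
        if np.max(np.abs(nxt - cur)) < tol:
            return nxt
        cur = nxt
    return cur
```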

Electronics ◽  
2021 ◽  
Vol 10 (14) ◽  
pp. 1614
Author(s):  
Jonghun Jeong ◽  
Jong Sung Park ◽  
Hoeseok Yang

Recently, the need to run high-performance neural networks (NNs) has been growing even in resource-constrained embedded systems such as wearable devices. However, due to the high computational and memory requirements of NN applications, it is typically infeasible to execute them on a single device. Instead, it has been proposed to run a single NN application cooperatively on multiple devices, a so-called distributed neural network, in which the workload of a single large NN application is distributed over multiple tiny devices. While this approach effectively alleviates the computation overhead, existing distributed NN techniques, such as MoDNN, still suffer from heavy traffic between the devices and vulnerability to communication failures. To eliminate such communication overheads, a knowledge-distillation-based distributed NN, called Network of Neural Networks (NoNN), was proposed, which partitions the filters in the final convolutional layer of the original NN into multiple independent subsets and derives a smaller NN from each subset. However, NoNN also has limitations: the partitioning result may be unbalanced, and it considerably compromises the correlation between filters in the original NN, which may result in unacceptable accuracy degradation in case of communication failure. In this paper, to overcome these issues, we enhance the partitioning strategy of NoNN in two aspects. First, we increase the redundancy of the filters used to derive the smaller NNs by means of averaging, improving the immunity of the distributed NN to communication failure. Second, we propose a novel partitioning technique, modified from eigenvector-based partitioning, that preserves the correlation between filters as much as possible while keeping the number of filters distributed to each device consistent. Through extensive experiments with the CIFAR-100 (Canadian Institute For Advanced Research-100) dataset, we observed that the proposed approach maintains high inference accuracy (over 70% on average, a 1.53× improvement over the state-of-the-art approach) even when half of the eight devices in a distributed NN fail to deliver their partial inference results.
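
As a rough illustration of the two ideas, here is a hedged Python sketch: a balanced spectral partitioning of final-layer filters (a stand-in for the paper's modified eigenvector-based technique) plus a shared average filter for redundancy. The shapes, the random weights, and the correlation-based similarity matrix are all hypothetical.

```python
import numpy as np

def balanced_spectral_partition(similarity, n_devices):
    """Split filters into equal-size groups by sorting along the
    Fiedler vector of the similarity graph's Laplacian, so highly
    correlated filters tend to land in the same group while every
    device receives (nearly) the same number of filters."""
    degree = np.diag(similarity.sum(axis=1))
    laplacian = degree - similarity
    _, eigvecs = np.linalg.eigh(laplacian)
    fiedler = eigvecs[:, 1]            # second-smallest eigenvector
    order = np.argsort(fiedler)
    return np.array_split(order, n_devices)

# Hypothetical example: 64 final-layer filters, 8 devices.
rng = np.random.default_rng(0)
filters = rng.standard_normal((64, 3, 3))           # stand-in weights
sim = np.abs(np.corrcoef(filters.reshape(64, -1)))  # filter correlation
parts = balanced_spectral_partition(sim, n_devices=8)

# Redundancy by averaging: every device also carries the mean filter,
# so a failed peer's contribution can be coarsely approximated.
mean_filter = filters.mean(axis=0)
```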


2020 ◽  
Vol 56 (4) ◽  
pp. 535-538 ◽  
Author(s):  
Jungwon Kim ◽  
Gyeongseop Lee ◽  
Kisu Lee ◽  
Haejun Yu ◽  
Jong Woo Lee ◽  
...  

We first manufactured an F-plasma-treated, carbon-electrode-based, high-performance perovskite solar cell with strong moisture resistance.


2011 ◽  
Vol 2011 ◽  
pp. 1-11 ◽  
Author(s):  
Daehyun Kim ◽  
Joshua Trzasko ◽  
Mikhail Smelyanskiy ◽  
Clifton Haider ◽  
Pradeep Dubey ◽  
...  

Compressive sensing (CS) describes how sparse signals can be accurately reconstructed from far fewer samples than the Nyquist criterion requires. Since MRI scan duration is proportional to the number of acquired samples, CS has been gaining significant attention in MRI. However, the computationally intensive nature of CS reconstructions has precluded their use in routine clinical practice. In this work, we investigate how different throughput-oriented architectures can benefit one CS algorithm and what levels of acceleration are feasible on different modern platforms. We demonstrate that a CUDA-based implementation running on an NVIDIA Tesla C2050 GPU can reconstruct a 256 × 160 × 80 volume from an 8-channel acquisition in 19 seconds, in itself a significant improvement over the state of the art. We then show that Intel's Knights Ferry can perform the same 3D MRI reconstruction in only 12 seconds, bringing CS methods even closer to clinical viability.
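
The reconstructions in question are iterative optimization procedures. A common CS formulation (not necessarily the exact algorithm benchmarked here) minimizes a least-squares data term plus an L1 sparsity penalty via iterative soft-thresholding; a minimal 1D Python sketch with hypothetical problem sizes:

```python
import numpy as np

def soft_threshold(x, t):
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

def ista(A, y, lam=0.05, step=None, iters=200):
    """Iterative soft-thresholding for min_x 0.5*||Ax - y||^2 + lam*||x||_1:
    a gradient step on the data term, then the shrinkage (proximal)
    step on the L1 penalty."""
    if step is None:
        step = 1.0 / np.linalg.norm(A, 2) ** 2  # 1/L, L = Lipschitz const.
    x = np.zeros(A.shape[1])
    for _ in range(iters):
        x = soft_threshold(x - step * A.T @ (A @ x - y), step * lam)
    return x

# Hypothetical example: recover a 10-sparse signal from 80 of 256 samples.
rng = np.random.default_rng(1)
A = rng.standard_normal((80, 256)) / np.sqrt(80)
x_true = np.zeros(256)
x_true[rng.choice(256, 10, replace=False)] = 1.0
x_hat = ista(A, A @ x_true)
```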


Author(s):  
M. Narayana Moorthi ◽  
R. Manjula

Nowadays, the architectures of high-performance systems are evolving, with more and more processor cores on a single chip. This brings both benefits and challenges. The benefit is running more tasks simultaneously, which reduces the running time of a program or application. The challenges include: the practical limit on the number of cores in a given chip; how existing and future software will make use of all the cores; which parallel programming language to choose; the memory and cache-coherence issues that arise as the core count increases; how to address power and performance trade-offs; how the cores are interconnected and how they communicate to solve a single problem; and workload distribution and load balancing as the system scales. There is a practical limit to the speedup and scalability obtainable from additional cores, which needs to be analyzed (see the sketch below). This chapter therefore introduces parallel computing and surveys the challenges in enhancing performance and scalability in parallel computing architectures.
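
The practical limit on speedup mentioned above is usually formalized by Amdahl's law: if a fraction s of a program is inherently serial, the speedup on n cores is bounded by 1/(s + (1-s)/n), which approaches 1/s no matter how many cores are added. A quick Python illustration:

```python
def amdahl_speedup(serial_fraction, n_cores):
    """Upper bound on speedup when `serial_fraction` of the work
    cannot be parallelized (Amdahl's law)."""
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / n_cores)

# Even with only 5% serial work, 1024 cores yield under 20x speedup:
for n in (4, 16, 64, 1024):
    print(n, round(amdahl_speedup(0.05, n), 2))
# 4 -> 3.48, 16 -> 9.14, 64 -> 15.42, 1024 -> 19.64; the limit is 1/0.05 = 20.
```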


Author(s):  
Mahadevan Suryakumar ◽  
Lu-Vong T. Phan ◽  
Mathew Ma ◽  
Wajahat Ahmed

The alarming growth in power consumption has presented numerous packaging challenges for high-performance processors. The average power consumed by a processor is the sum of dynamic and leakage power. The dynamic power is proportional to V², while the leakage current (and therefore leakage power) is proportional to V^b, where V is the supply voltage and b > 1 for modern processes. This means lowering the voltage reduces the energy consumed per clock cycle, but also reduces the maximum frequency at which the processor can operate. Since reducing voltage reduces power faster than it reduces frequency, integrating more cores into the processor yields better performance-per-watt, but generates more memory accesses, driving a need for larger caches and high-speed signaling [1]. In addition, the design goal of a unified package pinout for both single-core and multi-core product flavors adds a further constraint on creating a cost-effective package solution for both market segments. This paper discusses the design strategy and performance of a dual-die package that optimizes package performance against cost.
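
The voltage-scaling argument can be made concrete with a first-order model (a hypothetical illustration, not figures from the paper): take dynamic power proportional to V²·f and assume achievable frequency scales roughly with V. Two cores at 20% lower voltage then deliver about 1.6× the throughput of one full-voltage core for roughly the same dynamic power.

```python
# First-order CMOS power model -- hypothetical illustration, not
# figures from the paper: dynamic power scales as V^2 * f, and the
# achievable frequency scales roughly with V.
def dynamic_power(v, f):
    return v**2 * f   # normalized units

v1, f1 = 1.0, 1.0                      # baseline core
v2, f2 = 0.8, 0.8                      # 20% lower voltage -> ~20% lower f

one_fast = dynamic_power(v1, f1)       # power 1.00, throughput 1.0
two_slow = 2 * dynamic_power(v2, f2)   # power ~1.02, throughput ~1.6
print(one_fast, two_slow)              # comparable power, more throughput
```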


Processes ◽  
2020 ◽  
Vol 8 (5) ◽  
pp. 607
Author(s):  
Omer Mohamed Abubaker Al-hotmani ◽  
Mudhar Abdul Alwahab Al-Obaidi ◽  
Yakubu Mandafiya John ◽  
Raj Patel ◽  
Iqbal Mohammed Mujtaba

In recent times, two or more desalination processes have been combined into integrated systems, widely used to overcome the limitations of the individual processes and to deliver high performance. In this regard, a simple integrated system combining Multi-Effect Distillation with Thermal Vapour Compression (MED/TVC) and Permeate Reprocessing Reverse Osmosis (PRRO) was developed by the same authors, and its validity was confirmed in a comparison study against other configurations. However, that design produces a considerable retentate flowrate and low productivity. To resolve this, two novel designs integrating MED with double reverse osmosis (RO), namely the Permeate Reprocessing and Retentate Reprocessing designs (PRRO and RRRO), are developed and modelled in this paper. To systematically assess the consistency of the presented designs, their performance indicators are compared against the earlier simple MED and PRRO designs at a specified set of operating conditions. The results show the superiority of the integrated MED and double-permeate-reprocessing design, which achieves both economic and environmental advantages: total productivity increases by around 9% and the total retentate flowrate (disposed to water bodies) is reduced by 5%, with marginally lower energy consumption.
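
A back-of-the-envelope mass balance, with hypothetical recovery figures rather than the paper's model, illustrates the kind of trade-off being optimized: passing the first-stage retentate through a second RO stage raises total product water while shrinking the stream disposed to water bodies.

```python
# Back-of-the-envelope RO mass balance with hypothetical recoveries
# (illustration only, not the paper's model).
feed = 100.0                    # units of seawater feed
r1 = 0.45                       # assumed first-pass recovery

permeate_1 = feed * r1          # 45.0 units of product water
retentate_1 = feed - permeate_1 # 55.0 units would be disposed

# Reprocess the retentate in a second RO stage (lower recovery
# assumed, since the stream is saltier).
r2 = 0.30
permeate_2 = retentate_1 * r2   # 16.5 extra units of product water
disposed = retentate_1 - permeate_2

total = permeate_1 + permeate_2
print(total, disposed)          # 61.5 produced, 38.5 disposed (vs 45/55)
```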


Nanophotonics ◽  
2020 ◽  
Vol 9 (15) ◽  
pp. 4579-4588
Author(s):  
Chenghao Feng ◽  
Zhoufeng Ying ◽  
Zheng Zhao ◽  
Jiaqi Gu ◽  
David Z. Pan ◽  
...  

Integrated photonics offers attractive solutions for realizing combinational logic for high-performance computing. Integrated photonic chips can be further optimized using multiplexing techniques such as wavelength-division multiplexing (WDM). In this paper, we propose a WDM-based electronic–photonic switching network (EPSN) to realize the functions of the binary decoder and the multiplexer, fundamental elements in microprocessors for data transportation and processing. We experimentally demonstrate its practicality by implementing a 3–8 (three-input, eight-output) switching network operating at 20 Gb/s. A detailed performance analysis and performance-enhancement techniques are also given in this paper.
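
Functionally, the 3–8 switching network realizes the same truth table as a classical binary decoder: a 3-bit input selects exactly one of eight outputs. A plain Python reference model of that logic (the photonic chip implements it with cascaded WDM-based switches):

```python
def decoder_3to8(a, b, c):
    """Reference model of a 3-to-8 binary decoder: the 3-bit input
    (a, b, c), with a as the MSB, drives exactly one of 8 outputs high."""
    index = (a << 2) | (b << 1) | c
    return [1 if i == index else 0 for i in range(8)]

assert decoder_3to8(0, 0, 0) == [1, 0, 0, 0, 0, 0, 0, 0]
assert decoder_3to8(1, 0, 1) == [0, 0, 0, 0, 0, 1, 0, 0]
```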


Author(s):  
Dong Yan ◽  
Mengxia Liu ◽  
Zhe Li ◽  
Bo Hou

Metal halide perovskites and colloidal quantum dots (QDs) are two emerging classes of photoactive materials that have attracted considerable attention for next-generation, high-performance, solution-processed solar cells. In particular, the...


2021 ◽  
Vol 0 (0) ◽  
Author(s):  
Shippu Sachdeva ◽  
Jagjit Malhotra ◽  
Manoj Kumar

The long-reach passive optical network (LR-PON) is an attractive solution to ever-increasing bandwidth requirements driven by fast-growing internet applications, and is capable of serving distant optical network units (ONUs). Wavelength-division-multiplexed (WDM) PON systems suffer from a distance- and performance-limiting impairment termed dispersion. To compensate for dispersion effects, fiber Bragg gratings (FBGs) and dispersion compensation fibers (DCFs) are incorporated extensively in PONs. DCF performs better than FBG in terms of dispersion compensation, but at a cost of about $3/m, which is very expensive. Long-reach ultra-dense WDM-PON systems therefore need economical, high-performance dispersion compensation modules (DCMs). Three newly constructed hybrid DCMs, FBG-DCF (module 1), OPC-DCF (module 2), and FBG-DCF-OPC (module 3), are investigated in a WDM-PON to identify the optimal DCM in terms of dispersion compensation efficiency (DCE) and economical operation. To the best of the authors' knowledge, DCE calculations and performance enhancement with cost reduction using hybrid DCMs in ultra-dense WDM-PON have not been reported so far. A WDM-PON consisting of 32 channels at 25 GHz channel spacing is analyzed over a 300 km link at 10 Gbps per channel using the different hybrid DCMs. The highest DCE, 70%, is obtained with module 3, along with the largest cost reduction of 19.84%. The DCE of the three modules is as follows: module 3 (70%), module 1 (55%), and module 2 (45%); relative to the conventional module, the cost changes are a 19.84% reduction (module 3), a 19.05% reduction (module 1), and a 10.5% increase (module 2). Hence, module 3 is preferred for long-reach WDM-PON, delivering high performance at lower cost.
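
DCE here is the fraction of the link's accumulated chromatic dispersion that a module cancels. A worked example with typical (not the paper's) fiber parameters:

```python
# Hypothetical parameters: standard SMF accumulates ~17 ps/nm per km.
link_km, d_smf = 300.0, 17.0
accumulated = link_km * d_smf         # 5100 ps/nm over the link

# Suppose a hybrid DCM cancels 3570 ps/nm of that total.
compensated = 3570.0
dce = compensated / accumulated       # 0.70 -> "DCE 70%"
residual = accumulated - compensated  # 1530 ps/nm left uncompensated
print(f"DCE = {dce:.0%}, residual = {residual} ps/nm")
```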


Sensors ◽  
2021 ◽  
Vol 21 (17) ◽  
pp. 5916
Author(s):  
Diego Romano ◽  
Marco Lapegna

Image coregistration for InSAR processing is a time-consuming procedure that is usually run in batch mode. With the availability of low-energy GPU accelerators, processing at the edge is now a promising prospect. Starting from the identification of the most computationally intensive kernels in existing algorithms, we decomposed the cross-correlation problem from a multilevel point of view, with the aim of designing and implementing an efficient GPU-parallel algorithm for multiple settings, including edge computing. We analyzed the accuracy and performance of the proposed algorithm, also considering power efficiency, and its applicability to the identified settings. Results show that a significant speedup of InSAR processing is possible by exploiting GPU computing in different scenarios with no loss of accuracy, also enabling onboard processing on SoC hardware.
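
At its core, patch coregistration locates the peak of a cross-correlation surface between master and slave image patches. A minimal NumPy sketch of an FFT-based kernel under a circular-correlation assumption (a GPU implementation would swap in a library such as CuPy for the transforms):

```python
import numpy as np

def cross_correlate_fft(master, slave):
    """Circular cross-correlation of two equal-size patches via FFT;
    the argmax of the surface gives the integer pixel offset."""
    spectrum = np.fft.fft2(master) * np.conj(np.fft.fft2(slave))
    surface = np.fft.ifft2(spectrum).real
    offset = np.unravel_index(np.argmax(surface), surface.shape)
    return surface, offset

# Hypothetical check: shift a random patch by (3, 5) and recover it.
rng = np.random.default_rng(2)
patch = rng.standard_normal((64, 64))
shifted = np.roll(patch, (3, 5), axis=(0, 1))
_, offset = cross_correlate_fft(shifted, patch)
print(offset)  # (3, 5)
```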

