A cost-effective VLSI architecture for anisotropic texture filtering in limited memory bandwidth

Author(s):  
Hyun-Chul Shin ◽  
Jin-Aeon Lee ◽  
Lee-Sup Kim
2014 ◽  
Vol 2014 ◽  
pp. 1-6 ◽  
Author(s):  
Fan Wang ◽  
Xiao Jiang ◽  
Xiao Peng Hu

This paper presents a parallel TBB-CUDA implementation that accelerates the single-Gaussian distribution model, which is effective for background removal in video-based fire detection systems. In this framework, TBB handles the initialization of the estimated Gaussian model on the CPU, and CUDA performs background removal and model adaptation on the GPU. The implementation exploits the combined computing power of TBB and CUDA and can be applied in real-time environments. Over 220 video sequences are used in the experiments. The experimental results show that TBB+CUDA achieves a higher speedup than either TBB or CUDA alone. The proposed framework can effectively overcome the limited memory bandwidth and few execution units of the CPU, and it reduces data transfer latency and memory latency between the CPU and GPU.
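
The abstract does not give the model equations, so the following is only a minimal sketch of a textbook per-pixel single-Gaussian background model, written in plain Python rather than TBB or CUDA. The function names, the learning rate `alpha`, the threshold `k`, and the variance floor `min_var` are assumptions chosen for illustration and are not taken from the paper.

```python
def init_model(frame, initial_variance=25.0):
    """Start one Gaussian per pixel from the first frame (assumed initialization)."""
    mean = [float(pixel) for pixel in frame]
    var = [initial_variance] * len(frame)
    return mean, var


def update(frame, mean, var, alpha=0.05, k=2.5, min_var=4.0):
    """Classify each pixel, then adapt the model for background pixels only.

    Returns a mask in which True marks a foreground pixel, i.e. one that lies
    more than k standard deviations away from the background mean.
    """
    mask = []
    for i, pixel in enumerate(frame):
        diff = pixel - mean[i]
        foreground = diff * diff > (k * k) * var[i]
        mask.append(foreground)
        if not foreground:
            # Running-average update of the mean and variance.
            mean[i] += alpha * diff
            var[i] = max(min_var, (1.0 - alpha) * var[i] + alpha * diff * diff)
    return mask


# Example: a flattened 3-pixel grayscale frame followed by a new frame.
mean, var = init_model([10, 10, 10])
mask = update([10, 12, 200], mean, var)
```

Each pixel is classified and updated independently of its neighbours, which is the property that makes this kind of model a natural fit for one GPU thread per pixel.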


2013 ◽  
Vol 2013 (DPC) ◽  
pp. 001870-001893
Author(s):  
Rajesh Katkar ◽  
Zhijun Zhao ◽  
Ron Zhang ◽  
Rey Co ◽  
Laura Mirkarimi

Existing Package-on-Package (PoP) solutions are rapidly approaching the limit of logic-memory bandwidth in multi-core mobile processor packages. PoP stacking with conventional solder balls has a serious pitch limitation below 400 μm. Through-mold via interconnects may reduce the pitch to 300 μm; however, this technology is believed to reach its limit below 300 μm. Other approaches, including the use of PCB interposers between the logic and the memory, face similar challenges; moreover, they are cumbersome in the assembly process and expensive. Although Through Silicon Via stacking is expected to achieve the ultimate high bandwidth required to support multi-core mobile processors, the technology must overcome challenges in process, infrastructure, supply chain and cost. Bond Via Array (BVA) technology addresses all of these issues while enabling high-bandwidth PoP stacking of more than 1000 high-aspect-ratio interconnects at a pitch of less than 200 μm within the standard package footprint. BVA is a cost-effective, ultra-fine-pitch, high-density PoP stacking solution that will assist in driving high logic-memory bandwidth applications with standard assembly equipment and processes. This is achieved by encapsulating the logic package after forming free-standing wire bonds along the periphery of its flip chip substrate. The wire protrusions formed above the mold cap at the top of the package are then connected to the BGA at the bottom of the memory package during a standard reflow operation. In this work, the initial evaluation test vehicle with 432 PoP interconnects at 240 μm pitch within a standard 14 x 14 mm package footprint is demonstrated. The important technological challenges we overcame to fabricate the first prototypes will be discussed. The reliability performance in temperature cycling, high temperature storage, autoclave and drop testing will be discussed. Finite element analysis modeling used to optimize the package structure will be presented.


Author(s):  
Yu-Hsuan Lee ◽  
Cheng-Hung Kuei ◽  
Yue-Zhan Kao ◽  
Shih-Song Fan Jiang

The demand for visual quality has been driven up by high display resolutions and frame rates. Nevertheless, these two factors place a tremendous demand on memory bandwidth in a video coding system. In this study, an efficient lossless embedded compression (EC) algorithm is proposed to save memory bandwidth while preserving visual quality. The proposed lossless EC algorithm incorporates three core techniques: tree partition, half-pixel prediction and group-based binary coding. Tree partition classifies a [Formula: see text] block into Trunk, Branch and Leaf. With tree partition, half-pixel prediction produces individual residues for Trunk, Branch and Leaf. Group-based binary coding converts these residues to efficient codewords. The lossless compression ratio (CR) of the proposed EC is as high as 2.24 on average, saving memory bandwidth by 55.4%. This EC algorithm is implemented in 0.18 μm CMOS technology. The maximum throughput can reach 6.4 Gpixels/s, which can accommodate [Formula: see text]@60fps. The experimental results demonstrate that this design achieves better hardware efficiency, at 337 Gpixels/J and 83.5 Kpixels/s/gate.
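
The two headline figures are consistent with each other if memory traffic is assumed to scale with the compressed data size, an assumption the abstract does not state explicitly. Under that assumption, a compression ratio of 2.24 gives:

```latex
\[
\text{bandwidth saving} \;=\; 1 - \frac{1}{\mathrm{CR}} \;=\; 1 - \frac{1}{2.24} \;\approx\; 0.554 \;=\; 55.4\%
\]
```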


Author(s):  
P. KARTHIKEYAN ◽  
M. GAUTHAM ◽  
R. RAMAKRISHNAN ◽  
A. MOOKKAIYA

This paper presents a Field Programmable Gate Array (FPGA) implementation of the bilateral filter that achieves high performance and low power consumption. Bilateral filtering smooths images while preserving edges by means of a nonlinear combination of nearby image values. The method is nonlinear, local, and simple. We propose accelerating bilateral filtering with a bilateral grid scheme, which enables fast edge-aware image processing. Most applications today require real-time hardware systems with large computing capacity, for which a fast, dedicated Very Large Scale Integration (VLSI) architecture appears to be the best solution. Such an architecture ensures high resource utilization even on cost-effective platforms such as FPGAs, and designing it offers flexibility, such as speeding up computation through deeper pipelining and parallel processing while reducing memory consumption. Here we have developed an effective approach to bilateral filter implementation on a Spartan-3 FPGA.
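
To make the "nonlinear combination of nearby image values" concrete, here is a minimal brute-force sketch in plain Python. It is the direct, unaccelerated form of the filter, not the bilateral grid scheme or the FPGA architecture described in the paper, and the parameter names `radius`, `sigma_space`, and `sigma_range` are assumptions chosen for illustration.

```python
import math


def bilateral_filter(image, radius=1, sigma_space=1.0, sigma_range=10.0):
    """Brute-force bilateral filter on a 2-D list of grayscale values."""
    height, width = len(image), len(image[0])
    output = [[0.0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            center = image[y][x]
            weighted_sum = 0.0
            weight_total = 0.0
            for dy in range(-radius, radius + 1):
                for dx in range(-radius, radius + 1):
                    ny, nx = y + dy, x + dx
                    if not (0 <= ny < height and 0 <= nx < width):
                        continue  # skip neighbours outside the image
                    neighbour = image[ny][nx]
                    # Spatial weight: falls off with distance from the centre.
                    spatial = math.exp(-(dx * dx + dy * dy) / (2.0 * sigma_space ** 2))
                    # Range weight: falls off with intensity difference,
                    # which is what keeps edges from being blurred.
                    tonal = math.exp(-((neighbour - center) ** 2) / (2.0 * sigma_range ** 2))
                    weight = spatial * tonal
                    weighted_sum += weight * neighbour
                    weight_total += weight
            # The centre pixel always contributes weight 1, so the total is never zero.
            output[y][x] = weighted_sum / weight_total
    return output


# Example: a sharp vertical edge between intensity 0 and intensity 100.
edge = [[0, 0, 100, 100] for _ in range(3)]
smoothed = bilateral_filter(edge)
```

With a range sigma much smaller than the step height, pixels on either side of the edge receive almost no weight from the other side, so the edge stays sharp while flat regions are averaged.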


2018 ◽  
Vol 120 ◽  
pp. 32-43 ◽  
Author(s):  
Vicent Selfa ◽  
Julio Sahuquillo ◽  
María E. Gómez ◽  
Crispín Gómez
