A cost-effective VLSI architecture for anisotropic texture filtering in limited memory bandwidth

This paper presents a parallel TBB-CUDA implementation that accelerates the single-Gaussian distribution model, which is effective for background removal in video-based fire detection systems. In this framework, TBB handles initialization of the estimated Gaussian model on the CPU, and CUDA performs background removal and model adaptation on the GPU. The implementation exploits the combined computing power of TBB and CUDA and can be applied in real-time environments. More than 220 video sequences are used in the experiments. The results show that TBB-CUDA achieves a higher speedup than either TBB or CUDA alone. The proposed framework can effectively overcome the limited memory bandwidth and small number of execution units of the CPU, and it reduces data-transfer latency and memory latency between the CPU and GPU.
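The per-pixel model described in this abstract can be illustrated with a minimal single-threaded sketch. This is not the paper's TBB-CUDA code: the function name `update_single_gaussian`, the learning rate `alpha`, the threshold `k`, the variance floor `min_var`, and the choice to adapt the model only on background pixels are all assumptions made for illustration.

```python
def update_single_gaussian(mean, var, frame, alpha=0.05, k=2.5, min_var=4.0):
    """Classify pixels and adapt a per-pixel single-Gaussian background model.

    mean, var, frame: flat lists of equal length holding grayscale values.
    mean and var are updated in place.
    Returns a list of booleans, True where the pixel is foreground.
    All parameter values are illustrative assumptions.
    """
    foreground = []
    for i, x in enumerate(frame):
        diff = x - mean[i]
        # Foreground if the pixel lies more than k standard deviations from the mean.
        is_fg = diff * diff > (k * k) * var[i]
        foreground.append(is_fg)
        if not is_fg:
            # Running-average adaptation, applied to background pixels only.
            mean[i] += alpha * diff
            var[i] = max(min_var, (1 - alpha) * var[i] + alpha * diff * diff)
    return foreground


mean = [100.0, 100.0]
var = [16.0, 16.0]
print(update_single_gaussian(mean, var, [102, 180]))  # [False, True]
```

Each pixel is processed independently of the others, which is what makes this kind of model a natural fit for a one-thread-per-pixel GPU kernel.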


A Cost-Effective VLSI Architecture of VLD for MPEG-2 and AVS

Multimedia and Expo, 2007 IEEE International Conference on ◽

10.1109/icme.2007.4284976 ◽

2007 ◽

Author(s):

Yanmei Qu ◽

Yu Li ◽

Yun He ◽

Shunliang Mei

Keyword(s):

VLSI Architecture ◽

Cost Effective


Ultra-fine pitch Package on Package solution for high bandwidth mobile applications

Additional Conferences (Device Packaging HiTEC HiTEN & CICMT) ◽

10.4071/2013dpc-tha14 ◽

2013 ◽

Vol 2013 (DPC) ◽

pp. 001870-001893

Author(s):

Rajesh Katkar ◽

Zhijun Zhao ◽

Ron Zhang ◽

Rey Co ◽

Laura Mirkarimi

Keyword(s):

Flip Chip ◽

Cost Effective ◽

Temperature Cycling ◽

Memory Bandwidth ◽

Free Standing ◽

Element Analysis ◽

Fine Pitch ◽

Solder Balls ◽

High Bandwidth ◽

Reliability Performance

Existing Package-on-Package (PoP) solutions are rapidly approaching the logic-memory bandwidth capacity of multi-core mobile processor packages. PoP stacking with conventional solder balls faces a serious pitch limitation below 400 µm. Through-mold via interconnects may reduce the pitch to 300 µm; however, this technology is believed to reach its limit below 300 µm. Other approaches, including PCB interposers between the logic and the memory, face similar challenges and are also cumbersome to assemble and expensive. Although Through Silicon Via (TSV) stacking is expected to achieve the ultimate high bandwidth required to support multi-core mobile processors, the technology must overcome challenges in process, infrastructure, supply chain and cost. Bond Via Array (BVA) technology addresses all of these issues while enabling high-bandwidth PoP stacking of more than 1000 high-aspect-ratio interconnects at a pitch of less than 200 µm within the standard package footprint. BVA is a cost-effective, ultra-fine-pitch, high-density PoP stacking solution that will help drive high logic-memory bandwidth applications with standard assembly equipment and processes. This is achieved by encapsulating the logic package after forming free-standing wire bonds along the periphery of its flip-chip substrate. The wire protrusions formed above the mold cap at the top of the package are then connected to the BGA at the bottom of the memory package during a standard reflow operation. In this work, the initial evaluation test vehicle with 432 PoP interconnects at 240 µm pitch within a standard 14 × 14 mm package footprint is demonstrated. The important technological challenges overcome in fabricating the first prototypes will be discussed, along with the reliability performance in temperature cycling, high-temperature storage, autoclave and drop testing. Finite element analysis modeling used to optimize the package structure will be presented.


High-throughput block-matching VLSI architecture with low memory bandwidth

IEEE Transactions on Circuits and Systems II: Analog and Digital Signal Processing ◽

10.1109/82.663808 ◽

1998 ◽

Vol 45 (4) ◽

pp. 508-512 ◽

Cited By ~ 8

Author(s):

Seung Hyun Nam ◽

Moon Key Lee

Keyword(s):

High Throughput ◽

VLSI Architecture ◽

Block Matching ◽

Memory Bandwidth


An Improved Three-Step Hierarchical Motion Estimation Algorithm and Its Cost-Effective VLSI Architecture

Advances in Multimedia Information Processing – PCM 2007 - Lecture Notes in Computer Science ◽

10.1007/978-3-540-77255-2_98 ◽

2007 ◽

pp. 822-830

Author(s):

Hai Bing Yin ◽

Zhe Lei Xia ◽

Xi Zhong Lou

Keyword(s):

Motion Estimation ◽

VLSI Architecture ◽

Cost Effective ◽

Estimation Algorithm ◽

Motion Estimation Algorithm


Algorithm and VLSI Architecture Designs of a Lossless Embedded Compression Encoder for HD Video Coding Systems

Journal of Circuits, Systems and Computers ◽

10.1142/s021812662130004x ◽

2020 ◽

pp. 2130004

Author(s):

Yu-Hsuan Lee ◽

Cheng-Hung Kuei ◽

Yue-Zhan Kao ◽

Shih-Song Fan Jiang

Keyword(s):

Video Coding ◽

VLSI Architecture ◽

Visual Quality ◽

Memory Bandwidth ◽

Coding System ◽

Maximum Throughput ◽

Coding Systems ◽

Binary Coding ◽

Embedded Compression ◽

Hardware Efficiency

The demand for visual quality has been driven up by higher display resolutions and frame rates. These two factors, however, impose tremendous memory bandwidth requirements on a video coding system. In this study, an efficient lossless embedded compression (EC) algorithm is proposed to save memory bandwidth while preserving visual quality. The proposed lossless EC algorithm incorporates three core techniques: tree partition, half-pixel prediction and group-based binary coding. Tree partition classifies a [Formula: see text] block into Trunk, Branch and Leaf. With tree partition, half-pixel prediction produces individual residues for Trunk, Branch and Leaf. Group-based binary coding converts these residues to efficient codewords. The lossless compression ratio (CR) of the proposed EC reaches 2.24 on average, saving memory bandwidth by 55.4%. This EC algorithm is implemented in 0.18 µm CMOS technology. The maximum throughput can reach 6.4 Gpixels/s, which can accommodate [Formula: see text]@60fps. The experimental results demonstrate a better hardware efficiency of 337 Gpixels/J and 83.5 Kpixels/s/gate.
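The two headline figures in this abstract are consistent with each other: if data is compressed by a ratio CR, the remaining traffic is 1/CR of the original, so the saving is 1 - 1/CR. The short check below is only a worked arithmetic sketch; the helper name `bandwidth_saving` is hypothetical and does not come from the paper.

```python
def bandwidth_saving(compression_ratio):
    """Fraction of memory traffic removed when data shrinks by compression_ratio."""
    if compression_ratio <= 0:
        raise ValueError("compression ratio must be positive")
    return 1.0 - 1.0 / compression_ratio


# CR = 2.24 leaves 1/2.24 of the traffic (about 44.6%), so about 55.4% is saved.
print(f"{bandwidth_saving(2.24):.1%}")  # 55.4%
```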


AN EFFECTIVE APPROACH OF BILATERAL FILTER IMPLEMENTATION IN SPARTAN-3 FIELD PROGRAMMABLE GATE ARRAY

Graduate Research in Engineering and Technology ◽

10.47893/gret.2013.1005 ◽

2013 ◽

pp. 17-20

Author(s):

P. KARTHIKEYAN ◽

M. GAUTHAM ◽

R. RAMAKRISHNAN ◽

A. MOOKKAIYA

Keyword(s):

Field Programmable Gate Array ◽

High Performance ◽

Large Scale ◽

VLSI Architecture ◽

Cost Effective ◽

Bilateral Filter ◽

Bilateral Filtering ◽

High Resource ◽

Field Programmable ◽

Gate Array

This paper presents a Field Programmable Gate Array (FPGA) implementation of the bilateral filter that aims for high performance and low power consumption. Bilateral filtering smooths images while preserving edges by means of a nonlinear combination of nearby image values. The method is nonlinear, local, and simple. We propose that bilateral filtering can be accelerated with a bilateral grid scheme that enables fast edge-aware image processing. Many current applications require real-time hardware systems with large computing capacity, for which a fast, dedicated Very Large Scale Integration (VLSI) architecture appears to be the best solution. Such an architecture ensures high resource utilization on cost-effective platforms like FPGAs, and designing it offers some flexibility, such as speeding up computation with more pipelined structures and parallel processing, and reducing memory consumption. Here we have developed an effective approach to bilateral filter implementation in a Spartan-3 FPGA.
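The "nonlinear combination of nearby image values" can be made concrete with a brute-force one-dimensional sketch. This is neither the bilateral grid acceleration nor the FPGA design the abstract refers to; the function name `bilateral_filter_1d` and the parameters `radius`, `sigma_s` and `sigma_r` are assumptions chosen for illustration.

```python
import math


def bilateral_filter_1d(signal, radius=2, sigma_s=1.5, sigma_r=10.0):
    """Brute-force 1-D bilateral filter (illustrative parameters).

    Each output sample is a weighted mean of its neighbours. A neighbour's
    weight is the product of a spatial Gaussian (distance in position) and a
    range Gaussian (difference in intensity), so samples on the far side of
    an edge contribute almost nothing.
    """
    n = len(signal)
    out = []
    for i in range(n):
        weight_sum = 0.0
        value_sum = 0.0
        for j in range(max(0, i - radius), min(n, i + radius + 1)):
            w_space = math.exp(-((i - j) ** 2) / (2 * sigma_s ** 2))
            w_range = math.exp(-((signal[i] - signal[j]) ** 2) / (2 * sigma_r ** 2))
            w = w_space * w_range
            weight_sum += w
            value_sum += w * signal[j]
        # weight_sum is at least 1 because the centre sample has weight 1.
        out.append(value_sum / weight_sum)
    return out


step = [0, 0, 0, 100, 100, 100]
print(bilateral_filter_1d(step))
```

On the step signal above, the samples on either side of the jump stay close to 0 and 100 respectively, whereas a plain Gaussian blur would smear the edge.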
