Challenges of System-Level Simulations and Performance Evaluation for 5G Wireless Networks

IEEE Access ◽  
2014 ◽  
Vol 2 ◽  
pp. 1553-1561 ◽  
Author(s):  
Ying Wang ◽  
Jing Xu ◽  
Lisi Jiang
2018 ◽  
Vol 15 (3) ◽  
pp. 1286-1297 ◽  
Author(s):  
Fei Ding ◽  
Aiguo Song ◽  
Dengyin Zhang ◽  
En Tong ◽  
Zhiwen Pan ◽  
...  

2010 ◽  
Vol 20 (02) ◽  
pp. 103-121 ◽  
Author(s):  
Mostafa I. Soliman ◽  
Abdulmajid F. Al-Junaid

Technological advances in IC manufacturing make it possible to integrate more and more functionality into a single chip; modern processors have nearly one billion transistors on a single chip. As systems grow more complex, designs must be modeled at a high level of abstraction before they are partitioned into hardware and software components for final implementation. This paper explains in detail the implementation and performance evaluation of a matrix processor called Mat-Core using SystemC (a system-level modeling language). Mat-Core is a research processor that aims to exploit the increasing number of transistors per IC to improve the performance of a wide range of applications. It extends a general-purpose scalar processor with a matrix unit. To hide memory latency, the matrix unit is decoupled into two components, address generation and data computation, which communicate through data queues. As in vector architectures, the data computation unit is organized in parallel lanes. However, on these parallel lanes Mat-Core can execute matrix-scalar, matrix-vector, and matrix-matrix instructions in addition to vector-scalar and vector-vector instructions. To control the execution of vector/matrix instructions on the matrix core, this paper extends the well-known scoreboard technique. Furthermore, the performance of Mat-Core is evaluated on vector and matrix kernels. Our results show that a four-lane Mat-Core with 4 × 4 (16-element) matrix registers, a queue size of 10, a start-up time of 6 clock cycles, and a memory latency of 10 clock cycles achieves about 0.94, 1.3, 2.3, 1.6, 2.3, and 5.5 FLOPs per clock cycle on scalar-vector multiplication, SAXPY, Givens rotation, rank-1 update, vector-matrix multiplication, and matrix-matrix multiplication, respectively.
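Why decoupling address generation from computation hides memory latency can be illustrated with a toy cycle-count model. This is a hypothetical sketch, not the Mat-Core SystemC model: it assumes one load issued and one element consumed per cycle, a fixed memory latency, and a queue whose capacity bounds how far address generation may run ahead of computation (loads still in flight count against that capacity). The function name `decoupled_cycles` and its parameters are illustrative only.

```python
from collections import deque


def decoupled_cycles(n: int, mem_latency: int, queue_size: int) -> int:
    """Return the cycles needed to stream n elements through a toy decoupled unit.

    Hypothetical model (not Mat-Core's SystemC code):
      * the address-generation side issues at most one load per cycle;
      * a load's data reaches the data queue mem_latency cycles after issue;
      * the computation side consumes at most one queued element per cycle;
      * queue_size caps loads in flight plus data waiting in the queue, i.e.
        how far address generation may run ahead of computation.
    """
    if queue_size < 1:
        raise ValueError("queue_size must be at least 1")
    in_flight = deque()  # arrival cycle of each outstanding load, oldest first
    ready = 0            # elements sitting in the data queue
    issued = done = 0
    cycle = 0
    while done < n:
        # 1. Deliver every load whose memory latency has elapsed.
        while in_flight and in_flight[0] <= cycle:
            in_flight.popleft()
            ready += 1
        # 2. Computation consumes one element if one is available.
        if ready:
            ready -= 1
            done += 1
        # 3. Address generation runs ahead while queue slots remain.
        if issued < n and len(in_flight) + ready < queue_size:
            in_flight.append(cycle + mem_latency)
            issued += 1
        cycle += 1
    return cycle


if __name__ == "__main__":
    # Queue size and memory latency of 10 mirror the parameters quoted above.
    for q in (1, 5, 10):
        print(f"queue_size={q}: {decoupled_cycles(100, 10, q)} cycles")
```

With a queue size and memory latency of 10, `decoupled_cycles(100, 10, 10)` returns 110, so the latency is paid roughly once, whereas a queue of depth 1, which allows only one outstanding load, returns 1001. The model deliberately ignores start-up time, multiple lanes, and the scoreboard.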


Author(s):  
Jasmine Araújo ◽  
Josiane Rodrigues ◽  
Simone Fraiha ◽  
Hermínio Gomes ◽  
João C W A Costa ◽  
...  
