Instruction Scheduling: Latest Research Papers

Polynomial multiplication on embedded vector architectures

High-degree, low-precision polynomial arithmetic is a fundamental computational primitive underlying structured lattice-based cryptography. Its algorithmic properties and suitability for implementation on different compute platforms are an active area of research, and this article contributes to this line of work. Firstly, we present memory-efficiency and performance improvements for the Toom-Cook/Karatsuba polynomial multiplication strategy. Secondly, we provide implementations of those improvements on the Arm® Cortex®-M4 CPU, as well as on the newer Cortex-M55 processor, the first M-profile core implementing the M-profile Vector Extension (MVE), also known as Arm® Helium™ technology. We also implement the Number Theoretic Transform (NTT) on the Cortex-M55 processor. We show that, although the Cortex-M55 is single-issue and in-order and offers only 8 vector registers compared to 32 on A-profile SIMD architectures like Arm® Neon™ technology and the Scalable Vector Extension (SVE), careful register management and instruction scheduling yield a 3× to 5× performance improvement over already highly optimized implementations on Cortex-M4, while maintaining the low area and energy profile necessary for use in the embedded market. Finally, as a real-world application, we integrate our multiplication techniques into the post-quantum key-encapsulation mechanism Saber.
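
For readers unfamiliar with the Karatsuba strategy named in the abstract, the following is a minimal scalar sketch in Python. It is not the paper's vectorized implementation: the function names, the recursion threshold, and the requirement that both inputs have the same power-of-two length are assumptions made for illustration, and the modulus is a free parameter rather than a value taken from the abstract.

```python
def schoolbook(a, b, q):
    """Reference O(n^2) product of coefficient lists a and b, reduced mod q."""
    out = [0] * (len(a) + len(b) - 1)
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            out[i + j] = (out[i + j] + x * y) % q
    return out


def karatsuba(a, b, q, threshold=4):
    """Karatsuba product of two coefficient lists, reduced mod q.

    Assumption (illustrative only): len(a) == len(b) and the length is a
    power of two, so every split produces halves of equal size.
    """
    n = len(a)
    if n <= threshold:
        return schoolbook(a, b, q)
    h = n // 2
    a_lo, a_hi = a[:h], a[h:]
    b_lo, b_hi = b[:h], b[h:]
    # Three half-size products instead of the four a naive split would need.
    lo = karatsuba(a_lo, b_lo, q, threshold)
    hi = karatsuba(a_hi, b_hi, q, threshold)
    a_sum = [(x + y) % q for x, y in zip(a_lo, a_hi)]
    b_sum = [(x + y) % q for x, y in zip(b_lo, b_hi)]
    mid = karatsuba(a_sum, b_sum, q, threshold)
    # Recombine: lo + (mid - lo - hi) * x^h + hi * x^(2h).
    out = [0] * (2 * n - 1)
    for i in range(2 * h - 1):
        out[i] = (out[i] + lo[i]) % q
        out[i + h] = (out[i + h] + mid[i] - lo[i] - hi[i]) % q
        out[i + 2 * h] = (out[i + 2 * h] + hi[i]) % q
    return out
```

Calling `karatsuba(a, b, q)` on two equal-length lists returns the same coefficients as `schoolbook(a, b, q)`. The saving comes from replacing one of the four half-size multiplications with a few additions and subtractions.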

Seeds of SEED: Preventing Priority Inversion in Instruction Scheduling to Disrupt Speculative Interference

10.1109/seed51797.2021.00022 ◽

2021 ◽

Author(s):

Christos Sakalis ◽

Magnus Själander ◽

Stefanos Kaxiras

Keyword(s):

Instruction Scheduling ◽

Priority Inversion

Gretch

ACM Transactions on Architecture and Code Optimization ◽

10.1145/3439803 ◽

2021 ◽

Vol 18 (2) ◽

pp. 1-25

Author(s):

Anirudh Mohan Kaushik ◽

Gennady Pekhimenko ◽

Hiren Patel

Keyword(s):

High Performance ◽

Instruction Scheduling ◽

Graph Representation ◽

Specific Information ◽

Graph Analytics ◽

Spatial Locality ◽

Effective Operation ◽

Important Challenge ◽

Temporal And Spatial ◽

Memory Accesses

Data-dependent memory accesses (DDAs) pose an important challenge for high-performance graph analytics (GA), because such accesses do not exhibit enough temporal and spatial locality, which results in low cache performance. Prior efforts to improve the performance of DDAs for GA are not applicable across GA frameworks because (1) they focus on only one particular graph representation, and (2) they require workload changes to communicate specific information to the hardware for their effective operation. In this work, we propose a hardware-only solution for improving the performance of DDAs for GA across multiple GA frameworks. We present a hardware prefetcher for GA called Gretch that addresses the above limitations. An important observation we make is that identifying certain DDAs without hardware-software communication is sensitive to the instruction scheduling. A key contribution of this work is a hardware mechanism that activates Gretch to identify DDAs under either in-order or out-of-order instruction scheduling. Our evaluation shows that Gretch provides an average speedup of 38% over no prefetching and 25% over a conventional stride prefetcher, and outperforms prior DDA prefetchers by 22% with only a 1% increase in power consumption when executed on different GA workloads and frameworks.
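
To make the notion of a data-dependent access concrete, here is a small Python sketch of a neighbor traversal. It assumes a compressed sparse row (CSR) layout, which is one common graph representation; the abstract does not name a specific one. The function and array names are hypothetical.

```python
def neighbor_sums(offsets, neighbors, values):
    """Sum the values of each vertex's neighbors in a CSR-encoded graph.

    offsets[v]..offsets[v + 1] delimits vertex v's slice of `neighbors`.
    All three array names are illustrative assumptions.
    """
    sums = []
    for v in range(len(offsets) - 1):
        total = 0
        for e in range(offsets[v], offsets[v + 1]):
            u = neighbors[e]    # sequential access: consecutive indices e
            total += values[u]  # data-dependent access: index comes from loaded data
        sums.append(total)
    return sums
```

The load of `neighbors[e]` walks memory in order, which is the pattern a stride prefetcher can follow. The load of `values[u]` jumps to wherever the previously loaded data points, which is the kind of access the abstract describes as lacking locality.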

Certified and efficient instruction scheduling: application to interlocked VLIW processors

Proceedings of the ACM on Programming Languages ◽

10.1145/3428197 ◽

2020 ◽

Vol 4 (OOPSLA) ◽

pp. 1-29

Author(s):

Cyril Six ◽

Sylvain Boulmé ◽

David Monniaux

Keyword(s):

Instruction Scheduling ◽

VLIW Processors

Novel Implementation Approach with Enhanced Memory Access Performance of MGS Algorithm for VLIW Architecture

Journal of Circuits, Systems and Computers ◽

10.1142/s021812662050200x ◽

2020 ◽

Vol 29 (12) ◽

pp. 2050200

Author(s):

Mohamed Najoui ◽

Anas Hatim ◽

Said Belkouch ◽

Noureddine Chabini

Keyword(s):

Linear Equations ◽

Instruction Scheduling ◽

Least Square ◽

Memory Access ◽

Cache Memory ◽

Parallel Memory ◽

Parallel Version ◽

Implementation Approach ◽

Least Square Problem ◽

Sequential Implementation

The modified Gram–Schmidt (MGS) algorithm is one of the best-known QR decomposition (QRD) algorithms. It has been used in many signal and image processing applications to solve least-squares problems and linear equations or to invert matrices. However, QRD is regarded as a computationally expensive technique, and its sequential implementation fails to meet the requirements of many real-time applications. In this paper, we propose a new parallel version of the MGS algorithm that uses VLIW (Very Long Instruction Word) resources efficiently to improve performance. The presented parallel MGS is based on compact VLIW kernels designed for each algorithm step, taking into account architectural and algorithmic constraints. Using instruction scheduling and software pipelining techniques, the proposed kernels efficiently exploit data-, instruction- and loop-level parallelism. Additionally, cache memory properties are used to enhance parallel memory access and to avoid cache misses. The robustness, accuracy and speed of the introduced parallel MGS implementation on VLIW significantly enhance the performance of systems under severe real-time and low-power constraints. Experimental results show great improvements over the optimized vendor QRD implementation and the state of the art.
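
As background for the abstract above, the sequential MGS algorithm can be sketched in a few lines of Python. This is the textbook formulation, not the authors' VLIW kernels. The function name and the column-list representation are assumptions for illustration, and the input columns are assumed to be linearly independent.

```python
import math


def mgs_qr(columns):
    """Modified Gram-Schmidt QR on a list of column vectors.

    Returns (q_cols, r): q_cols holds orthonormal columns and r is the
    upper-triangular factor, so column k equals sum_j r[j][k] * q_cols[j].
    Assumes the input columns are linearly independent.
    """
    n = len(columns)
    v = [list(col) for col in columns]  # working copies, updated in place
    q_cols = []
    r = [[0.0] * n for _ in range(n)]
    for j in range(n):
        # Normalize the current working column.
        r[j][j] = math.sqrt(sum(x * x for x in v[j]))
        q = [x / r[j][j] for x in v[j]]
        q_cols.append(q)
        # Immediately remove the q component from every remaining column.
        # Each iteration of this loop touches a different column v[k].
        for k in range(j + 1, n):
            r[j][k] = sum(a * b for a, b in zip(q, v[k]))
            v[k] = [b - r[j][k] * a for a, b in zip(q, v[k])]
    return q_cols, r
```

The inner loop over `k` updates each remaining column independently of the others, so it is a natural place to look for the data- and loop-level parallelism that the abstract mentions.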

Delay and Bypass: Ready and Criticality Aware Instruction Scheduling in Out-of-Order Processors

2020 IEEE International Symposium on High Performance Computer Architecture (HPCA) ◽

10.1109/hpca47549.2020.00042 ◽

2020 ◽

Cited By ~ 1

Author(s):

Mehdi Alipour ◽

Stefanos Kaxiras ◽

David Black-Schaffer ◽

Rakesh Kumar

Keyword(s):

Instruction Scheduling

Evolutionary Algorithms for Instruction Scheduling, Operation Merging, and Register Allocation in VLIW Compilers

Journal of Signal Processing Systems ◽

10.1007/s11265-019-01493-2 ◽

2020 ◽

Vol 92 (7) ◽

pp. 655-678

Author(s):

Florian Giesemann ◽

Lukas Gerlach ◽

Guillermo Payá-Vayá

Keyword(s):

Evolutionary Algorithms ◽

Register Allocation ◽

Instruction Scheduling

Accelerating the Development of DSL Compilers for Specialized Processors

Proceedings of the Institute for System Programming of RAS ◽

10.15514/ispras-2020-32(5)-3 ◽

2020 ◽

Vol 32 (5) ◽

pp. 35-56

Author(s):

Pyotr Nikolaevich Sovietov

Keyword(s):

Instruction Scheduling ◽

Agile Development ◽

Instruction Set ◽

Cryptographic Primitives ◽

Computing Systems ◽

Practical Applications ◽

SMT Solver ◽

Domain Specific ◽

Instruction Selection ◽

The Internet Of Things

Specialized processors programmable in domain-specific languages are increasingly used in modern computing systems. The compiler-in-the-loop approach, based on the joint development of a specialized processor and a compiler, is gaining popularity. At the same time, traditional tools such as GCC and LLVM are insufficient for the agile development of optimizing compilers that generate target code for an exotic, irregular architecture with static parallelism of operations. The article proposes methods from the field of program synthesis for implementing machine-dependent compilation phases. The phases are based on a reduction to an SMT problem, which makes it possible to dispense with heuristic and approximate approaches that require a complex software implementation of the compiler. In particular, the synthesis of machine-dependent optimization rules, instruction selection, and instruction scheduling combined with register allocation are implemented with the help of an SMT solver. Practical applications of the developed methods and algorithms are illustrated by the example of a compiler for a specialized processor with an instruction set that accelerates the implementation of lightweight cryptography algorithms in the Internet of Things. The results of compilation and simulation of 8 cryptographic primitives for 3 variants of the specialized processor (CISC-like, VLIW-like, and a variant with a delayed load instruction) show the viability of the proposed approach.

Survey on Combinatorial Register Allocation and Instruction Scheduling

ACM Computing Surveys ◽

10.1145/3200920 ◽

2019 ◽

Vol 52 (3) ◽

pp. 1-50

Author(s):

Roberto Castañeda Lozano ◽

Christian Schulte

Keyword(s):

Register Allocation ◽

Instruction Scheduling

Combinatorial Register Allocation and Instruction Scheduling

ACM Transactions on Programming Languages and Systems ◽

10.1145/3332373 ◽

2019 ◽

Vol 41 (3) ◽

pp. 1-53

Author(s):

Roberto Castañeda Lozano ◽

Mats Carlsson ◽

Gabriel Hjort Blindell ◽

Christian Schulte

Keyword(s):

Register Allocation ◽

Instruction Scheduling
