scholarly journals Parallel Implementations of ARX-Based Block Ciphers on Graphic Processing Units

Mathematics ◽  
2020 ◽  
Vol 8 (11) ◽  
pp. 1894
Author(s):  
SangWoo An ◽  
YoungBeom Kim ◽  
Hyeokdong Kwon ◽  
Hwajeong Seo ◽  
Seog Chung Seo

With the development of information and communication technology, various types of Internet of Things (IoT) devices have widely been used for convenient services. Many users with their IoT devices request various services to servers. Thus, the amount of users’ personal information that servers need to protect has dramatically increased. To quickly and safely protect users’ personal information, it is necessary to optimize the speed of the encryption process. Since it is difficult to provide the basic services of the server while encrypting a large amount of data in the existing CPU, several parallel optimization methods using Graphics Processing Units (GPUs) have been considered. In this paper, we propose several optimization techniques using GPU for efficient implementation of lightweight block cipher algorithms on the server-side. As the target algorithm, we select high security and light weight (HIGHT), Lightweight Encryption Algorithm (LEA), and revised CHAM, which are Add-Rotate-Xor (ARX)-based block ciphers, because they are used widely on IoT devices. We utilize the features of the counter (CTR) operation mode to reduce unnecessary memory copying and operations in the GPU environment. Besides, we optimize the memory usage by making full use of GPU’s on-chip memory such as registers and shared memory and implement the core function of each target algorithm with inline PTX assembly codes for maximizing the performance. With the application of our optimization methods and handcrafted PTX codes, we achieve excellent encryption throughput of 468, 2593, and 3063 Gbps for HIGHT, LEA, and revised CHAM on RTX 2070 NVIDIA GPU, respectively. In addition, we present optimized implementations of Counter Mode Based Deterministic Random Bit Generator (CTR_DRBG), which is one of the widely used deterministic random bit generators to provide a large amount of random data to the connected IoT devices. We apply several optimization techniques for maximizing the performance of CTR_DRBG, and we achieve 52.2, 24.8, and 34.2 times of performance improvement compared with CTR_DRBG implementation on CPU-side when HIGHT-64/128, LEA-128/128, and CHAM-128/128 are used as underlying block cipher algorithm of CTR_DRBG, respectively.

2020 ◽  
Vol 10 (11) ◽  
pp. 3711 ◽  
Author(s):  
SangWoo An ◽  
Seog Chung Seo

With the advent of IoT and Cloud computing service technology, the size of user data to be managed and file data to be transmitted has been significantly increased. To protect users’ personal information, it is necessary to encrypt it in secure and efficient way. Since servers handling a number of clients or IoT devices have to encrypt a large amount of data without compromising service capabilities in real-time, Graphic Processing Units (GPUs) have been considered as a proper candidate for a crypto accelerator for processing a huge amount of data in this situation. In this paper, we present highly efficient implementations of block ciphers on NVIDIA GPUs (especially, Maxwell, Pascal, and Turing architectures) for environments using massively large data in IoT and Cloud computing applications. As block cipher algorithms, we choose AES, a representative standard block cipher algorithm; LEA, which was recently added in ISO/IEC 29192-2:2019 standard; and CHAM, a recently developed lightweight block cipher algorithm. To maximize the parallelism in the encryption process, we utilize Counter (CTR) mode of operation and customize it by using GPU’s characteristics. We applied several optimization techniques with respect to the characteristics of GPU architecture such as kernel parallelism, memory optimization, and CUDA stream. Furthermore, we optimized each target cipher by considering the algorithmic characteristics of each cipher by implementing the core part of each cipher with handcrafted inline PTX (Parallel Thread eXecution) codes, which are virtual assembly codes in CUDA platforms. With the application of our optimization techniques, in our implementation on RTX 2070 GPU, AES and LEA show up to 310 Gbps and 2.47 Tbps of throughput, respectively, which are 10.7% and 67% improved compared with the 279.86 Gbps and 1.47 Tbps of the previous best result. In the case of CHAM, this is the first optimized implementation on GPUs and it achieves 3.03 Tbps of throughput on RTX 2070 GPU.


Mathematics ◽  
2020 ◽  
Vol 8 (10) ◽  
pp. 1837
Author(s):  
YoungBeom Kim ◽  
Hyeokdong Kwon ◽  
SangWoo An ◽  
Hwajeong Seo ◽  
Seog Chung Seo

As the development of Internet of Things (IoT), the data exchanged through the network has significantly increased. To secure the sensitive data with user’s personal information, it is necessary to encrypt the transmitted data. Since resource-constrained wireless devices are typically used for IoT services, it is required to optimize the performance of cryptographic algorithms which are computation-intensive tasks. In this paper, we present efficient implementations of ARX-based Korean Block Ciphers (HIGHT and LEA) with CounTeR (CTR) mode of operation, and CTR_DRBG, one of the most widely used DRBGs (Deterministic Random Bit Generators), on 8-bit AVR Microcontrollers (MCUs). Since 8-bit AVR MCUs are widely used for various types of IoT devices, we select it as the target platform in this paper. We present an efficient implementation of HIGHT and LEA by making full use of the property of CTR mode, where the nonce value is fixed, and only the counter value changes during the encryption. On our implementation, the cost of additional function calls occurred by the generation of look-up table can be reduced. With respect to CTR_DRBG, we identified several parts that do not need to be computed. Thus, precomputing those parts in offline and using them online can result in performance improvements for CTR_DRBG. Furthermore, we applied several optimization techniques by making full use of target devices’ characteristics with AVR assembly codes on 8-bit AVR MCUs. Our proposed table generation way can reduce the cost for building a precomputation table by around 6.7% and 9.1% in the case of LEA and HIGHT, respectively. Proposed implementations of LEA and HIGHT with CTR mode on 8-bit AVR MCUs provide 6.3% and 3.8% of improved performance, compared with the previous best results, respectively. Our implementations are the fastest compared to previous LEA and HIGHT implementations on 8-bit AVR MCUs. In addition, the proposed CTR_DRBG implementations on AVR provide better performance by 37.2% and 8.7% when the underlying block cipher is LEA and HIGHT, respectively.


Electronics ◽  
2020 ◽  
Vol 9 (9) ◽  
pp. 1548
Author(s):  
Hyeokdong Kwon ◽  
SangWoo An ◽  
YoungBeom Kim ◽  
Hyunji Kim ◽  
Seung Ju Choi ◽  
...  

As the technology of Internet of Things (IoT) evolves, abundant data is generated from sensor nodes and exchanged between them. For this reason, efficient encryption is required to keep data in secret. Since low-end IoT devices have limited computation power, it is difficult to operate expensive ciphers on them. Lightweight block ciphers reduce computation overheads, which are suitable for low-end IoT platforms. In this paper, we implemented the optimized CHAM block cipher in the counter mode of operation, on 8-bit AVR microcontrollers (i.e., representative sensor nodes). There are four new techniques applied. First, the execution time is drastically reduced, by skipping eight rounds through pre-calculation and look-up table access. Second, the encryption with a variable-key scenario is optimized with the on-the-fly table calculation. Third, the encryption in a parallel way makes multiple blocks computed in online for CHAM-64/128 case. Fourth, the state-of-art engineering technique is fully utilized in terms of the instruction level and register level. With these optimization methods, proposed optimized CHAM implementations for counter mode of operation outperformed the state-of-art implementations by 12.8%, 8.9%, and 9.6% for CHAM-64/128, CHAM-128/128, and CHAM-128/256, respectively.


Mathematics ◽  
2020 ◽  
Vol 8 (10) ◽  
pp. 1781
Author(s):  
SangWoo An ◽  
Seog Chung Seo

With the development of the Internet of Things (IoT) and cloud computing technology, various cryptographic systems have been proposed to protect increasing personal information. Recently, Post-Quantum Cryptography (PQC) algorithms have been proposed to counter quantum algorithms that threaten public key cryptography. To efficiently use PQC in a server environment dealing with large amounts of data, optimization studies are required. In this paper, we present optimization methods for FrodoKEM and NewHope, which are the NIST PQC standardization round 2 competition algorithms in the Graphics Processing Unit (GPU) platform. For each algorithm, we present a part that can perform parallel processing of major operations with a large computational load using the characteristics of the GPU. In the case of FrodoKEM, we introduce parallel optimization techniques for matrix generation operations and matrix arithmetic operations such as addition and multiplication. In the case of NewHope, we present a parallel processing technique for polynomial-based operations. In the encryption process of FrodoKEM, the performance improvements have been confirmed up to 5.2, 5.75, and 6.47 times faster than the CPU implementation in FrodoKEM-640, FrodoKEM-976, and FrodoKEM-1344, respectively. In the encryption process of NewHope, the performance improvements have been shown up to 3.33 and 4.04 times faster than the CPU implementation in NewHope-512 and NewHope-1024, respectively. The results of this study can be used in the IoT devices server or cloud computing service server. In addition, the results of this study can be utilized in image processing technologies such as facial recognition technology.


Author(s):  
Sergio Roldán Lombardía ◽  
Fatih Balli ◽  
Subhadeep Banik

AbstractRecently, cryptographic literature has seen new block cipher designs such as , or that aim to be more lightweight than the current standard, i.e., . Even though family of block ciphers were designed two decades ago, they still remain as the de facto encryption standard, with being the most widely deployed variant. In this work, we revisit the combined one-in-all implementation of the family, namely both encryption and decryption of each as a single ASIC circuit. A preliminary version appeared in Africacrypt 2019 by Balli and Banik, where the authors design a byte-serial circuit with such functionality. We improve on their work by reducing the size of the compact circuit to 2268 GE through 1-bit-serial implementation, which achieves 38% reduction in area. We also report stand-alone bit-serial versions of the circuit, targeting only a subset of modes and versions, e.g., and . Our results imply that, in terms of area, and can easily compete with the larger members of recently designed family, e.g., , . Thus, our implementations can be used interchangeably inside authenticated encryption candidates such as , or in place of .


2021 ◽  
Vol 11 (11) ◽  
pp. 4776
Author(s):  
Kyungbae Jang ◽  
Gyeongju Song ◽  
Hyunjun Kim ◽  
Hyeokdong Kwon ◽  
Hyunji Kim ◽  
...  

Grover search algorithm is the most representative quantum attack method that threatens the security of symmetric key cryptography. If the Grover search algorithm is applied to symmetric key cryptography, the security level of target symmetric key cryptography can be lowered from n-bit to n2-bit. When applying Grover’s search algorithm to the block cipher that is the target of potential quantum attacks, the target block cipher must be implemented as quantum circuits. Starting with the AES block cipher, a number of works have been conducted to optimize and implement target block ciphers into quantum circuits. Recently, many studies have been published to implement lightweight block ciphers as quantum circuits. In this paper, we present optimal quantum circuit designs of symmetric key cryptography, including PRESENT and GIFT block ciphers. The proposed method optimized PRESENT and GIFT block ciphers by minimizing qubits, quantum gates, and circuit depth. We compare proposed PRESENT and GIFT quantum circuits with other results of lightweight block cipher implementations in quantum circuits. Finally, quantum resources of PRESENT and GIFT block ciphers required for the oracle of the Grover search algorithm were estimated.


Author(s):  
Indar Sugiarto ◽  
Doddy Prayogo ◽  
Henry Palit ◽  
Felix Pasila ◽  
Resmana Lim ◽  
...  

This paper describes a prototype of a computing platform dedicated to artificial intelligence explorations. The platform, dubbed as PakCarik, is essentially a high throughput computing platform with GPU (graphics processing units) acceleration. PakCarik is an Indonesian acronym for Platform Komputasi Cerdas Ramah Industri Kreatif, which can be translated as “Creative Industry friendly Intelligence Computing Platform”. This platform aims to provide complete development and production environment for AI-based projects, especially to those that rely on machine learning and multiobjective optimization paradigms. The method for constructing PakCarik was based on a computer hardware assembling technique that uses commercial off-the-shelf hardware and was tested on several AI-related application scenarios. The testing methods in this experiment include: high-performance lapack (HPL) benchmarking, message passing interface (MPI) benchmarking, and TensorFlow (TF) benchmarking. From the experiment, the authors can observe that PakCarik's performance is quite similar to the commonly used cloud computing services such as Google Compute Engine and Amazon EC2, even though falls a bit behind the dedicated AI platform such as Nvidia DGX-1 used in the benchmarking experiment. Its maximum computing performance was measured at 326 Gflops. The authors conclude that PakCarik is ready to be deployed in real-world applications and it can be made even more powerful by adding more GPU cards in it.


2014 ◽  
pp. 153-177
Author(s):  
Sima Noghanian ◽  
Abas Sabouni ◽  
Travis Desell ◽  
Ali Ashtari

2021 ◽  
Vol 2021 ◽  
pp. 1-11
Author(s):  
Jingmin Zhang ◽  
Xiaokui Yue ◽  
Xuan Li ◽  
Haofei Zhang ◽  
Tao Ni ◽  
...  

This article focuses on the simultaneous wireless information and power transfer (SWIPT) systems, which provide both the power supply and the communications for Internet-of-Things (IoT) devices in the sixth-generation (6G) network. Due to the extremely stringent requirements on reliability, speed, and security in the 6G network, aerial access networks (AANs) are deployed to extend the coverage of wireless communications and guarantee robustness. Moreover, sparse code multiple access (SCMA) is implemented on the SWIPT system to further promote the spectrum efficiency. To improve the speed and security of SWIPT systems in 6G AANs, we have developed an optimization algorithm of SCMA to maximize the secrecy sum rate (SSR). Specifically, a power-splitting (PS) strategy is applied by each user to coordinate its energy harvesting and information decoding. Hence, the SSR maximization problems in the SCMA system are formulated in terms of the PS and resource allocation, under the constraints on the minimum rates and minimum harvested energy of individual users. Then, a successive convex approximation method is introduced to transform the nonconvex problems to the convex ones, which are then solved by an iterative algorithm. In addition, we investigate the SSR performance of the SCMA system supported by our optimization methods, when the impacts from different perspectives are considered. Our studies and simulation results show that the SCMA system supported by our proposed optimization algorithms significantly outperforms the legacy system with uniform power allocation and fixed PS.


Sign in / Sign up

Export Citation Format

Share Document