Lossless text compression using GPT-2 language model and Huffman coding

2021 ◽  
Vol 102 ◽  
pp. 04013
Author(s):  
Md. Atiqur Rahman ◽  
Mohamed Hamada

With the advancement of telecommunication, modern daily life activities produce large amounts of information. Storing this information on a digital device or transmitting it over the Internet is challenging, which creates the need for data compression. Research on data compression has therefore become a topic of great interest. Because compressed data is generally smaller than the original, data compression saves storage and increases transmission speed. In this article, we propose a text compression technique using the GPT-2 language model and Huffman coding. In the proposed method, the Burrows-Wheeler transform and a list of keys are used to reduce the length of the original text file. We then apply the GPT-2 language model followed by Huffman coding for encoding. The proposed method is compared with state-of-the-art text compression techniques, and we show that it achieves a gain in compression ratio over them.

Symmetry ◽  
2020 ◽  
Vol 12 (10) ◽  
pp. 1654
Author(s):  
Md. Atiqur Rahman ◽  
Mohamed Hamada

Text compression is one of the most significant research fields, and various algorithms for text compression have already been developed. This is a significant issue, as the use of internet bandwidth is considerably increasing. This article proposes a Burrows–Wheeler transform and pattern matching-based lossless text compression algorithm that uses Huffman coding in order to achieve an excellent compression ratio. In this article, we introduce an algorithm with two keys that are used in order to reduce more frequently repeated characters after the Burrows–Wheeler transform. We then find patterns of a certain length from the reduced text and apply Huffman encoding. We compare our proposed technique with state-of-the-art text compression algorithms. Finally, we conclude that the proposed technique demonstrates a gain in compression ratio when compared to other compression techniques. A small problem with our proposed method is that it does not work very well for symmetric communications like Brotli.
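The Burrows–Wheeler transform mentioned in this abstract can be illustrated with a minimal sketch. This is the textbook naive version (sorting all rotations), not the authors' implementation; the function names and the end-of-string sentinel `"\x03"` are assumptions made for the example, and the paper's two-key reduction and pattern-matching steps are not reproduced here.

```python
def bwt(text, eos="\x03"):
    """Naive Burrows-Wheeler transform: sort all rotations, keep the last column.

    `eos` is an assumed sentinel that must not occur in the input.
    """
    if eos in text:
        raise ValueError("input must not contain the end-of-string sentinel")
    s = text + eos
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(rotation[-1] for rotation in rotations)


def inverse_bwt(last_column, eos="\x03"):
    """Invert the transform by repeatedly prepending the last column and sorting."""
    table = [""] * len(last_column)
    for _ in range(len(last_column)):
        table = sorted(ch + row for ch, row in zip(last_column, table))
    # The original string is the row that ends with the sentinel.
    original = next(row for row in table if row.endswith(eos))
    return original[:-1]


transformed = bwt("banana")
restored = inverse_bwt(transformed)
```

The transform tends to group identical characters together (here the two `n` characters become adjacent), which is what makes a later repeated-character reduction and Huffman stage more effective.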


2021 ◽  
pp. 17-25
Author(s):  
Mahmud Alosta ◽  
◽  
◽  
Alireza Souri

In recent years, a massive number of genomic DNA sequences have been generated, which has driven the development of new storage and archiving methods. Processing, storing, and transmitting this huge volume of DNA sequence data is a major challenge. Data compression (DC) techniques have been proposed to reduce the number of bits needed to store and transmit data. DC has recently become more popular, and a large number of techniques have been proposed with applications in several domains. In this paper, a lossless compression technique, arithmetic coding, is employed to compress DNA sequences. To validate the performance of the proposed model, an artificial genome dataset is used and the results are examined in terms of different evaluation parameters. The compression performance of arithmetic coding is compared to that of Huffman coding, LZW coding, and LZMA. The simulation results show that arithmetic coding achieves significantly better compression, with a compression ratio of 0.261 at a bit rate of 2.16 bpc.
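The idea behind arithmetic coding can be sketched with exact rational arithmetic. This is a minimal illustration under stated assumptions, not the paper's implementation: the function names, the static frequency model built from the message itself, and the sample DNA string are all hypothetical, and a practical coder would use fixed-precision integers and emit bits incrementally.

```python
from collections import Counter
from fractions import Fraction


def build_intervals(message):
    """Give each symbol a slice [low, high) of [0, 1) sized by its relative frequency."""
    counts = Counter(message)
    total = len(message)
    intervals, low = {}, Fraction(0)
    for symbol in sorted(counts):
        high = low + Fraction(counts[symbol], total)
        intervals[symbol] = (low, high)
        low = high
    return intervals


def arithmetic_encode(message, intervals):
    """Narrow [0, 1) once per symbol; any value in the final interval encodes the message."""
    low, width = Fraction(0), Fraction(1)
    for symbol in message:
        sym_low, sym_high = intervals[symbol]
        low += width * sym_low
        width *= sym_high - sym_low
    return low, width


def arithmetic_decode(value, length, intervals):
    """Recover `length` symbols by locating `value` in a slice and rescaling."""
    out = []
    for _ in range(length):
        for symbol, (sym_low, sym_high) in intervals.items():
            if sym_low <= value < sym_high:
                out.append(symbol)
                value = (value - sym_low) / (sym_high - sym_low)
                break
    return "".join(out)


dna = "ACGTACGGAC"  # hypothetical sample sequence
model = build_intervals(dna)
low, width = arithmetic_encode(dna, model)
decoded = arithmetic_decode(low + width / 2, len(dna), model)
```

The final interval width equals the product of the symbol probabilities, so frequent symbols shrink the interval less and cost fewer bits, which is why the method can beat a per-symbol code such as Huffman coding on skewed data.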


2013 ◽  
Vol 842 ◽  
pp. 712-716
Author(s):  
Qi Hong ◽  
Xiao Lei Lu

As a lossless data compression code, Huffman coding is widely used in text compression. Nevertheless, the traditional approach has some deficiencies. For example, applying the same compression to all characters may overlook the particular role of keywords and special statements as well as the regularity of some statements. To address this, a new data compression algorithm based on semantic analysis is proposed in this paper. The new method, which takes C language keywords as its basic elements, is designed for compressing C source files. Experimental results show that the compression ratio is improved by roughly 150 percent. The method can be extended to text compression for other constrained languages.
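One way to picture "keywords as the basic element" is to tokenize the source so that each C keyword becomes a single symbol before any entropy coding. The sketch below is only an illustration of that idea, not the paper's algorithm: the keyword subset, the regular expression, and the function name `tokenize` are assumptions.

```python
import re
from collections import Counter

# A small illustrative subset of C keywords (assumption; the paper's list is not given here).
C_KEYWORDS = ("int", "return", "if", "else", "for", "while", "void", "char")

# Match a whole keyword when it stands alone as a word, otherwise any single character.
_TOKEN = re.compile(r"\b(?:%s)\b|." % "|".join(C_KEYWORDS), re.DOTALL)


def tokenize(source):
    """Split C source into symbols: whole keywords where possible, else single characters."""
    return _TOKEN.findall(source)


source = "int main(void) { int x = 0; return x; }"
tokens = tokenize(source)
symbol_counts = Counter(tokens)  # frequencies a Huffman coder could then use
```

Because a frequent keyword such as `return` is counted as one symbol, it can receive one short code instead of six separate character codes.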


2016 ◽  
Vol 78 (6-4) ◽  
Author(s):  
Muhamad Azlan Daud ◽  
Muhammad Rezal Kamel Ariffin ◽  
S. Kularajasingam ◽  
Che Haziqah Che Hussin ◽  
Nurliyana Juhan ◽  
...  

A new compression algorithm is proposed to make practical a modified Baptista symmetric cryptosystem, which is based on a chaotic dynamical system. The Baptista symmetric cryptosystem is able to produce various ciphers in response to the same message input. This modified Baptista-type cryptosystem suffers from message expansion, which goes against the conventional methodology of a symmetric cryptosystem. A new lossless data compression algorithm based on ideas from Huffman coding for data transmission is proposed. This new compression mechanism does not face the problem of mapping elements from a domain that is much larger than its range. Our new algorithm circumvents this problem via a pre-defined codeword list. The proposed algorithm has fast encoding and decoding mechanisms and is proven analytically to be a lossless data compression technique.


2011 ◽  
Vol 2 (1) ◽  
Author(s):  
Victor Amrizal

A file can be made smaller by compressing it. Text compression aims to reduce the repetition of the symbols or characters that make up a text by encoding them, so that storage requirements are reduced and data transfer is faster. Text compression can be done by encoding segments of the original text and placing them in a dictionary. Compression can be performed with various algorithms, among them Huffman coding, a compression technique that uses the frequency distribution of symbols to form unique codes. The frequency of a symbol determines the length of its Huffman code: the more often a symbol appears in the text, the shorter its Huffman code. The method encodes symbols or characters with the help of a binary tree, repeatedly merging the two lowest-frequency nodes until a code tree is formed. Keywords: Huffman coding, data compression, algorithm.
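The tree-building procedure described in this abstract (repeatedly merging the two least frequent nodes) can be sketched as follows. This is a minimal textbook version with hypothetical function names, not code from the article; it stores each subtree as a symbol-to-code dictionary rather than as explicit tree nodes.

```python
import heapq
from collections import Counter


def huffman_codes(text):
    """Build a prefix code by repeatedly merging the two lowest-frequency subtrees."""
    freq = Counter(text)
    if not freq:
        return {}
    if len(freq) == 1:  # a single distinct symbol still needs a one-bit code
        return {next(iter(freq)): "0"}
    # Heap entries: (weight, unique tie-breaker, {symbol: code built so far}).
    heap = [(weight, i, {symbol: ""}) for i, (symbol, weight) in enumerate(freq.items())]
    heapq.heapify(heap)
    next_id = len(heap)
    while len(heap) > 1:
        w1, _, left = heapq.heappop(heap)
        w2, _, right = heapq.heappop(heap)
        merged = {symbol: "0" + code for symbol, code in left.items()}
        merged.update({symbol: "1" + code for symbol, code in right.items()})
        heapq.heappush(heap, (w1 + w2, next_id, merged))
        next_id += 1
    return heap[0][2]


def huffman_encode(text, codes):
    """Concatenate the code of every character into one bit string."""
    return "".join(codes[ch] for ch in text)


def huffman_decode(bits, codes):
    """Read bits until they match a code; prefix-freeness makes this unambiguous."""
    inverse = {code: symbol for symbol, code in codes.items()}
    out, current = [], ""
    for bit in bits:
        current += bit
        if current in inverse:
            out.append(inverse[current])
            current = ""
    return "".join(out)


sample = "abracadabra"
codes = huffman_codes(sample)
bits = huffman_encode(sample, codes)
```

In this sample the most frequent character, `a`, receives the shortest code, matching the inverse relation between frequency and code length stated above.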


Author(s):  
Manasi Rath ◽  
Suvendu Rup

This paper is a methodological review of image compression using the Burrows-Wheeler Transform (BWT). BWT is normally used for text compression, but it has recently been applied to image compression. It is a lossless compression technique used for high-resolution images. This paper reviews several schemes that have been combined with BWT to improve image compression performance, which helps in formulating new techniques for further improving BWT. Different authors use different representations of BWT to achieve better compression.


Author(s):  
Muhammad Usama ◽  
Qutaibah M. Malluhi ◽  
Nordin Zakaria ◽  
Imran Razzak ◽  
Waheed Iqbal

Data stored in physical storage or transferred over a communication channel includes substantial redundancy. Compression techniques cut down the data redundancy to reduce space and communication time. Nevertheless, compression techniques lack proper security measures, e.g., secret key control, leaving the data susceptible to attack. Data encryption is therefore needed to achieve data security by keeping the data unreadable and unaltered through a secret key. This work addresses the problems of data compression and encryption together, without either negatively affecting the other. Towards this end, an efficient, secure data compression technique is introduced, which combines adaptive Huffman coding, a pseudorandom keystream generator, and an S-Box to bring the confusion and diffusion properties of cryptography into the compression process and overcome the performance issues. Thus, compression is carried out according to a secret key such that the output is both encrypted and compressed in a single step. The proposed work is shown to be a good fit for real-time implementation, providing robust encryption quality and acceptable compression capability. Experimental results show that the proposed technique is efficient and produces space savings (%) similar to those of standard techniques. Security analysis shows that the proposed technique is sensitive to the secret key and plaintext. Moreover, the ciphertexts produced by the proposed technique passed all NIST tests, which confirms the randomness of the ciphertext at the 99% confidence level.

