Lossless text compression using GPT-2 language model and Huffman coding

Modern daily activities generate large amounts of information as telecommunication advances. Storing this information on a digital device or transmitting it over the Internet is challenging, which creates the need for data compression, and research on data compression has therefore become a topic of great interest. Because compressed data is generally smaller than the original, data compression saves storage and increases transmission speed. In this article, we propose a text compression technique using the GPT-2 language model and Huffman coding. In the proposed method, the Burrows-Wheeler transform and a list of keys are first used to reduce the length of the original text file. We then apply the GPT-2 language model, followed by Huffman coding, for encoding. The proposed method is compared with state-of-the-art text compression techniques and shows a gain in compression ratio over them.

Burrows–Wheeler Transform Based Lossless Text Compression Using Keys and Huffman Coding

Symmetry ◽

10.3390/sym12101654 ◽

2020 ◽

Vol 12 (10) ◽

pp. 1654

Author(s):

Md. Atiqur Rahman ◽

Mohamed Hamada

Keyword(s):

Compression Ratio ◽

State Of The Art ◽

Huffman Coding ◽

Text Compression ◽

Small Problem ◽

Research Fields ◽

Huffman Encoding ◽

Significant Research ◽

Burrows Wheeler Transform ◽

Use Of Internet

Text compression is one of the most significant research fields, and various algorithms for text compression have already been developed. This is a significant issue, as the use of internet bandwidth is considerably increasing. This article proposes a Burrows–Wheeler transform and pattern matching-based lossless text compression algorithm that uses Huffman coding in order to achieve an excellent compression ratio. In this article, we introduce an algorithm with two keys that are used in order to reduce more frequently repeated characters after the Burrows–Wheeler transform. We then find patterns of a certain length from the reduced text and apply Huffman encoding. We compare our proposed technique with state-of-the-art text compression algorithms. Finally, we conclude that the proposed technique demonstrates a gain in compression ratio when compared to other compression techniques. A small problem with our proposed method is that it does not work very well for symmetric communications like Brotli.
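
As an illustration of the Burrows–Wheeler transform mentioned in this abstract, the following is a minimal Python sketch of the textbook transform and its inverse. It is not the authors' algorithm: the two-key reduction and pattern-matching steps are not reproduced, and the function names and the choice of `"\x00"` as an end-of-text sentinel are assumptions made for this example. The naive sorted-rotation approach is only practical for short inputs.

```python
def bwt(text: str, sentinel: str = "\x00") -> str:
    """Naive Burrows-Wheeler transform: sort all rotations, keep the last column."""
    if sentinel in text:
        raise ValueError("sentinel must not occur in the input")
    s = text + sentinel  # the sentinel marks the end of the original text
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(rotation[-1] for rotation in rotations)


def inverse_bwt(last_column: str, sentinel: str = "\x00") -> str:
    """Invert the transform by repeatedly prepending the last column and sorting."""
    n = len(last_column)
    table = [""] * n
    for _ in range(n):
        table = sorted(last_column[i] + table[i] for i in range(n))
    # The row that ends with the sentinel is the original text plus sentinel.
    original = next(row for row in table if row.endswith(sentinel))
    return original[:-1]


transformed = bwt("banana")
restored = inverse_bwt(transformed)
```

The transform tends to group identical characters together (here the two `n` characters and two of the `a` characters end up adjacent), which is what makes a later run-reduction or entropy-coding stage more effective.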

Design of Effective Lossless Data Compression Technique for Multiple Genomic DNA Sequences

10.54216/fpa.060103 ◽

2021 ◽

pp. 17-25

Author(s):

Mahmud Alosta ◽

◽

Alireza Souri

Keyword(s):

Data Compression ◽

Dna Sequences ◽

Genomic Dna ◽

Arithmetic Coding ◽

Huffman Coding ◽

Compression Technique ◽

Compression Performance ◽

Proposed Model ◽

Evaluation Parameters ◽

Genome Dataset

In recent years, massive amounts of genomic DNA sequence data have been created, which has led to the development of new storage and archiving methods. Processing, storing, or transmitting this huge volume of DNA sequence data is a major challenge. Data compression (DC) techniques have been proposed to reduce the number of bits needed to store and transmit data. DC has recently become more popular, and a large number of techniques have been proposed with applications in several domains. In this paper, a lossless compression technique, arithmetic coding, is employed to compress DNA sequences. To validate the performance of the proposed model, an artificial genome dataset is used and the results are examined in terms of different evaluation parameters. The compression performance of arithmetic coding is compared with that of Huffman coding, LZW coding, and LZMA. The simulation results show that arithmetic coding achieves significantly better compression, with a compression ratio of 0.261 at a bit rate of 2.16 bpc.

Study on Data Compression Algorithm Based on Semantic Analysis

Advanced Materials Research ◽

10.4028/www.scientific.net/amr.842.712 ◽

2013 ◽

Vol 842 ◽

pp. 712-716

Author(s):

Qi Hong ◽

Xiao Lei Lu

Keyword(s):

Data Compression ◽

Semantic Analysis ◽

Traditional Approach ◽

Basic Element ◽

Compression Algorithm ◽

Huffman Coding ◽

Text Compression ◽

C Language ◽

Lossless Data Compression ◽

Compression Coding

As a lossless data compression code, Huffman coding is widely used in text compression. Nevertheless, the traditional approach has some deficiencies. For example, applying the same compression to all characters may overlook the particular nature of keywords and special statements, as well as the regularity of some statements. To address this, a new data compression algorithm based on semantic analysis is proposed in this paper. The new method, which takes C language keywords as the basic element, is designed for compressing C language source files. Experimental results show that the compression ratio is improved by roughly 150 percent in this way. The method can be extended to text compression for other constrained languages.

USE OF NEW EFFICIENT LOSSLESS DATA COMPRESSION METHOD IN TRANSMITTING ENCRYPTED BAPTISTA SYMMETRIC CHAOTIC CRYPTOSYSTEM DATA

Jurnal Teknologi ◽

10.11113/jt.v78.8976 ◽

2016 ◽

Vol 78 (6-4) ◽

Author(s):

Muhamad Azlan Daud ◽

Muhammad Rezal Kamel Ariffin ◽

S. Kularajasingam ◽

Che Haziqah Che Hussin ◽

Nurliyana Juhan ◽

...

Keyword(s):

Data Compression ◽

Data Transmission ◽

Compression Algorithm ◽

Huffman Coding ◽

Compression Method ◽

Compression Technique ◽

Chaotic Dynamical System ◽

Fast Encoding ◽

Compression Mechanism ◽

Lossless Data Compression

A new compression algorithm is proposed to make a modified Baptista symmetric cryptosystem, which is based on a chaotic dynamical system, practical to use. The Baptista symmetric cryptosystem is able to produce various ciphers in response to the same message input. This modified Baptista-type cryptosystem suffers from message expansion, which goes against the conventional methodology of a symmetric cryptosystem. A new lossless data compression algorithm based on ideas from Huffman coding is proposed for data transmission. This new compression mechanism does not face the problem of mapping elements from a domain that is much larger than its range. Our new algorithm circumvents this problem via a pre-defined codeword list. The proposed algorithm has a fast encoding and decoding mechanism and is proven analytically to be a lossless data compression technique.

Implementasi Algoritma Kompresi Data Huffman Untuk Memperkecil Ukuran File MP3 Player (Implementation of the Huffman Data Compression Algorithm to Reduce MP3 Player File Size)

JURNAL TEKNIK INFORMATIKA ◽

10.15408/jti.v2i1.8 ◽

2011 ◽

Vol 2 (1) ◽

Author(s):

Victor Amrizal

Keyword(s):

Frequency Distribution ◽

Data Transfer ◽

Huffman Coding ◽

Original Text ◽

Text Compression ◽

Huffman Code ◽

Code Length ◽

Transfer Time ◽

Compression Process ◽

Mp3 Player

A file is reduced in size by compressing it. Text compression aims to reduce the repetition of the symbols or characters that make up a text by encoding them, so that the required storage space is reduced and data transfer time is shortened. Text compression can be done by encoding segments of the original text and then placing them in a dictionary. Compression can be carried out with various algorithms, among them Huffman coding, a compression technique that uses the frequency distribution of symbols to form unique codes. The frequency distribution of the symbols determines the length of their Huffman codes: the more frequently a symbol appears in the text, the shorter its Huffman code. The method encodes symbols or characters with the help of a binary tree, repeatedly merging the two characters with the lowest frequencies until a code tree is formed. Keywords: Huffman coding, data compression, algorithm.
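
The tree-building step described in this abstract can be sketched in Python as follows. This is a minimal illustration of standard Huffman coding, not the paper's implementation; the function names (`huffman_codes`, `encode`, `decode`) and the sample string are assumptions made for the example, and the output is kept as a string of `0` and `1` characters rather than packed bits.

```python
import heapq
from collections import Counter


def huffman_codes(text: str) -> dict[str, str]:
    """Build a prefix code by repeatedly merging the two least frequent nodes."""
    freq = Counter(text)
    if not freq:
        return {}
    if len(freq) == 1:  # a single distinct symbol still needs a one-bit code
        return {next(iter(freq)): "0"}
    # Heap entries: (weight, unique tie-breaker, {symbol: code built so far}).
    heap = [(w, i, {sym: ""}) for i, (sym, w) in enumerate(freq.items())]
    heapq.heapify(heap)
    tie = len(heap)
    while len(heap) > 1:
        w1, _, left = heapq.heappop(heap)   # lowest frequency
        w2, _, right = heapq.heappop(heap)  # second lowest frequency
        merged = {s: "0" + c for s, c in left.items()}
        merged.update({s: "1" + c for s, c in right.items()})
        heapq.heappush(heap, (w1 + w2, tie, merged))
        tie += 1
    return heap[0][2]


def encode(text: str, codes: dict[str, str]) -> str:
    return "".join(codes[ch] for ch in text)


def decode(bits: str, codes: dict[str, str]) -> str:
    reverse = {code: sym for sym, code in codes.items()}
    out, current = [], ""
    for bit in bits:
        current += bit
        if current in reverse:  # prefix-free codes make this unambiguous
            out.append(reverse[current])
            current = ""
    return "".join(out)


sample = "aaaabbc"
codes = huffman_codes(sample)
bits = encode(sample, codes)
```

For the sample string, the most frequent symbol `a` receives a one-bit code while `b` and `c` receive two-bit codes, so the seven characters encode to ten bits, consistent with the statement that more frequent symbols get shorter codes.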

A data compression technique based on reversed leading bits coding and Huffman coding

2015 10th International Conference on Communications and Networking in China (ChinaCom) ◽

10.1109/chinacom.2015.7497980 ◽

2015 ◽

Author(s):

Haoqi Ren

Keyword(s):

Data Compression ◽

Huffman Coding ◽

Compression Technique

A Survey on Image Compression using Burrows Wheeler Transform

IRA-International Journal of Technology & Engineering (ISSN 2455-4480) ◽

10.21013/jte.v6.n2.p3 ◽

2017 ◽

Vol 6 (2) ◽

pp. 29

Author(s):

Manasi Rath ◽

Suvendu Rup

Keyword(s):

Image Compression ◽

Review Paper ◽

Lossless Compression ◽

Text Compression ◽

Compression Technique ◽

Methodological Review ◽

A New Technique ◽

High Level ◽

Burrows Wheeler Transform ◽

Compression Field

This paper is a methodological review of image compression using the Burrows Wheeler Transform (BWT). BWT is normally used for text compression, but it has recently been applied to image compression. It is a lossless compression technique that is used for high-level resolution. This paper discusses several schemes combined with BWT to improve image compression performance, which helps in formulating a new technique for further improvement of BWT. Different authors use different representations of BWT to achieve better compression.

An efficient secure data compression technique based on chaos and adaptive Huffman coding

Peer-to-Peer Networking and Applications ◽

10.1007/s12083-020-00981-8 ◽

2020 ◽

Author(s):

Muhammad Usama ◽

Qutaibah M. Malluhi ◽

Nordin Zakaria ◽

Imran Razzak ◽

Waheed Iqbal

Keyword(s):

Data Compression ◽

Data Encryption ◽

Single Step ◽

Huffman Coding ◽

Security Measures ◽

Secret Key ◽

Compression Process ◽

Compression Technique ◽

Communication Time ◽

Secure Data

Data stored in physical storage or transferred over a communication channel includes substantial redundancy. Compression techniques cut down this redundancy to reduce space and communication time. Nevertheless, compression techniques lack proper security measures, e.g., secret key control, leaving the data susceptible to attack. Data encryption is therefore needed to achieve data security by keeping the data unreadable and unaltered through a secret key. This work addresses data compression and encryption together, without one negatively affecting the other. To this end, an efficient, secure data compression technique is introduced that combines adaptive Huffman coding with a pseudorandom keystream generator and an S-Box to bring the confusion and diffusion properties of cryptography into the compression process and overcome the performance issues. Compression is thus carried out according to a secret key, so that the output is both encrypted and compressed in a single step. The proposed work is shown to be a good fit for real-time implementation, providing robust encryption quality and acceptable compression capability. Experimental results show that the proposed technique is efficient and produces space savings (%) similar to those of standard techniques. Security analysis shows that the proposed technique is sensitive to the secret key and plaintext. Moreover, the ciphertexts produced by the proposed technique passed all NIST tests, which confirms the randomness of the ciphertext at the 99% confidence level.
