Text compression using FI sequences
1993 ◽ Author(s): Cleopas Angaye, Paul Fisher
1993 ◽ Vol 29 (24) ◽ pp. 2155 ◽ Author(s): H.U. Khan, J. Ahmad, A. Mahmood, H.A. Fatmi

1999 ◽ Vol 12 (4-5) ◽ pp. 755-765 ◽ Author(s): P.M. Long, A.I. Natsev, J.S. Vitter

1995 ◽ Vol 1 (2) ◽ pp. 163-190 ◽ Author(s): Kenneth W. Church, William A. Gale

Abstract: Shannon (1948) showed that a wide range of practical problems can be reduced to the problem of estimating probability distributions of words and n-grams in text. It has become standard practice in text compression, speech recognition, information retrieval, and many other applications of Shannon's theory to introduce a "bag-of-words" assumption. But obviously, word rates vary from genre to genre, author to author, topic to topic, document to document, section to section, and paragraph to paragraph. The proposed Poisson mixture captures much of this heterogeneous structure by allowing the Poisson parameter θ to vary over documents subject to a density function φ, which is intended to capture dependencies on hidden variables such as genre, author, topic, etc. (The Negative Binomial is a well-known special case where φ is a Γ distribution.) Poisson mixtures fit the data better than standard Poissons, producing more accurate estimates of the variance over documents (σ²), entropy (H), inverse document frequency (IDF), and adaptation (Pr(X ≥ 2 | X ≥ 1)).
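As a concrete illustration of the Negative Binomial special case, the sketch below fits per-document counts of a single word by the method of moments and compares the mixture's adaptation probability with a plain Poisson of the same mean. The counts are toy data, not from the paper; only NumPy and SciPy are assumed.

```python
# Sketch: Negative Binomial as a Poisson mixture with a Gamma mixing density.
# Hypothetical per-document counts of one word, fit by the method of moments.
import numpy as np
from scipy import stats

counts = np.array([0, 0, 1, 0, 3, 0, 0, 7, 1, 0, 0, 2])  # toy data

m = counts.mean()          # sample mean (a plain Poisson forces variance == mean)
v = counts.var(ddof=1)     # sample variance; v > m signals overdispersion
                           # (the fit below requires v > m)

# Negative Binomial with mean m and variance m + m^2 / r:
r = m**2 / (v - m)         # shape of the Gamma mixing density
p = r / (r + m)            # scipy's success-probability parameter

nb = stats.nbinom(r, p)
po = stats.poisson(m)
print(f"mean={m:.2f}, variance={v:.2f} (Poisson would predict {m:.2f})")

# Adaptation Pr(X >= 2 | X >= 1): higher under the mixture than under a
# Poisson with the same mean, matching the burstiness of words in real text.
adapt_nb = (1 - nb.cdf(1)) / (1 - nb.cdf(0))
adapt_po = (1 - po.cdf(1)) / (1 - po.cdf(0))
print(f"adaptation: negative binomial {adapt_nb:.2f} vs Poisson {adapt_po:.2f}")
```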


1993 ◽ Vol 24 (1) ◽ pp. 68-74 ◽ Author(s): Andrew Davison

2021 ◽ Vol 102 ◽ pp. 04013 ◽ Author(s): Md. Atiqur Rahman, Mohamed Hamada

With the advancement of telecommunication, modern daily-life activities produce large amounts of information. Storing this information on digital devices or transmitting it over the Internet is challenging, which makes data compression necessary, and research on data compression has therefore become a topic of great interest. Because compressed data is generally smaller than the original, compression saves storage and increases transmission speed. In this article, we propose a text compression technique using the GPT-2 language model and Huffman coding. In the proposed method, the Burrows-Wheeler transform and a list of keys are used to reduce the length of the original text file; we then apply the GPT-2 language model followed by Huffman coding for encoding. The proposed method is compared with state-of-the-art text compression techniques, and we show that it achieves a better compression ratio than the other state-of-the-art methods.
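The paper's full pipeline (the key list and the GPT-2 modelling stage) is not reproduced here; as a rough sketch of the two classical components the abstract names, the toy Python below applies a Burrows-Wheeler transform and then Huffman coding, and reports the resulting bit count. The sample string and all helper names are illustrative.

```python
# Sketch of two building blocks named in the abstract: the Burrows-Wheeler
# transform and Huffman coding. A toy illustration, not the paper's pipeline.
import heapq
from collections import Counter

def bwt(s: str, sentinel: str = "\0") -> str:
    """Burrows-Wheeler transform: last column of the sorted rotations."""
    s += sentinel  # unique end marker, needed for invertibility
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(row[-1] for row in rotations)

def huffman_codes(text: str) -> dict[str, str]:
    """Map each symbol to a prefix-free bit string, shorter for frequent symbols."""
    freq = Counter(text)
    if len(freq) == 1:  # degenerate single-symbol input
        return {next(iter(freq)): "0"}
    # Heap entries: (frequency, unique tiebreaker, {symbol: code-so-far}).
    heap = [(f, i, {sym: ""}) for i, (sym, f) in enumerate(freq.items())]
    heapq.heapify(heap)
    counter = len(heap)
    while len(heap) > 1:
        f1, _, c1 = heapq.heappop(heap)   # two least-frequent subtrees
        f2, _, c2 = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in c1.items()}
        merged.update({s: "1" + c for s, c in c2.items()})
        heapq.heappush(heap, (f1 + f2, counter, merged))
        counter += 1
    return heap[0][2]

text = "banana bandana"
transformed = bwt(text)                   # groups similar contexts together
codes = huffman_codes(transformed)
bits = "".join(codes[c] for c in transformed)
print(f"{len(transformed) * 8} bits raw -> {len(bits)} bits Huffman-coded")
```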

