Experimental results in Prediction by Partial Matching and Star transformation applied in lossless compression of text files

2019 ◽

pp. 25-38

Author(s):

И.В. Селиванова ◽

I.V. Selivanova ◽

Д.В. Косяков ◽

D.V. Kosyakov ◽

А.Е. Гуськов ◽

...

Keyword(s):

Partial Matching ◽

Prediction By Partial Matching

This paper investigates whether the semantic proximity of scientific texts can be established by automatic classification based on compressing their abstracts. The idea of the method is that compression algorithms of the PPM (prediction by partial matching) type compress terminologically close texts substantially better than distant ones. If a core of publications (an analogue of a training set) is formed for each topic under classification, the best compression ratio will indicate that the text being classified belongs to the corresponding topic. Thirty topical categories were defined; for each of them, abstracts of about 500 publications were obtained from the Scopus database, from which 100 abstracts for the core and 20 abstracts for testing were selected in various ways. It was found that building the core from highly cited publications yields an error rate of up to 12%, compared with 32% for random sampling. Classification quality also depends on the initial number of categories: the fewer categories involved in the classification and the greater the terminological differences between them, the higher the quality.
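The classification rule described in this abstract can be sketched roughly as follows. This is only an illustration under stated assumptions: the Python standard library has no PPM coder, so `zlib` (DEFLATE) is swapped in as a stand-in compressor, and the function names `compressed_size` and `classify` are hypothetical rather than taken from the paper.

```python
import zlib


def compressed_size(text: str) -> int:
    """Size in bytes of the text after compression.

    Assumption: zlib stands in for the PPM compressor used in the paper.
    """
    return len(zlib.compress(text.encode("utf-8"), 9))


def classify(text: str, cores: dict[str, str]) -> str:
    """Return the topic whose core lets the text compress best.

    `cores` maps a topic name to the concatenated abstracts of its core.
    The score is the number of extra compressed bytes needed to encode
    the text once the core is already known, per byte of the text.
    """
    text_len = len(text.encode("utf-8"))
    best_topic, best_score = None, float("inf")
    for topic, core in cores.items():
        extra = compressed_size(core + " " + text) - compressed_size(core)
        score = extra / text_len
        if score < best_score:
            best_topic, best_score = topic, score
    return best_topic
```

A lower score means the text shares more vocabulary with that core. With real data, each core would hold the 100 selected abstracts for its category, and a PPM implementation would replace `zlib`.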


Lossless Compression for Text and Images

International Journal of High Speed Electronics and Systems ◽

10.1142/s0129156497000068 ◽

1997 ◽

Vol 08 (01) ◽

pp. 179-231 ◽

Cited By ~ 6

Author(s):

Alistair Moffat ◽

Timothy C. Bell ◽

Ian H. Witten

Keyword(s):

Remote Sensing ◽

Lossless Compression ◽

Lossy Compression ◽

International Standards ◽

Experimental Results ◽

Extensive Discussion ◽

Standard Methods ◽

Sensing Applications ◽

Other Information ◽

Remote Sensing Applications

Most data that is inherently discrete needs to be compressed in such a way that it can be recovered exactly, without any loss. Examples include text of all kinds, experimental results, and statistical databases. Other forms of data may need to be stored exactly, such as images—particularly bilevel ones, or ones arising in medical and remote-sensing applications, or ones that may be required to be certified true for legal reasons. Moreover, during the process of lossy compression, many occasions for lossless compression of coefficients or other information arise. This paper surveys techniques for lossless compression. The process of compression can be broken down into modeling and coding. We provide an extensive discussion of coding techniques, and then introduce methods of modeling that are appropriate for text and images. Standard methods used in popular utilities (in the case of text) and international standards (in the case of images) are described.
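The split into modeling and coding mentioned in this abstract can be illustrated with a minimal sketch. It is not taken from the survey: it assumes a simple adaptive order-0 byte model with add-one smoothing, and instead of running a real coder it sums the ideal code length, -log2(p), which an arithmetic coder would approach. The function name `ideal_code_length_bits` is hypothetical.

```python
import math
from collections import Counter


def ideal_code_length_bits(data: bytes) -> float:
    """Total ideal code length of `data` under an adaptive order-0 model.

    Modeling: estimate each byte's probability from the counts seen so far,
    with add-one smoothing over the 256 possible byte values.
    Coding: charge -log2(p) bits per byte, the cost an ideal coder would pay.
    """
    counts = Counter()
    total = 0
    bits = 0.0
    for symbol in data:
        p = (counts[symbol] + 1) / (total + 256)
        bits += -math.log2(p)
        # Update the model only after "coding" the symbol, so a decoder
        # could reproduce the same probabilities.
        counts[symbol] += 1
        total += 1
    return bits
```

Under this model, repetitive input costs fewer bits than varied input of the same length, because the model's predictions improve as it sees the same bytes again.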
