An Approach to Efficient Dictionary Utilization and Improved Data Compression Technique for LZW Algorithm

This paper proposes an improved data compression technique compared to existing Lempel-Ziv-Welch (LZW) algorithm. LZW is a dictionary-updation based compression technique which stores elements from the data in the form of codes and uses them when those strings recur again. When the dictionary gets full, every element in the dictionary are removed in order to update dictionary with new entry. Therefore, the conventional method doesn’t consider frequently used strings and removes all the entry. This method is not an effective compression when the data to be compressed are large and when there are more frequently occurring string. This paper presents two new methods which are an improvement for the existing LZW compression algorithm. In this method, when the dictionary gets full, the elements that haven’t been used earlier are removed rather than removing every element of the dictionary which happens in the existing LZW algorithm. This is achieved by adding a flag to every element of the dictionary. Whenever an element is used the flag is set high. Thus, when the dictionary gets full, the dictionary entries where the flag was set high are kept and others are discarded. In the first method, the entries are discarded abruptly, whereas in the second method the unused elements are removed once at a time. Therefore, the second method gives enough time for the nascent elements of the dictionary. These techniques all fetch similar results when data set is small. This happens due to the fact that difference in the way they handle the dictionary when it’s full. Thus these improvements fetch better results only when a relatively large data is used. When all the three techniques' models were used to compare a data set with yields best case scenario, the compression ratios of conventional LZW is small compared to improved LZW method-1 and which in turn is small compared to improved LZW method-2.

Download Full-text

USE OF NEW EFFICIENT LOSSLESS DATA COMPRESSION METHOD IN TRANSMITTING ENCRYPTED BAPTISTA SYMMETRIC CHAOTIC CRYPTOSYSTEM DATA

Jurnal Teknologi ◽

10.11113/jt.v78.8976 ◽

2016 ◽

Vol 78 (6-4) ◽

Author(s):

Muhamad Azlan Daud ◽

Muhammad Rezal Kamel Ariffin ◽

S. Kularajasingam ◽

Che Haziqah Che Hussin ◽

Nurliyana Juhan ◽

...

Keyword(s):

Data Compression ◽

Data Transmission ◽

Compression Algorithm ◽

Huffman Coding ◽

Compression Method ◽

Compression Technique ◽

Chaotic Dynamical System ◽

Fast Encoding ◽

Compression Mechanism ◽

Lossless Data Compression

A new compression algorithm used to ensure a modified Baptista symmetric cryptosystem which is based on a chaotic dynamical system to be applicable is proposed. The Baptista symmetric cryptosystem able to produce various ciphers responding to the same message input. This modified Baptista type cryptosystem suffers from message expansion that goes against the conventional methodology of a symmetric cryptosystem. A new lossless data compression algorithm based on theideas from the Huffman coding for data transmission is proposed.This new compression mechanism does not face the problem of mapping elements from a domain which is much larger than its range.Our new algorithm circumvent this problem via a pre-defined codeword list. The purposed algorithm has fast encoding and decoding mechanism and proven analytically to be a lossless data compression technique.

Download Full-text

A Connectome of the Adult Drosophila Central Brain

10.1101/2020.01.21.911859 ◽

2020 ◽

Cited By ~ 34

Author(s):

C. Shan Xu ◽

Michal Januszewski ◽

Zhiyuan Lu ◽

Shin-ya Takemura ◽

Kenneth J. Hayworth ◽

...

Keyword(s):

Large Fraction ◽

Large Data ◽

Fruit Fly ◽

Cell Types ◽

Brain Regions ◽

Central Complex ◽

Data Set ◽

New Methods ◽

Central Brain ◽

The Brain

AbstractThe neural circuits responsible for behavior remain largely unknown. Previous efforts have reconstructed the complete circuits of small animals, with hundreds of neurons, and selected circuits for larger animals. Here we (the FlyEM project at Janelia and collaborators at Google) summarize new methods and present the complete circuitry of a large fraction of the brain of a much more complex animal, the fruit fly Drosophila melanogaster. Improved methods include new procedures to prepare, image, align, segment, find synapses, and proofread such large data sets; new methods that define cell types based on connectivity in addition to morphology; and new methods to simplify access to a large and evolving data set. From the resulting data we derive a better definition of computational compartments and their connections; an exhaustive atlas of cell examples and types, many of them novel; detailed circuits for most of the central brain; and exploration of the statistics and structure of different brain compartments, and the brain as a whole. We make the data public, with a web site and resources specifically designed to make it easy to explore, for all levels of expertise from the expert to the merely curious. The public availability of these data, and the simplified means to access it, dramatically reduces the effort needed to answer typical circuit questions, such as the identity of upstream and downstream neural partners, the circuitry of brain regions, and to link the neurons defined by our analysis with genetic reagents that can be used to study their functions.Note: In the next few weeks, we will release a series of papers with more involved discussions. One paper will detail the hemibrain reconstruction with more extensive analysis and interpretation made possible by this dense connectome. Another paper will explore the central complex, a brain region involved in navigation, motor control, and sleep. A final paper will present insights from the mushroom body, a center of multimodal associative learning in the fly brain.

Download Full-text

Second compression for pixelated images under edge-based compression algorithms: JPEG-LS as an example

Journal of Intelligent & Fuzzy Systems ◽

10.3233/jifs-201563 ◽

2021 ◽

pp. 1-9

Author(s):

Kamal Al-Khayyat ◽

Imad Al-Shaikhli ◽

Mohamad Al-Hagery

Keyword(s):

Data Compression ◽

Compression Ratio ◽

Lossless Compression ◽

General Purpose ◽

Compression Algorithm ◽

Random Data ◽

Data Set ◽

Compression Algorithms ◽

Edge Based

This paper details the examination of a particular case of data compression, where the compression algorithm removes the redundancy from data, which occurs when edge-based compression algorithms compress (previously compressed) pixelated images. The newly created redundancy can be removed using another round of compression. This work utilized the JPEG-LS as an example of an edge-based compression algorithm for compressing pixelated images. The output of this process was subjected to another round of compression using a more robust but slower compressor (PAQ8f). The compression ratio of the second compression was, on average, 18%, which is high for random data. The results of the second compression were superior to the lossy JPEG. Under the used data set, lossy JPEG needs to sacrifice 10% on average to realize nearly total lossless compression ratios of the two-successive compressions. To generalize the results, fast general-purpose compression algorithms (7z, bz2, and Gzip) were used too.

Download Full-text

The Building Genome Project: Indentify faults in building energy performance

Intersections Between the Academy and Practice, Papers from the 2017 AIA/ACSA Intersections Symposium ◽

10.35483/acsa.aia.inter.17.2 ◽

2017 ◽

Author(s):

Grant Mosey ◽

◽

Brian Deal ◽

Keyword(s):

Large Data ◽

Building Energy ◽

Energy Performance ◽

Genome Project ◽

Learning Processes ◽

Data Set ◽

Building Energy Performance ◽

New Methods ◽

Mechanical Faults ◽

Larger Sample

This paper explores the use of new tools for the creation of novel methods of identifying faults in building energy performance remotely. With the rise in availability of interval utility data and the proliferation of machine learning processes, new methods are arising which promise to bridge the gap between architects, engineers, auditors, operators, and utility personnel. Utility use information, viewed with sufficient granularity, can offer a sort of “genome, ”that is a set of “genes” which are unique to a given building and can be decoded to provide information about the building’s performance. The applications of algorithms to a large data set of these “genomes” can identify patterns across many buildings, providing the opportunity for identifying mechanical faults in a much larger sample of buildings that could previously be evaluated using traditional methods.

Download Full-text

Clustering large data sets based on data compression technique and weighted quality measures

2009 IEEE International Conference on Fuzzy Systems ◽

10.1109/fuzzy.2009.5277208 ◽

2009 ◽

Cited By ~ 2

Author(s):

M. Sassi ◽

A. Grissa

Keyword(s):

Data Compression ◽

Large Data ◽

Quality Measures ◽

Large Data Sets ◽

Data Sets ◽

Compression Technique

Download Full-text

Some statistical and CI models to predict chaotic high-frequency financial data

Journal of Intelligent & Fuzzy Systems ◽

10.3233/jifs-189107 ◽

2020 ◽

Vol 39 (5) ◽

pp. 6419-6430

Author(s):

Dusan Marcek

Keyword(s):

Time Series Data ◽

Moving Average ◽

Methodological Approach ◽

Back Propagation ◽

Large Data ◽

Series Data ◽

Data Set ◽

Training Time ◽

Optimal Population ◽

Forecast Time

To forecast time series data, two methodological frameworks of statistical and computational intelligence modelling are considered. The statistical methodological approach is based on the theory of invertible ARIMA (Auto-Regressive Integrated Moving Average) models with Maximum Likelihood (ML) estimating method. As a competitive tool to statistical forecasting models, we use the popular classic neural network (NN) of perceptron type. To train NN, the Back-Propagation (BP) algorithm and heuristics like genetic and micro-genetic algorithm (GA and MGA) are implemented on the large data set. A comparative analysis of selected learning methods is performed and evaluated. From performed experiments we find that the optimal population size will likely be 20 with the lowest training time from all NN trained by the evolutionary algorithms, while the prediction accuracy level is lesser, but still acceptable by managers.

Download Full-text

Research on Data Compression Algorithm for Wireless Sensor Networks Based on Optimal Order Estimation and Distributed Clustering

JOURNAL OF ELECTRONICS INFORMATION TECHNOLOGY ◽

10.3724/sp.j.1146.2010.00529 ◽

2011 ◽

Vol 33 (3) ◽

pp. 569-574

Author(s):

Peng Jiang ◽

Sheng-qiang Li

Keyword(s):

Wireless Sensor Networks ◽

Sensor Networks ◽

Data Compression ◽

Compression Algorithm ◽

Wireless Sensor ◽

Optimal Order ◽

Distributed Clustering ◽

Order Estimation

Download Full-text

Test data compression technique based on partial data block reuse in SoC

JOURNAL OF ELECTRONIC MEASUREMENT AND INSTRUMENT ◽

10.3724/sp.j.1187.2010.00487 ◽

2010 ◽

Vol 24 (5) ◽

pp. 487-493

Author(s):

Yiming Ouyang ◽

Xi'e Huang ◽

Huaguo Liang ◽

Baosheng Zou

Keyword(s):

Data Compression ◽

Test Data ◽

Data Block ◽

Test Data Compression ◽

Compression Technique ◽

Partial Data

Download Full-text

In silico Prediction of Inhibitory Constant of Thrombin Inhibitors Using Machine Learning

Combinatorial Chemistry & High Throughput Screening ◽

10.2174/1386207322666181220130232 ◽

2019 ◽

Vol 21 (9) ◽

pp. 662-669 ◽

Cited By ~ 1

Author(s):

Junnan Zhao ◽

Lu Zhu ◽

Weineng Zhou ◽

Lingfeng Yin ◽

Yuchen Wang ◽

...

Keyword(s):

Machine Learning ◽

Prediction Models ◽

Regression Tree ◽

Large Data ◽

Thrombin Inhibitors ◽

Coagulation Cascade ◽

Gradient Boosting ◽

Support Vector ◽

Data Set ◽

Descriptor Selection

Background: Thrombin is the central protease of the vertebrate blood coagulation cascade, which is closely related to cardiovascular diseases. The inhibitory constant Ki is the most significant property of thrombin inhibitors. Method: This study was carried out to predict Ki values of thrombin inhibitors based on a large data set by using machine learning methods. Taking advantage of finding non-intuitive regularities on high-dimensional datasets, machine learning can be used to build effective predictive models. A total of 6554 descriptors for each compound were collected and an efficient descriptor selection method was chosen to find the appropriate descriptors. Four different methods including multiple linear regression (MLR), K Nearest Neighbors (KNN), Gradient Boosting Regression Tree (GBRT) and Support Vector Machine (SVM) were implemented to build prediction models with these selected descriptors. Results: The SVM model was the best one among these methods with R2=0.84, MSE=0.55 for the training set and R2=0.83, MSE=0.56 for the test set. Several validation methods such as yrandomization test and applicability domain evaluation, were adopted to assess the robustness and generalization ability of the model. The final model shows excellent stability and predictive ability and can be employed for rapid estimation of the inhibitory constant, which is full of help for designing novel thrombin inhibitors.

Download Full-text

Correlation between the structure and skin permeability of compounds

Scientific Reports ◽

10.1038/s41598-021-89587-5 ◽

2021 ◽

Vol 11 (1) ◽

Author(s):

Ruolan Zeng ◽

Jiyong Deng ◽

Limin Dang ◽

Xinliang Yu

Keyword(s):

Large Data ◽

Qsar Model ◽

Coefficient Of Determination ◽

Support Vector ◽

Skin Permeability ◽

Data Set ◽

Test Set ◽

Svm Algorithm ◽

Svm Model ◽

Toxicity Relationship

AbstractA three-descriptor quantitative structure–activity/toxicity relationship (QSAR/QSTR) model was developed for the skin permeability of a sufficiently large data set consisting of 274 compounds, by applying support vector machine (SVM) together with genetic algorithm. The optimal SVM model possesses the coefficient of determination R2 of 0.946 and root mean square (rms) error of 0.253 for the training set of 139 compounds; and a R2 of 0.872 and rms of 0.302 for the test set of 135 compounds. Compared with other models reported in the literature, our SVM model shows better statistical performance in a model that deals with more samples in the test set. Therefore, applying a SVM algorithm to develop a nonlinear QSAR model for skin permeability was achieved.

Download Full-text