compressed data
Recently Published Documents

TOTAL DOCUMENTS: 343 (FIVE YEARS: 98)
H-INDEX: 20 (FIVE YEARS: 4)
2022 ◽ Vol 16 (2) ◽ pp. 1-21
Author(s): Michael Nelson, Sridhar Radhakrishnan, Chandra Sekharan, Amlan Chatterjee, Sudhindra Gopal Krishna

Time-evolving web and social network graphs are modeled as a set of pages/individuals (nodes) and their arcs (links/relationships) that change over time. Because of their popularity, these graphs have become increasingly massive in terms of their numbers of nodes, arcs, and lifetimes. However, they are extremely sparse throughout their lifetimes: Facebook, for example, is estimated to have over a billion vertices, yet at any point in time it has far less than 0.001% of all possible relationships. The space required to store these large sparse graphs may not fit in most main memories when underlying representations such as a series of adjacency matrices or adjacency lists are used. We propose a compressed data structure with a compressed binary tree corresponding to each row of each adjacency matrix of the time-evolving graph. We do not explicitly construct the adjacency matrices; our algorithms take the time-evolving arc-list representation as input for the construction. Our compressed structure supports directed and undirected graphs, fast arc and neighborhood queries, and the addition and removal of arcs and frames directly on the compressed structure (streaming operations). We use publicly available network data sets such as Flickr, Yahoo!, and Wikipedia in our experiments and show that our new technique performs as well as or better than our benchmarks on all datasets in terms of compression size and other vital metrics.
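The authors' compressed binary trees are not spelled out in the abstract, so the following is only a minimal sketch of the general idea under stated assumptions: one small binary trie over column-index bits (a hypothetical stand-in for the per-row compressed tree) for each adjacency-matrix row, built directly from an arc list and supporting arc queries and streaming updates without ever materializing the matrices.

```python
# Illustrative sketch only: a per-row binary trie keyed on the bits of the
# column index. This is NOT the authors' compressed structure; it merely shows
# how arc queries and streaming updates can work without materializing any
# adjacency matrix. All names and the 32-bit column bound are assumptions.

class RowTrie:
    """One binary tree per adjacency-matrix row; sparse rows stay small."""

    def __init__(self, bits=32):
        self.bits = bits                  # bits used to encode a column index
        self.root = {}                    # nested dicts act as trie nodes

    def _path(self, col):
        # Most-significant bit first, so neighboring columns share prefixes.
        return [(col >> i) & 1 for i in reversed(range(self.bits))]

    def add(self, col):
        node = self.root
        for b in self._path(col):
            node = node.setdefault(b, {})
        node["leaf"] = True

    def remove(self, col):
        node = self.root
        for b in self._path(col):
            if b not in node:
                return                    # arc was never present
            node = node[b]
        node.pop("leaf", None)            # no path pruning in this sketch

    def has_arc(self, col):
        node = self.root
        for b in self._path(col):
            if b not in node:
                return False
            node = node[b]
        return "leaf" in node


class TimeEvolvingGraph:
    """One dict of RowTries per time frame, built straight from an arc list."""

    def __init__(self):
        self.frames = {}                  # frame id -> {source node -> RowTrie}

    def add_arc(self, frame, u, v):
        row = self.frames.setdefault(frame, {}).setdefault(u, RowTrie())
        row.add(v)

    def has_arc(self, frame, u, v):
        row = self.frames.get(frame, {}).get(u)
        return row is not None and row.has_arc(v)


# Usage: build one frame from (frame, source, target) arcs and query it.
g = TimeEvolvingGraph()
for f, u, v in [(0, 1, 5), (0, 1, 9), (0, 2, 5)]:
    g.add_arc(f, u, v)
print(g.has_arc(0, 1, 5), g.has_arc(0, 2, 9))    # True False
```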


2022 ◽ Vol 118 ◽ pp. 102999
Author(s): Yaomei Wang, Worakanok Thanyamanta, Neil Bose

2021 ◽ Vol 2021 ◽ pp. 1-14
Author(s): Li Guo, Kunlin Zhu, Ruijun Duan

To explore economic development trends in the post-epidemic era, this paper improves a traditional clustering algorithm and constructs an intelligent-algorithm-based model for analyzing post-epidemic economic development trends. To solve the clustering problem for large-scale data sets of non-uniform density, the paper proposes an adaptive non-uniform density clustering algorithm based on balanced iterative reduction and uses it to further cluster the compressed data sets. For large-scale data sets, the clustering results accurately reflect the class characteristics of the data set as a whole, and the algorithm greatly improves the time efficiency of clustering. The results show that the improved clustering algorithm is effective for analyzing economic development trends in the post-epidemic era and can continue to play a role in subsequent economic analysis.
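The paper's adaptive algorithm is not detailed in the abstract; the sketch below only illustrates, with assumed parameters and synthetic data, the generic two-stage idea it builds on: a balanced-iterative-reduction (BIRCH-style) pass that compresses the data set into subcluster summaries, followed by density clustering of those summaries.

```python
# Hedged sketch, not the paper's algorithm: BIRCH compresses a large,
# non-uniform-density data set into subcluster centers, then a density-based
# clusterer (DBSCAN here) runs on the compressed summaries. All parameters
# (threshold, eps, min_samples) are arbitrary choices for the demo.
import numpy as np
from sklearn.cluster import Birch, DBSCAN
from sklearn.datasets import make_blobs

# Synthetic stand-in for a large data set with non-uniform density.
X, _ = make_blobs(n_samples=20_000, centers=5,
                  cluster_std=[0.4, 0.6, 1.0, 1.5, 2.0], random_state=0)

# Stage 1: balanced iterative reduction compresses the points.
birch = Birch(threshold=0.5, n_clusters=None)
birch.fit(X)
summaries = birch.subcluster_centers_
print(f"{len(X)} points compressed to {len(summaries)} subcluster centers")

# Stage 2: density clustering on the compressed summaries only.
labels = DBSCAN(eps=1.0, min_samples=3).fit_predict(summaries)
print("clusters found on the compressed data:", len(set(labels) - {-1}))
```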


2021 ◽ Vol 26 (1) ◽ pp. 1-47
Author(s): Diego Arroyuelo, Rodrigo Cánovas, Johannes Fischer, Dominik Köppl, Marvin Löbel, ...

The Lempel-Ziv 78 (LZ78) and Lempel-Ziv-Welch (LZW) text factorizations are popular not only for bare compression but also for building compressed data structures on top of them. Their regular factor structure makes them computable within space bounded by the size of the compressed output. In this article, we carry out the first thorough study of low-memory LZ78 and LZW text factorization algorithms, introducing more efficient alternatives to the classical methods as well as new techniques that can run in less memory than is needed to hold the compressed file. Our results build on hash-based representations of tries that may be of independent interest.
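As a point of reference for the factorization being discussed (not the article's low-memory algorithms), here is a minimal LZ78 factorizer; a plain Python dict stands in for the hash-based tries mentioned above.

```python
# Minimal reference LZ78 factorization: each factor is (index of the longest
# previously seen factor, next character), with factor 0 being the empty
# string. The dict maps (parent factor id, char) -> factor id, i.e. it plays
# the role of a hash-based trie.
def lz78_factorize(text):
    trie = {}
    factors = []
    node = 0                                     # current factor, start empty
    for ch in text:
        if (node, ch) in trie:
            node = trie[(node, ch)]              # extend the current factor
        else:
            trie[(node, ch)] = len(factors) + 1
            factors.append((node, ch))           # emit factor: parent + new char
            node = 0                             # restart from the empty factor
    if node:                                     # flush a pending partial factor
        factors.append((node, ""))
    return factors

def lz78_decode(factors):
    phrases = [""]
    out = []
    for parent, ch in factors:
        phrase = phrases[parent] + ch
        phrases.append(phrase)
        out.append(phrase)
    return "".join(out)

encoded = lz78_factorize("abababababa")
print(encoded)                                   # [(0,'a'), (0,'b'), (1,'b'), ...]
print(lz78_decode(encoded) == "abababababa")     # True
```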


2021 ◽ Vol 17 (10) ◽ pp. e1009524
Author(s): Shian Su, Quentin Gouil, Marnie E. Blewitt, Dianne Cook, Peter F. Hickey, ...

A key benefit of long-read nanopore sequencing technology is the ability to detect modified DNA bases, such as 5-methylcytosine. The lack of R/Bioconductor tools for effective visualization of nanopore methylation profiles between samples from different experimental groups led us to develop the NanoMethViz R package. Our software can handle methylation output generated by a range of methylation callers and manages large datasets using a compressed data format. To fully explore the methylation patterns in a dataset, NanoMethViz allows data to be plotted at various resolutions. At the sample level, we use dimensionality reduction to examine the relationships between methylation profiles in an unsupervised way. We visualize methylation profiles of classes of features, such as genes or CpG islands, by scaling them to relative positions and aggregating their profiles. At the finest resolution, we visualize methylation patterns across individual reads along the genome using spaghetti plots and heatmaps, allowing users to explore particular genes or genomic regions of interest. In summary, our software makes the handling of methylation signal more convenient, expands the visualization options for nanopore data, and works seamlessly with existing methylation analysis tools available in the Bioconductor project. Our software is available at https://bioconductor.org/packages/NanoMethViz.


2021 ◽ pp. 89-104
Author(s): Yoshimasa Takabatake, Tomohiro I, Hiroshi Sakamoto

We survey our recent work related to information processing on compressed strings. Note that a "string" here means any fixed-length sequence of symbols and therefore includes not only ordinary text but also a wide range of data, such as pixel sequences and time-series data. Over the past two decades, a variety of algorithms and applications have been proposed for compressed information processing. In this survey, we focus mainly on two problems: recompression and privacy-preserving computation over compressed strings. Recompression is a framework in which algorithms transform given compressed data into another compressed format without decompression. Recent studies have shown that a higher compression ratio can be achieved at lower cost by using an appropriate recompression algorithm as a preprocessing step. Furthermore, various privacy-preserving computation models have been proposed for information retrieval, similarity computation, and pattern mining.


2021 ◽
Author(s): Enrico Pomarico, Cédric Schmidt, Florian Chays, David Nguyen, Arielle Planchette, ...

The growth of data throughput in optical microscopy has triggered the extensive use of supervised learning (SL) models on compressed datasets for automated analysis. Investigating the effects of image compression on SL predictions is therefore pivotal to assessing their reliability, especially for clinical use. We quantify the statistical distortions induced by compression by comparing predictions on compressed data with the raw predictive uncertainty, numerically estimated from the raw noise statistics measured via sensor calibration. Predictions on cell segmentation parameters are altered by up to 15% and by more than 10 standard deviations after 16-to-8 bit pixel depth reduction and 10:1 JPEG compression. JPEG formats with higher compression ratios show significantly larger distortions. Interestingly, a recent metrologically accurate algorithm, offering up to a 10:1 compression ratio, provides a prediction spread equivalent to that stemming from raw noise. The method described here allows a lower bound to be set on the predictive uncertainty of an SL task and can be generalized to determine the statistical distortions originating from a variety of processing pipelines in AI-assisted fields.
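The paper's calibrated pipeline cannot be reproduced from the abstract alone; the toy sketch below, with an entirely synthetic image and an arbitrary thresholding "prediction", only shows the kind of raw-versus-compressed comparison being described (16-to-8 bit depth reduction followed by lossy JPEG).

```python
# Illustrative sketch only (not the paper's calibrated pipeline): compare a
# crude segmentation-style "prediction" on a raw 16-bit image against the same
# prediction after 16-to-8 bit depth reduction plus lossy JPEG compression.
# The synthetic image, the threshold and the quality setting are assumptions.
import io
import numpy as np
from PIL import Image

rng = np.random.default_rng(0)
raw16 = (rng.normal(2000, 200, size=(256, 256))
         + 6000 * (np.indices((256, 256)).sum(axis=0) > 300)).astype(np.uint16)

def segmented_area(img, threshold):
    return int((img > threshold).sum())          # pixels above threshold

area_raw = segmented_area(raw16, 4000)

# 16-to-8 bit depth reduction followed by a JPEG round trip.
img8 = (raw16 // 256).astype(np.uint8)
buf = io.BytesIO()
Image.fromarray(img8, mode="L").save(buf, format="JPEG", quality=30)
buf.seek(0)
jpeg8 = np.asarray(Image.open(buf))

area_jpeg = segmented_area(jpeg8, 4000 // 256)   # threshold rescaled to 8 bits
print(f"relative change in segmented area: "
      f"{100 * (area_jpeg - area_raw) / area_raw:+.2f}%")
```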


2021 ◽
Author(s): Alex Marchioni, Andriy Enttsel, Mauro Mangia, Riccardo Rovatti, Gianluca Setti

We analyze the effect of lossy compression in the processing of sensor signals that must be used to detect anomalous events in the system under observation. The intuitive relationship between the quality loss at higher compression and the ability to tell anomalous behaviours from normal ones is formalized in terms of information-theoretic quantities. Some analytic derivations are made within the Gaussian framework, and in some cases in the asymptotic regime with respect to the length of the signals considered. The analytical conclusions are matched with the performance of practical detectors in a toy case, allowing the assessment of different compression/detector configurations.
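A toy rendition of that trade-off, under assumptions of our own (Gaussian signals, a mean-shift anomaly, compression that keeps only the largest DCT coefficients, and a plain energy detector), might look as follows; it is merely an illustration of comparing compression/detector configurations, not the authors' setup.

```python
# Toy sketch: how much detection power a simple energy detector loses when the
# sensor signals are lossily compressed before detection. The signal model,
# the DCT-based compressor and the detector are all assumptions for the demo.
import numpy as np
from scipy.fft import dct, idct

rng = np.random.default_rng(1)
n, trials = 128, 2000

def compress(x, k):
    c = dct(x, norm="ortho")
    small = np.zeros_like(c)
    keep = np.argsort(np.abs(c))[-k:]            # k largest coefficients survive
    small[keep] = c[keep]
    return idct(small, norm="ortho")

def detection_rate(k=None, shift=0.8):
    normal_scores, anomalous_scores = [], []
    for _ in range(trials):
        normal = rng.normal(0.0, 1.0, n)
        anomalous = normal + shift               # anomaly: shifted mean
        if k is not None:
            normal, anomalous = compress(normal, k), compress(anomalous, k)
        normal_scores.append(np.mean(normal ** 2))
        anomalous_scores.append(np.mean(anomalous ** 2))
    threshold = np.quantile(normal_scores, 0.95) # fix a 5% false-alarm rate
    return float(np.mean(np.array(anomalous_scores) > threshold))

print("detection rate, raw signals:        ", detection_rate())
print("detection rate, 8:1 DCT compression:", detection_rate(k=16))
```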


2021 ◽ Vol 75 (3) ◽ pp. 100-107
Author(s): B.-B.S. Yesmagambetov

When huge data streams are processed in information systems, individual measurements or whole groups of measurements can be distorted or lost for various reasons. The recovery of compressed data transmitted over communication channels is accompanied by errors related to distortion of the information and service parts of messages, caused by interference in the transmission channel. Added to these are errors caused by level quantization and time sampling of the transmitted realizations. Research on methods of increasing noise immunity, both during transmission and during recovery of the measured data, is therefore an important task in the design of information and measurement systems. The article considers non-parametric methods of estimating the probabilistic characteristics of random processes. A distinctive feature of non-parametric methods is the ranking of the data measured over the observation interval. It is shown that ranking the data on the transmitting side of the information-measuring system enables the correction of errors and failures based on the strict monotonicity of the ranked sequence of codes. The error in recovering continuous realizations, taking into account distortions of the compressed data in the communication channel, is also investigated. The results indicate that the use of complex compression algorithms is impractical, since the difference in the non-stationary message reconstruction error between the simplest algorithm and a rather complex one becomes negligible. The article presents the results of estimating recovery errors for various data compression methods.
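The article's method is only outlined in the abstract; the sketch below illustrates, with an assumed channel model and repair rule, how ranked (non-decreasing) codes let a receiver flag values that violate monotonicity and repair them by interpolation.

```python
# Hedged illustration of the rank-based idea above: the transmitter sends the
# ranked (sorted) codes of one observation interval, so the receiver knows the
# sequence must be non-decreasing; codes that break that order are treated as
# corrupted and interpolated. Channel model and repair rule are assumptions.
import numpy as np

rng = np.random.default_rng(42)

sent = np.sort(rng.integers(0, 256, size=32))          # ranked 8-bit codes, as sent
received = sent.copy()
received[[5, 17, 25]] = rng.integers(0, 256, size=3)   # channel corrupts three codes

def repair_monotone(codes):
    """Keep the longest non-decreasing subsequence; interpolate the rest."""
    n = len(codes)
    best = [1] * n                               # best subsequence length ending at i
    prev = [-1] * n
    for i in range(n):
        for j in range(i):
            if codes[j] <= codes[i] and best[j] + 1 > best[i]:
                best[i], prev[i] = best[j] + 1, j
    keep = np.zeros(n, dtype=bool)
    i = int(np.argmax(best))
    while i != -1:                               # backtrack the kept positions
        keep[i] = True
        i = prev[i]
    idx = np.arange(n)
    out = codes.astype(float)
    out[~keep] = np.interp(idx[~keep], idx[keep], out[keep])
    return np.rint(out).astype(int), ~keep

repaired, flagged = repair_monotone(received)
print("flagged positions:          ", np.flatnonzero(flagged))
print("max abs error before repair:", int(np.max(np.abs(received - sent))))
print("max abs error after repair: ", int(np.max(np.abs(repaired - sent))))
```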

