sequence compression
Recently Published Documents


Total documents: 122 (five years: 1)

H-index: 13 (five years: 0)

GigaScience ◽  
2020 ◽  
Vol 9 (11) ◽  
Author(s):  
Milton Silva ◽  
Diogo Pratas ◽  
Armando J Pinho

Abstract. Background: The increasing production of genomic data has led to an intensified need for models that can cope efficiently with the lossless compression of DNA sequences. Important applications include long-term storage and compression-based data analysis. In the literature, only a few recent articles propose the use of neural networks for DNA sequence compression. However, they fall short when compared with specific DNA compression tools, such as GeCo2. This limitation is due to the absence of models specifically designed for DNA sequences. In this work, we combine the power of neural networks with specific DNA models. For this purpose, we created GeCo3, a new genomic sequence compressor that uses neural networks for mixing multiple context and substitution-tolerant context models. Findings: We benchmark GeCo3 as a reference-free DNA compressor in 5 datasets, including a balanced and comprehensive dataset of DNA sequences, the Y-chromosome and human mitogenome, 2 compilations of archaeal and virus genomes, 4 whole genomes, and 2 collections of FASTQ data of a human virome and ancient DNA. GeCo3 achieves a solid improvement in compression over the previous version (GeCo2) of 2.4%, 7.1%, 6.1%, 5.8%, and 6.0%, respectively. To test its performance as a reference-based DNA compressor, we benchmark GeCo3 in 4 datasets constituted by the pairwise compression of the chromosomes of the genomes of several primates. GeCo3 improves the compression by 12.4%, 11.7%, 10.8%, and 10.1% over the state of the art. The cost of this compression improvement is some additional computational time (1.7–3 times slower than GeCo2). The RAM use is constant, and the tool scales efficiently, independently of the sequence size. Overall, these values outperform the state of the art. Conclusions: GeCo3 is a genomic sequence compressor with a neural network mixing approach that provides additional gains over top specific genomic compressors.
The proposed mixing method is portable, requiring only the probabilities of the models as inputs, providing easy adaptation to other data compressors or compression-based data analysis tools. GeCo3 is released under GPLv3 and is available for free download at https://github.com/cobilab/geco3.
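The idea of a mixer that takes only the models' probabilities as inputs can be illustrated with a minimal sketch. This is not GeCo3's network: it assumes a single softmax layer over log-probabilities, two simple order-k context models with Laplace smoothing, and a sequence containing only A, C, G, and T. The names `context_probs`, `bits_per_base`, the model orders, and the learning rate are all hypothetical choices made for illustration.

```python
import math

ALPHABET = "ACGT"  # assumption: the input contains only these four bases


def context_probs(counts, context):
    """Laplace-smoothed next-base probabilities for one context model."""
    row = counts.get(context, [0, 0, 0, 0])
    total = sum(row) + len(ALPHABET)
    return [(c + 1) / total for c in row]


def softmax(scores):
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    norm = sum(exps)
    return [e / norm for e in exps]


def bits_per_base(seq, orders=(1, 3), lr=0.01):
    """Estimate the code length (bits per base) of the mixed prediction
    and of each individual model on a non-empty sequence."""
    k_sym = len(ALPHABET)
    counts = [{} for _ in orders]
    n_inputs = k_sym * len(orders)
    # One weight row per output base. The initial weights make the mixer
    # start as a normalized geometric mean of the models.
    weights = [[0.0] * n_inputs for _ in range(k_sym)]
    for j in range(k_sym):
        for m in range(len(orders)):
            weights[j][m * k_sym + j] = 1.0 / len(orders)

    mixed_bits = 0.0
    model_bits = [0.0] * len(orders)
    for t, base in enumerate(seq):
        y = ALPHABET.index(base)
        contexts = [seq[max(0, t - k):t] for k in orders]
        probs = [context_probs(c, ctx) for c, ctx in zip(counts, contexts)]
        # The mixer sees only the models' probabilities (here as logs).
        x = [math.log(p) for model in probs for p in model]
        mixed = softmax([sum(w * v for w, v in zip(row, x)) for row in weights])

        # Ideal code length of the observed base (floored to avoid log(0)).
        mixed_bits -= math.log2(max(mixed[y], 1e-12))
        for m, model in enumerate(probs):
            model_bits[m] -= math.log2(model[y])

        # Online gradient step on the cross-entropy loss of the mixer.
        for j, row in enumerate(weights):
            err = mixed[j] - (1.0 if j == y else 0.0)
            for i in range(n_inputs):
                row[i] -= lr * err * x[i]

        # Update each context model with the observed base.
        for c, ctx in zip(counts, contexts):
            c.setdefault(ctx, [0, 0, 0, 0])[y] += 1

    n = len(seq)
    return mixed_bits / n, [b / n for b in model_bits]


if __name__ == "__main__":
    mixed, singles = bits_per_base("ACGT" * 200)
    print("mixed:", round(mixed, 3), "models:", [round(s, 3) for s in singles])
```

Because the mixer consumes nothing but per-model probabilities, the context models in this sketch could be swapped for any other predictors without touching the mixing code, which is the portability property the abstract describes.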



GigaScience ◽  
2020 ◽  
Vol 9 (7) ◽  
Author(s):  
Kirill Kryukov ◽  
Mahoko Takahashi Ueda ◽  
So Nakagawa ◽  
Tadashi Imanishi

Abstract. Background: Nearly all molecular sequence databases currently use gzip for data compression. Ongoing rapid accumulation of stored data calls for a more efficient compression tool. Although numerous compressors exist, both specialized and general-purpose, choosing one of them was difficult because no comprehensive analysis of their comparative advantages for sequence compression was available. Findings: We systematically benchmarked 430 settings of 48 compressors (including 29 specialized sequence compressors and 19 general-purpose compressors) on representative FASTA-formatted datasets of DNA, RNA, and protein sequences. Each compressor was evaluated on 17 performance measures, including compression strength, as well as time and memory required for compression and decompression. We used 27 test datasets including individual genomes of various sizes, DNA and RNA datasets, and standard protein datasets. We summarized the results as the Sequence Compression Benchmark database (SCB database, http://kirr.dyndns.org/sequence-compression-benchmark/), which allows custom visualizations to be built for selected subsets of benchmark results. Conclusion: We found that modern compressors offer a large improvement in compactness and speed compared to gzip. Our benchmark allows compressors and their settings to be compared using a variety of performance measures, offering the opportunity to select the optimal compressor on the basis of the data type and usage scenario specific to a particular application.
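A small-scale version of this kind of comparison can be sketched with Python's standard library alone. The sketch below is not the authors' benchmark: it assumes a synthetic, uniformly random FASTA record and compares only the three general-purpose compressors shipped with Python (`gzip`, `bz2`, `lzma`) on compression ratio and timing. The helper names `make_fasta` and `benchmark` are hypothetical.

```python
import bz2
import gzip
import lzma
import random
import time


def make_fasta(n_bases, seed=0, width=60):
    """Build a synthetic FASTA record (assumption: uniform random bases)."""
    rng = random.Random(seed)
    seq = "".join(rng.choice("ACGT") for _ in range(n_bases))
    lines = [seq[i:i + width] for i in range(0, n_bases, width)]
    return (">synthetic_sequence\n" + "\n".join(lines) + "\n").encode("ascii")


# General-purpose compressors from the standard library, default settings.
COMPRESSORS = {
    "gzip": (gzip.compress, gzip.decompress),
    "bz2": (bz2.compress, bz2.decompress),
    "lzma": (lzma.compress, lzma.decompress),
}


def benchmark(data):
    """Measure ratio and wall-clock time for each compressor on `data`."""
    rows = []
    for name, (compress, decompress) in COMPRESSORS.items():
        t0 = time.perf_counter()
        packed = compress(data)
        t1 = time.perf_counter()
        restored = decompress(packed)
        t2 = time.perf_counter()
        rows.append({
            "name": name,
            "ratio": len(data) / len(packed),
            "compress_s": t1 - t0,
            "decompress_s": t2 - t1,
            "lossless": restored == data,
        })
    return rows


if __name__ == "__main__":
    for row in benchmark(make_fasta(200_000)):
        print(row)
```

Random bases are close to a worst case for any compressor, so real genomes, which contain repeats, would likely show larger differences between tools than this synthetic input does.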



2019 ◽  
Vol 13 (6) ◽  
pp. 569-580
Author(s):  
Bonnie Ngai‐Fong Law


2019 ◽  
Vol 26 (7) ◽  
pp. 2159-2172
Author(s):  
Syed Mahamud Hossein ◽  
Debashis De ◽  
Pradeep Kumar Das Mohapatra


2019 ◽  
Author(s):  
Kirill Kryukov ◽  
Mahoko Takahashi Ueda ◽  
So Nakagawa ◽  
Tadashi Imanishi

Abstract. Background: Nearly all molecular sequence databases currently use gzip for data compression. Ongoing rapid accumulation of stored data calls for a more efficient compression tool. Although numerous compressors exist, both specialized and general-purpose, choosing one of them was difficult because no comprehensive analysis of their comparative advantages for sequence compression was available. Findings: We systematically benchmarked 410 settings of 44 compressors (including 26 specialized sequence compressors and 18 general-purpose compressors) on representative FASTA-formatted datasets of DNA, RNA and protein sequences. Each compressor was evaluated on 17 performance measures, including compression strength, as well as time and memory required for compression and decompression. We used 25 test datasets including individual genomes of various sizes, DNA and RNA datasets, and standard protein datasets. We summarized the results as the Sequence Compression Benchmark database (SCB database, http://kirr.dyndns.org/sequence-compression-benchmark/), which allows building custom visualizations for selected subsets of benchmark results. Conclusion: We found that modern compressors offer a large improvement in compactness and speed compared to gzip. Our benchmark allows comparing compressors and their settings using a variety of performance measures, offering the opportunity to select the optimal compressor based on the data type and usage scenario specific to a particular application.



2019 ◽  
Author(s):  
Pavithraa Seenivasan ◽  
Rishikesh Narayanan

Abstract. Hippocampal place cells encode space through phase precession, whereby neuronal spike phase progressively advances during place-field traversals. What neural constraints are essential for achieving efficient transfer of information through such phase codes, while concomitantly maintaining signature neuronal excitability specific to individual cell types? We developed a conductance-based model for phase precession in CA1 pyramidal neurons within the temporal sequence compression framework, and defined phase-coding efficiency using information theory. We recruited an unbiased stochastic search strategy to build a model population that exhibited physiologically observed heterogeneities in intrinsic properties. Place-field responses elicited from these models matched signature sub- and supra-threshold place-cell characteristics, including phase precession, sub-threshold voltage ramps, increases in theta-frequency power and firing rate during place-field traversals. Employing this model population, we show that disparate parametric combinations with weak pair-wise correlations resulted in models with similar high-efficiency phase codes and similar excitability characteristics. Mechanistically, the emergence of such parametric degeneracy was dependent on the differential and variable impact of individual ion channels on phase-coding efficiency in different models, and importantly, on synergistic interactions between synaptic and intrinsic properties. Furthermore, our analyses predicted a dominant role for calcium-activated potassium channels in regulating phase precession and coding efficiency. Finally, change in afferent statistics, manifesting as input asymmetry, induced an adaptive shift in the phase code that preserved its efficiency, apart from introducing asymmetry in sub-threshold ramps and firing profiles during place-field traversals.
Our study postulates degeneracy as a potential framework to attain the twin goals of efficient temporal coding and robust homeostasis. Significance Statement. Neuronal intrinsic properties exhibit significant baseline heterogeneities, and change with activity-dependent plasticity and neuromodulation. How do hippocampal neurons encode spatial locations through the precise timings of their action potentials in the face of such heterogeneities? Here, employing a unifying synthesis of the temporal sequence compression, efficient coding and degeneracy frameworks, we show that there are several disparate routes for neurons to achieve high-efficiency spatial information transfer through such temporal codes. These disparate routes were consequent to the ability of neurons to produce precise encoding through distinct structural components, critically involving synergistic interactions between intrinsic and synaptic properties. Our results point to an explosion in the degrees of freedom available to a neuron in concomitantly achieving efficient coding and excitability homeostasis.


