Nucleotide Archival Format (NAF) enables efficient lossless reference-free compression of DNA sequences

bioRxiv ◽

10.1101/501130 ◽

2018 ◽

Author(s):

Kirill Kryukov ◽

Mahoko Takahashi Ueda ◽

So Nakagawa ◽

Tadashi Imanishi

Keyword(s):

Open Source ◽

Dna Sequence ◽

Compression Ratio ◽

Dna Sequences ◽

Public Domain ◽

General Purpose ◽

Nucleotide Sequences ◽

File Format ◽

Storage Space ◽

Network Transmission

Abstract. Summary: DNA sequence databases use compression such as gzip to reduce the required storage space and network transmission time. We describe Nucleotide Archival Format (NAF), a new file format for lossless reference-free compression of FASTA and FASTQ-formatted nucleotide sequences. The NAF compression ratio is comparable to that of the best DNA compressors, while decompression is dramatically faster. We compared our format with the DNA compressors DELIMINATE and MFCompress, and with the general-purpose compressors gzip, bzip2, xz, brotli, and zstd. Availability: The NAF compressor and decompressor, as well as the format specification, are available at https://github.com/KirillKryukov/naf. The format specification is in the public domain. The compressor and decompressor are open source under the zlib/libpng license, free for nearly any use.


Nucleotide Archival Format (NAF) enables efficient lossless reference-free compression of DNA sequences

Bioinformatics ◽

10.1093/bioinformatics/btz144 ◽

2019 ◽

Vol 35 (19) ◽

pp. 3826-3828 ◽

Cited By ~ 9

Author(s):

Kirill Kryukov ◽

Mahoko Takahashi Ueda ◽

So Nakagawa ◽

Tadashi Imanishi

Keyword(s):

Open Source ◽

Dna Sequence ◽

Compression Ratio ◽

Dna Sequences ◽

General Purpose ◽

Supplementary Information ◽

File Format ◽

Storage Space ◽

Supplementary Data ◽

Network Transmission

Abstract. Summary: DNA sequence databases use compression such as gzip to reduce the required storage space and network transmission time. We describe Nucleotide Archival Format (NAF), a new file format for lossless reference-free compression of FASTA and FASTQ-formatted nucleotide sequences. The NAF compression ratio is comparable to that of the best DNA compressors, while decompression is dramatically faster. We compared our format with the DNA compressors DELIMINATE and MFCompress, and with the general-purpose compressors gzip, bzip2, xz, brotli and zstd. Availability and implementation: The NAF compressor and decompressor, as well as the format specification, are available at https://github.com/KirillKryukov/naf. The format specification is in the public domain. The compressor and decompressor are open source under the zlib/libpng license, free for nearly any use. Supplementary information: Supplementary data are available at Bioinformatics online.
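
The general-purpose baseline mentioned in this abstract can be illustrated with a minimal sketch that uses only the Python standard library. It does not implement NAF or call the NAF tools. The helper names `make_fasta` and `compressed_sizes` and the synthetic random record are assumptions made for illustration, and only gzip, bzip2 and xz are covered because brotli and zstd are not part of the standard library.

```python
import bz2
import gzip
import lzma
import random


def make_fasta(n_bases: int, seed: int = 0, width: int = 60) -> bytes:
    """Build a synthetic single-record FASTA file (hypothetical test data)."""
    rng = random.Random(seed)
    seq = "".join(rng.choice("ACGT") for _ in range(n_bases))
    # Wrap the sequence at a fixed line width, as FASTA files usually are.
    lines = [seq[i:i + width] for i in range(0, len(seq), width)]
    return (">synthetic_record\n" + "\n".join(lines) + "\n").encode("ascii")


def compressed_sizes(data: bytes) -> dict:
    """Return the size in bytes of the input and of each compressed form."""
    return {
        "raw": len(data),
        "gzip": len(gzip.compress(data)),
        "bzip2": len(bz2.compress(data)),
        "xz": len(lzma.compress(data)),  # lzma defaults to the .xz container
    }


fasta = make_fasta(100_000)
sizes = compressed_sizes(fasta)
for name, size in sizes.items():
    print(f"{name:>5}: {size} bytes")
```

Random sequence is close to a worst case for these compressors, so real genomes, which contain repeats, will usually give different numbers. The sketch only shows how such a size comparison can be set up.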


Nucleotide Epi-Chains and New Nucleotide Probability Rules in Long DNA Sequences

10.20944/preprints201904.0011.v1 ◽

2019 ◽

Author(s):

Sergey V. Petoukhov

Keyword(s):

Quantum Mechanics ◽

Dna Sequence ◽

Dna Sequences ◽

Nucleotide Sequences ◽

Quantum Biology ◽

Biological Meaning ◽

Prokaryotic Genomes

One of the creators of quantum mechanics, P. Jordan, claimed in his work on quantum biology that life's missing laws were the rules of chance and probability of the quantum world. This article presents the author's results from studying the probabilities of nucleotides on so-called epi-chains of long DNA sequences from various eukaryotic and prokaryotic genomes. DNA epi-chains are algorithmically constructed subsequences of DNA nucleotide sequences. An epi-chain of order n is a nucleotide subsequence in which the position numbers of adjacent nucleotides differ by n (n = 2, 3, 4, …). Correspondingly, each epi-chain of order n contains n times fewer nucleotides than the original DNA sequence. The presented results unexpectedly show that nucleotide probabilities on such DNA epi-chains of different orders are practically identical to the nucleotide probabilities in the original long DNA sequence. These data allow DNA to be considered as a regular, rich set of epi-chains, which the author believes can play a certain role in genetic and epigenetic phenomena. Rules of nucleotide probabilities on epi-chains of long DNA sequences are formulated for further testing on a wider set of genomes. These phenomenological data and their possible biological meaning are discussed.
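
The epi-chain construction described in this abstract can be sketched in a few lines of Python. The function names `epi_chain` and `nucleotide_probabilities`, the `start` offset parameter, and the toy input are illustrative assumptions, not the author's code.

```python
from collections import Counter


def epi_chain(seq: str, n: int, start: int = 0) -> str:
    """Return the order-n epi-chain: every n-th nucleotide, beginning at `start`."""
    if n < 1:
        raise ValueError("order n must be a positive integer")
    return seq[start::n]


def nucleotide_probabilities(seq: str) -> dict:
    """Return the relative frequency of each of A, C, G, T in `seq`."""
    if not seq:
        raise ValueError("sequence must not be empty")
    counts = Counter(seq)
    return {base: counts[base] / len(seq) for base in "ACGT"}


# Toy input (hypothetical); the abstract concerns long genomic sequences.
dna = "ACGTACGTAACCGGTT" * 4
print("original:", nucleotide_probabilities(dna))
for order in (2, 3, 4):
    chain = epi_chain(dna, order)
    print(f"order {order}:", nucleotide_probabilities(chain))
```

For each order n there are n such chains, one per starting offset from 0 to n - 1. A short periodic toy string like the one above need not reproduce the reported near-equality of probabilities, which is stated for long genomic sequences.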


Genomic Sequence Data Compression using Lempel-Ziv-Welch Algorithm with Indexed Multiple Dictionary

International Journal of Engineering and Advanced Technology - Regular Issue ◽

10.35940/ijeat.b3278.129219 ◽

2019 ◽

Vol 9 (2) ◽

pp. 541-547

Keyword(s):

Dna Sequence ◽

High Throughput ◽

Compression Ratio ◽

Genomic Sequence ◽

Sequence Data ◽

Genomic Data ◽

General Purpose ◽

Huge Amount ◽

Compression Time ◽

Average Size

With advances in technology and the development of High Throughput System (HTS), the amount of genomic data generated per day per laboratory across the globe is outpacing Moore's law. The huge amount of data generated concerns biologists with respect to both storage and transmission between locations for further analysis. Compressing genomic data is a sensible way to overcome the problems arising from this data deluge. This paper discusses various existing algorithms for compressing genomic data, as well as a few general-purpose algorithms, and proposes an LZW-based compression algorithm that uses indexed multiple dictionaries. The proposed method achieves an average compression ratio of 0.41 bits per base and an average compression time of 6.45 s for DNA sequences with an average size of 105.9 KB.
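
As background for the LZW-based approach named in this abstract, the following is a minimal sketch of plain, single-dictionary LZW over the alphabet A, C, G, T. It is not the indexed multiple-dictionary variant the paper proposes, whose details the abstract does not give. The function names and the alphabet default are assumptions for illustration.

```python
from typing import List


def lzw_compress(text: str, alphabet: str = "ACGT") -> List[int]:
    """Plain LZW: emit dictionary indices for the longest known prefixes.

    Every character of `text` must belong to `alphabet`.
    """
    dictionary = {ch: i for i, ch in enumerate(alphabet)}
    current = ""
    codes = []
    for ch in text:
        candidate = current + ch
        if candidate in dictionary:
            current = candidate
        else:
            codes.append(dictionary[current])
            dictionary[candidate] = len(dictionary)  # learn the new phrase
            current = ch
    if current:
        codes.append(dictionary[current])
    return codes


def lzw_decompress(codes: List[int], alphabet: str = "ACGT") -> str:
    """Rebuild the text by regrowing the same dictionary while decoding."""
    if not codes:
        return ""
    dictionary = {i: ch for i, ch in enumerate(alphabet)}
    previous = dictionary[codes[0]]
    out = [previous]
    for code in codes[1:]:
        if code in dictionary:
            entry = dictionary[code]
        elif code == len(dictionary):
            # The code refers to the phrase that is being defined right now.
            entry = previous + previous[0]
        else:
            raise ValueError("invalid LZW code")
        out.append(entry)
        dictionary[len(dictionary)] = previous + entry[0]
        previous = entry
    return "".join(out)


sample = "ACGTACGTACGTAAAA"
encoded = lzw_compress(sample)
print(encoded, lzw_decompress(encoded) == sample)
```

The decoder needs no transmitted dictionary because it reconstructs the same phrases in the same order as the encoder.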


Bit Reduction based Compression Algorithm for DNA Sequences

International Journal of Scientific Research in Science Engineering and Technology ◽

10.32628/ijsrset218529 ◽

2021 ◽

pp. 270-277

Author(s):

Rosario Gilmary ◽

Murugesan G

Keyword(s):

Dna Sequence ◽

Data Storage ◽

Compression Ratio ◽

Dna Sequences ◽

Genomic Data ◽

Living Organism ◽

Biological Sequence ◽

Compression Algorithms ◽

Living Organisms ◽

Abundant Data

Deoxyribonucleic acid (DNA) is the smallest fundamental unit that bears the genetic instructions of a living organism. It is used in the development and functioning of all known living organisms. Current DNA sequencing equipment produces very large volumes of genomic data. Nucleotide databases such as GenBank grow two- to threefold annually, and the increase in genomic data outstrips the increase in storage capacity. Massive amounts of genomic data need efficient storage, fast transfer and good performance. Compression algorithms are used to reduce the volume of stored data and the cost of data storage. Typical compression approaches perform poorly when compressing these sequences, so novel compression algorithms have been introduced to achieve a better compression ratio. Performance is evaluated in terms of compression ratio, which reflects the size of the compressed file, and compression/decompression time, the time taken to compress or decompress the sequence. In the proposed work, the input DNA sequence is compressed by reconstructing the sequence into varied formats. The input DNA sequence is subjected to bit reduction, and the binary output is converted to hexadecimal format and then encoded. Thus, the compression ratio of the biological sequence is improved.
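
The bit-reduction and hexadecimal steps mentioned in this abstract can be sketched as follows. The abstract does not give the actual mapping or the final encoding stage, so the 2-bit code table, the zero padding of the last hex digit, and the function names below are assumptions made for illustration only.

```python
# Assumed 2-bit code per base; the paper's actual table is not stated.
CODE = {"A": 0b00, "C": 0b01, "G": 0b10, "T": 0b11}
BASES = "ACGT"


def pack_to_hex(seq: str) -> str:
    """Pack each base into 2 bits and write the bit string as hex digits."""
    value = 0
    for base in seq:
        value = (value << 2) | CODE[base]
    n_bits = 2 * len(seq)
    n_hex = (n_bits + 3) // 4          # hex digits needed, rounded up
    if n_hex == 0:
        return ""
    value <<= 4 * n_hex - n_bits       # zero-pad the final hex digit
    return format(value, f"0{n_hex}x")


def unpack_from_hex(hex_text: str, n_bases: int) -> str:
    """Invert pack_to_hex; the base count is needed to drop the padding."""
    if n_bases == 0:
        return ""
    value = int(hex_text, 16) >> (4 * len(hex_text) - 2 * n_bases)
    bases = []
    for _ in range(n_bases):
        bases.append(BASES[value & 0b11])
        value >>= 2
    return "".join(reversed(bases))


print(pack_to_hex("ACGT"))  # 1b
print(unpack_from_hex(pack_to_hex("ACGTTGCA"), 8))
```

Under this assumed mapping, four bases fit into one byte instead of four, which is where the size reduction in the bit-reduction step comes from.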


DNA sequence mapping in interphase and metaphase chromosomes by fluorescence in situ hybridization

Proceedings, annual meeting, Electron Microscopy Society of America ◽

10.1017/s0424820100122885 ◽

1992 ◽

Vol 50 (1) ◽

pp. 496-497

Author(s):

Barbara Trask ◽

Susan Allen ◽

Anne Bergmann ◽

Mari Christensen ◽

Anne Fertitta ◽

...

Keyword(s):

In Situ Hybridization ◽

Dna Sequence ◽

Dna Sequences ◽

Dual Band ◽

Nick Translation ◽

Metaphase Chromosomes ◽

Band Pass ◽

Texas Red ◽

Fluorescent Spot

Using fluorescence in situ hybridization (FISH), the positions of DNA sequences can be discretely marked with a fluorescent spot. The efficiency of marking DNA sequences of the size cloned in cosmids is 90-95%, and the fluorescent spots produced after FISH are ≈0.3 μm in diameter. Sites of two sequences can be distinguished using two-color FISH. Different reporter molecules, such as biotin or digoxigenin, are incorporated into DNA sequence probes by nick translation. These reporter molecules are labeled after hybridization with different fluorochromes, e.g., FITC and Texas Red. The development of dual band pass filters (Chromatechnology) allows these fluorochromes to be photographed simultaneously without registration shift.


Unzipping Natural Products: Improved Natural Product Structure Predictions by Ensemble Modeling and Fingerprint Matching

10.26434/chemrxiv.6863864.v1 ◽

2018 ◽

Author(s):

William A. Shirley ◽

Brian P. Kelley ◽

Yohann Potier ◽

John H. Koschwanez ◽

Robert Bruccoleri ◽

...

Keyword(s):

Natural Products ◽

Open Source ◽

Natural Product ◽

Gene Cluster ◽

Public Domain ◽

Product Structure ◽

Fingerprint Matching ◽

Ensemble Modeling ◽

Chemical Structures ◽

Source Gene

This pre-print explores ensemble modeling of natural product targets to match chemical structures to precursors found in the large open-source gene cluster repository antiSMASH. Commentary on the method, its effectiveness, and its limitations is included. All structures are public domain molecules and have been reviewed for release.


Design and Implementation of the Elias Gamma Code Algorithm for Compressing Database Records in the Complete General Knowledge Summary (Rangkuman Pengetahuan Umum Lengkap) Application

KOMIK (Konferensi Nasional Teknologi Informasi dan Komputer) ◽

10.30865/komik.v3i1.1600 ◽

2019 ◽

Vol 3 (1) ◽

Author(s):

Hikka Sartika ◽

Taronisokhi Zebua

Keyword(s):

Compression Ratio ◽

General Public ◽

Storage Capacity ◽

Storage Space ◽

File Size ◽

Compression Technique ◽

Code Algorithm ◽

Database File ◽

Large File

The storage space required by an application is a common problem on smartphones, because not all smartphones have a large storage capacity. One application with a large file size is the RPUL application, which is widely used by students and the general public. Its large file size often prevents the application from running effectively on smartphones. One solution is to compress the application file so that it needs much less storage space on the smartphone. This study describes the application of the Elias gamma code algorithm, a compression technique, to compress the RPUL application database file, so that the RPUL application can run effectively on a smartphone after it is installed. Based on trials conducted on 64 bits of text as samples, compression based on the Elias gamma code algorithm compressed text from a database file with a ratio of compression of 2 bits, a compression ratio of 50%, and a redundancy of 50%.

Keywords: Compression, RPUL, Smartphone, Elias Gamma Code
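
The Elias gamma code named in this abstract is a standard prefix code for positive integers: a number whose binary form has k bits is written as k - 1 zeros followed by those k bits. The sketch below shows only that integer code. How the paper maps database text to integers is not stated in the abstract, and the function names are illustrative assumptions.

```python
def elias_gamma_encode(n: int) -> str:
    """Encode a positive integer as an Elias gamma bit string."""
    if n < 1:
        raise ValueError("Elias gamma is defined for integers >= 1")
    binary = bin(n)[2:]                      # binary digits without '0b'
    return "0" * (len(binary) - 1) + binary  # unary length prefix + value


def elias_gamma_decode(bits: str) -> list:
    """Decode a concatenation of Elias gamma codewords into integers."""
    values = []
    i = 0
    while i < len(bits):
        zeros = 0
        while i < len(bits) and bits[i] == "0":
            zeros += 1
            i += 1
        chunk = bits[i:i + zeros + 1]        # the value has zeros + 1 bits
        if len(chunk) != zeros + 1:
            raise ValueError("truncated Elias gamma codeword")
        values.append(int(chunk, 2))
        i += zeros + 1
    return values


stream = "".join(elias_gamma_encode(n) for n in (1, 2, 5))
print(stream, elias_gamma_decode(stream))
```

Small integers get the shortest codewords, so the code saves space when frequent symbols are assigned small numbers.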


Implementation of the Levenstein Algorithm for Compressing Text in the Complete Natural Knowledge Summary (Rangkuman Pengetahuan Alam Lengkap) Application

KOMIK (Konferensi Nasional Teknologi Informasi dan Komputer) ◽

10.30865/komik.v3i1.1594 ◽

2019 ◽

Vol 3 (1) ◽

Author(s):

Winda Winda ◽

Taronisokhi Zebua

Keyword(s):

Compression Ratio ◽

Mobile Application ◽

Storage Space ◽

Compression Technique ◽

The Public ◽

Natural Knowledge ◽

Database Size ◽

Text Database

The size of an application's data strongly affects how much memory the application needs, and mobile applications are one example. One mobile application currently widely used by students and the public is the Complete Natural Knowledge Summary (Rangkuman Pengetahuan Alam Lengkap, or RPAL) application. Once installed, the RPAL application requires a large amount of storage space in mobile memory for its material, which can make the application slow. Data compression can reduce the size of the data and thus minimize the space needed in memory. The Levenstein algorithm is a compression technique that can be used to compress the material stored in the RPAL application database so that the database is smaller. This study describes how to compress the RPAL application database records to minimize the space needed in memory. In tests conducted on 128 characters of data (200 bits), the compressed result was 136 bits (17 characters), with a compression ratio of 68% and a redundancy of 32%.

Keywords: compression, Levenstein, application, RPAL, text, database, mobile
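
The reported percentages follow from the bit counts in this abstract if the compression ratio is taken as compressed size over original size and redundancy as its complement. That reading is an assumption that matches the stated figures, and the function name `compression_stats` is hypothetical.

```python
def compression_stats(original_bits: int, compressed_bits: int) -> tuple:
    """Return (compression ratio %, redundancy %) for the given bit counts."""
    ratio_percent = 100 * compressed_bits / original_bits
    redundancy_percent = 100 - ratio_percent
    return ratio_percent, redundancy_percent


ratio, redundancy = compression_stats(original_bits=200, compressed_bits=136)
print(ratio, redundancy)   # 68.0 32.0
print(136 // 8)            # 17 eight-bit characters
```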


Design and Implementation of Low Energy Wireless Network Nodes based on Hardware Compression Acceleration

Recent Patents on Computer Science ◽

10.2174/2213275912666190715164024 ◽

2019 ◽

Vol 12 ◽

Author(s):

Hui Yang ◽

Anand Nayyar

Keyword(s):

Energy Consumption ◽

Data Compression ◽

Energy Saving ◽

Optimization Design ◽

Hardware Acceleration ◽

Transmission Efficiency ◽

General Purpose ◽

Storage Space ◽

General Purpose Processor ◽

Compression Time

With the rapid development of information technology, the volume of data is growing geometrically, placing higher demands on transmission speed and storage space. To reduce storage use and further improve transmission efficiency, data need to be compressed. Because it is essential that compression preserve the data exactly, lossless data compression algorithms are used. Gradual optimization of the algorithm can often yield energy savings in data compression. Similarly, energy savings can be obtained by improving the hardware structure of the node. In this paper, a new structure is designed for the sensor node, which adopts hardware acceleration and separates the data compression module from the node microprocessor. Based on an ASIC design of the algorithm and the introduction of hardware acceleration, the energy consumption of data compression was successfully reduced: the savings in energy consumption and compression time relative to the general-purpose processor were as high as 98.4% and 95.8%, respectively.


DNA thermodynamics shape chromosome organization and topology

Biochemical Society Transactions ◽

10.1042/bst20120334 ◽

2013 ◽

Vol 41 (2) ◽

pp. 548-553 ◽

Cited By ~ 13

Author(s):

Andrew A. Travers ◽

Georgi Muskhelishvili

Keyword(s):

Dna Sequence ◽

Dna Sequences ◽

Chromosome Organization ◽

Topological Properties ◽

Genetic Organization ◽

And Topology ◽

Dna Translocases

How much information is encoded in the DNA sequence of an organism? We argue that the informational, mechanical and topological properties of DNA are interdependent and act together to specify the primary characteristics of genetic organization and chromatin structures. Superhelicity generated in vivo, in part by the action of DNA translocases, can be transmitted to topologically sensitive regions encoded by less stable DNA sequences.
