Nucleotide Archival Format (NAF) enables efficient lossless reference-free compression of DNA sequences

2018 ◽  
Author(s):  
Kirill Kryukov ◽  
Mahoko Takahashi Ueda ◽  
So Nakagawa ◽  
Tadashi Imanishi

Abstract. Summary: DNA sequence databases use compression such as gzip to reduce the required storage space and network transmission time. We describe Nucleotide Archival Format (NAF) – a new file format for lossless reference-free compression of FASTA and FASTQ-formatted nucleotide sequences. The NAF compression ratio is comparable to that of the best DNA compressors, while decompression is dramatically faster. We compared our format with the DNA compressors DELIMINATE and MFCompress, and with the general-purpose compressors gzip, bzip2, xz, brotli, and zstd. Availability: The NAF compressor and decompressor, as well as the format specification, are available at https://github.com/KirillKryukov/naf. The format specification is in the public domain. The compressor and decompressor are open source under the zlib/libpng license, free for nearly any use.

2019 ◽  
Vol 35 (19) ◽  
pp. 3826-3828 ◽  
Author(s):  
Kirill Kryukov ◽  
Mahoko Takahashi Ueda ◽  
So Nakagawa ◽  
Tadashi Imanishi

Abstract. Summary: DNA sequence databases use compression such as gzip to reduce the required storage space and network transmission time. We describe Nucleotide Archival Format (NAF), a new file format for lossless reference-free compression of FASTA and FASTQ-formatted nucleotide sequences. The NAF compression ratio is comparable to that of the best DNA compressors, while decompression is dramatically faster. We compared our format with the DNA compressors DELIMINATE and MFCompress, and with the general-purpose compressors gzip, bzip2, xz, brotli and zstd. Availability and implementation: The NAF compressor and decompressor, as well as the format specification, are available at https://github.com/KirillKryukov/naf. The format specification is in the public domain. The compressor and decompressor are open source under the zlib/libpng license, free for nearly any use. Supplementary information: Supplementary data are available at Bioinformatics online.


Author(s):  
Sergey V. Petoukhov

One of the creators of quantum mechanics, P. Jordan, claimed in his work on quantum biology that life's missing laws were the rules of chance and probability of the quantum world. The article presents the author's results of studying the probabilities of nucleotides on so-called epi-chains of long DNA sequences from various eukaryotic and prokaryotic genomes. DNA epi-chains are algorithmically constructed subsequences of DNA nucleotide sequences. By construction, an epi-chain of order n is a nucleotide subsequence in which the position numbers of adjacent nucleotides differ by n (n = 2, 3, 4, …). Correspondingly, each epi-chain of order n contains n times fewer nucleotides than the original DNA sequence. The presented results unexpectedly show that nucleotide probabilities on such DNA epi-chains of different orders are practically identical to nucleotide probabilities in the original long DNA sequence. These data allow DNA to be considered as a regular rich set of epi-chains, which the author believes can play a certain role in genetic and epigenetic phenomena. Appropriate rules of nucleotide probabilities on epi-chains of long DNA sequences are formulated for further testing on a wider set of biological genomes. These phenomenological data and their possible biological meaning are discussed.
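
The epi-chain construction described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: an epi-chain of order n is taken to be every n-th nucleotide from a chosen starting offset, and the function names `epi_chain` and `nucleotide_probabilities` are hypothetical, not taken from the article.

```python
from collections import Counter


def epi_chain(seq, n, start=0):
    """Return the order-n epi-chain: every n-th nucleotide, beginning at index `start`.

    Assumption: positions of adjacent nucleotides in the chain differ by n.
    """
    if n < 1:
        raise ValueError("order n must be a positive integer")
    if not 0 <= start < n:
        raise ValueError("start offset must satisfy 0 <= start < n")
    return seq[start::n]


def nucleotide_probabilities(seq):
    """Return the relative frequency of A, C, G and T in `seq`."""
    if not seq:
        raise ValueError("sequence must not be empty")
    counts = Counter(seq)
    total = len(seq)
    return {base: counts.get(base, 0) / total for base in "ACGT"}


if __name__ == "__main__":
    # Toy sequence for illustration only; the article studies long genomic sequences.
    dna = "ACGTACGTAC"
    chain = epi_chain(dna, 2)
    print(chain)  # AGAGA
    print(nucleotide_probabilities(dna))
    print(nucleotide_probabilities(chain))
```

On a short toy sequence like this one the two sets of frequencies differ noticeably; the near-identity reported in the abstract concerns long genomic sequences.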


With advances in technology and the development of High Throughput System (HTS), the amount of genomic data generated per day per laboratory across the globe is outpacing Moore's law. The huge amount of data generated concerns biologists with respect to its storage as well as its transmission across different locations for further analysis. Compressing genomic data is a sensible way to overcome the problems arising from the data deluge. This paper discusses various existing algorithms for compression of genomic data as well as a few general-purpose algorithms, and proposes an LZW-based compression algorithm that uses indexed multiple dictionaries. The proposed method exhibits an average compression ratio of 0.41 bits per base and an average compression time of 6.45 seconds for a DNA sequence with an average size of 105.9 KB.
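
For orientation, the sketch below shows textbook LZW with a single dictionary over the alphabet A, C, G, T. It is not the indexed multiple-dictionary variant proposed in the paper, whose details the abstract does not give; the function names and the choice of initial alphabet are assumptions made for illustration.

```python
def lzw_compress(seq, alphabet="ACGT"):
    """Textbook LZW: return a list of integer codes for `seq`."""
    dictionary = {ch: i for i, ch in enumerate(alphabet)}
    next_code = len(dictionary)
    codes = []
    current = ""
    for ch in seq:
        if ch not in dictionary:
            raise ValueError("symbol outside alphabet: %r" % ch)
        candidate = current + ch
        if candidate in dictionary:
            # Keep extending the current phrase while it is already known.
            current = candidate
        else:
            # Emit the longest known phrase and learn the new one.
            codes.append(dictionary[current])
            dictionary[candidate] = next_code
            next_code += 1
            current = ch
    if current:
        codes.append(dictionary[current])
    return codes


def lzw_decompress(codes, alphabet="ACGT"):
    """Invert lzw_compress by rebuilding the same dictionary on the fly."""
    if not codes:
        return ""
    dictionary = {i: ch for i, ch in enumerate(alphabet)}
    next_code = len(dictionary)
    previous = dictionary[codes[0]]
    out = [previous]
    for code in codes[1:]:
        if code in dictionary:
            entry = dictionary[code]
        elif code == next_code:
            # Special case: the code refers to the phrase being defined right now.
            entry = previous + previous[0]
        else:
            raise ValueError("invalid LZW code: %d" % code)
        out.append(entry)
        dictionary[next_code] = previous + entry[0]
        next_code += 1
        previous = entry
    return "".join(out)


if __name__ == "__main__":
    print(lzw_compress("AAAA"))  # [0, 4, 0]
    print(lzw_decompress(lzw_compress("ACGTACGTACGT")))  # ACGTACGTACGT
```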


Author(s):  
Rosario Gilmary ◽  
Murugesan G

Deoxyribonucleic acid (DNA) is the fundamental unit that carries the genetic instructions of a living organism. It is used in the growth and functioning of all known living organisms. Current DNA sequencing equipment produces vast amounts of genomic data. Nucleotide databases such as GenBank grow 2 to 3 times larger annually. The increase in genomic data outstrips the increase in storage capacity. Massive amounts of genomic data need efficient storage, fast transfer and good performance. Compression algorithms are used to reduce the storage needed for this data and the associated storage expense. Typical compression approaches lose effectiveness when compressing these sequences. However, novel compression algorithms have been introduced for a better compression ratio. Performance is compared in terms of compression ratio (a ratio based on the size of the compressed file) and compression/decompression time (the time taken to compress or decompress the sequence). In the proposed work, the input DNA sequence is compressed by reconstructing the sequence into varied formats. Here the input DNA sequence is subjected to bit reduction. The binary output is converted to hexadecimal format, followed by encoding. Thus, the compression ratio of the biological sequence is improved.
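
The bit-reduction and hexadecimal steps mentioned above might look roughly like the following sketch. The abstract does not specify the mapping or the final encoding stage, so the 2-bit codes (A=00, C=01, G=10, T=11), the zero padding, and the function names are all assumptions made for illustration.

```python
# Hypothetical 2-bit mapping; the paper's actual mapping is not given in the abstract.
BASE_TO_BITS = {"A": "00", "C": "01", "G": "10", "T": "11"}
BITS_TO_BASE = {bits: base for base, bits in BASE_TO_BITS.items()}


def dna_to_hex(seq):
    """Bit-reduce a DNA string to 2 bits per base, then render it as hex digits."""
    bits = "".join(BASE_TO_BITS[base] for base in seq)
    # Pad with zeros so the bit string splits evenly into 4-bit hex digits.
    padded = bits + "0" * (-len(bits) % 4)
    return "".join(
        format(int(padded[i:i + 4], 2), "x") for i in range(0, len(padded), 4)
    )


def hex_to_dna(hex_text, length):
    """Invert dna_to_hex; `length` is the number of bases, needed to drop padding."""
    bits = "".join(format(int(digit, 16), "04b") for digit in hex_text)
    if 2 * length > len(bits):
        raise ValueError("length exceeds the number of encoded bases")
    return "".join(BITS_TO_BASE[bits[i:i + 2]] for i in range(0, 2 * length, 2))


if __name__ == "__main__":
    print(dna_to_hex("ACGT"))      # 1b
    print(hex_to_dna("1b", 4))     # ACGT
```

Each hex digit stands for two bases, so the hex text has half as many characters as the input; storing the packed bits directly would take a quarter of the bytes of one-byte-per-base text.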


Author(s):  
Barbara Trask ◽  
Susan Allen ◽  
Anne Bergmann ◽  
Mari Christensen ◽  
Anne Fertitta ◽  
...  

Using fluorescence in situ hybridization (FISH), the positions of DNA sequences can be discretely marked with a fluorescent spot. The efficiency of marking DNA sequences of the size cloned in cosmids is 90-95%, and the fluorescent spots produced after FISH are ≈0.3 μm in diameter. Sites of two sequences can be distinguished using two-color FISH. Different reporter molecules, such as biotin or digoxigenin, are incorporated into DNA sequence probes by nick translation. These reporter molecules are labeled after hybridization with different fluorochromes, e.g., FITC and Texas Red. The development of dual band pass filters (Chromatechnology) allows these fluorochromes to be photographed simultaneously without registration shift.


2018 ◽  
Author(s):  
William A. Shirley ◽  
Brian P. Kelley ◽  
Yohann Potier ◽  
John H. Koschwanez ◽  
Robert Bruccoleri ◽  
...  

This pre-print explores ensemble modeling of natural product targets to match chemical structures to precursors found in the large open-source gene cluster repository antiSMASH. Commentary on method, effectiveness, and limitations is included. All structures are public domain molecules and have been reviewed for release.


Author(s):  
Hikka Sartika ◽  
Taronisokhi Zebua

Storage space required by applications is a common problem on smartphones, because not all smartphones have a large storage capacity. One application with a large file size is the RPUL application, which is widely accessed by students and the general public. Its large file size often prevents it from running effectively on smartphones. One solution is to compress the application file so that it needs much less storage space on the smartphone. This study describes the use of the Elias gamma code algorithm, a compression technique, to compress the RPUL application database file, so that the RPUL application can run effectively on a smartphone after it is installed. Based on trials conducted on 64 bits of text as samples, compression based on the Elias gamma code algorithm was able to compress text from a database file with a ratio of compression of 2 bits, a compression ratio of 50% and a redundancy of 50%. Keywords: Compression, RPUL, Smartphone, Elias Gamma Code
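
As a reminder of how the Elias gamma code itself works: a positive integer n is written as floor(log2 n) zero bits followed by the binary form of n. The sketch below implements that standard definition; the function names are illustrative, and how the study maps database text to integers is not stated in the abstract.

```python
def elias_gamma_encode(n):
    """Encode a positive integer as an Elias gamma bit string."""
    if n < 1:
        raise ValueError("Elias gamma is defined for positive integers only")
    binary = bin(n)[2:]  # binary digits of n, always starting with '1'
    return "0" * (len(binary) - 1) + binary


def elias_gamma_decode(bits):
    """Decode a concatenation of Elias gamma codewords into a list of integers."""
    values = []
    i = 0
    while i < len(bits):
        # Count leading zeros; they give the number of bits after the leading 1.
        zeros = 0
        while i < len(bits) and bits[i] == "0":
            zeros += 1
            i += 1
        end = i + zeros + 1
        if end > len(bits):
            raise ValueError("truncated Elias gamma codeword")
        values.append(int(bits[i:end], 2))
        i = end
    return values


if __name__ == "__main__":
    for n in (1, 2, 5, 9):
        print(n, elias_gamma_encode(n))
    # 1 1
    # 2 010
    # 5 00101
    # 9 0001001
```

A common way to apply this to text, assumed here rather than taken from the study, is to rank symbols by frequency and encode each symbol's rank, so the most frequent symbol gets the one-bit codeword for 1.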


Author(s):  
Winda Winda ◽  
Taronisokhi Zebua

The size of the data held by an application strongly affects the amount of memory it needs, and mobile applications are one example. One mobile application widely used by students and the public is the Complete Natural Knowledge Summary (Rangkuman Pengetahuan Alam Lengkap, or RPAL) application. The RPAL application requires a large amount of storage space in mobile memory after installation, which can make it ineffective (slow). Data compression can reduce the size of the data and so minimize the space needed in memory. The Levenshtein algorithm is a compression technique that can be used to compress material stored in the RPAL application database so that the database becomes smaller. This study describes how to compress the RPAL application database records to minimize the space needed in memory. Based on tests conducted on 128 characters of data (200 bits), the compression result was 136 bits (17 characters), with a compression ratio of 68% and a redundancy of 32%. Keywords: compression, Levenshtein, application, RPAL, text, database, mobile
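
Assuming the algorithm named above is Levenshtein coding, the universal code for non-negative integers, a minimal sketch of the codeword construction follows. The function names are illustrative, and the way the study maps database characters to integers is not described in the abstract. The reported figures are consistent with compression ratio = compressed bits / original bits = 136 / 200 = 68%, with redundancy as the remaining 32%.

```python
def levenshtein_encode(n):
    """Encode a non-negative integer with Levenshtein's universal code."""
    if n < 0:
        raise ValueError("Levenshtein coding is defined for non-negative integers")
    if n == 0:
        return "0"
    code = ""
    count = 1
    while True:
        tail = bin(n)[3:]  # binary digits of n without the leading '1'
        code = tail + code
        if not tail:
            break
        # Recursively encode the number of bits just written.
        n = len(tail)
        count += 1
    return "1" * count + "0" + code


def levenshtein_decode(bits):
    """Decode a concatenation of Levenshtein codewords into a list of integers."""
    values = []
    i = 0
    while i < len(bits):
        count = 0
        while i < len(bits) and bits[i] == "1":
            count += 1
            i += 1
        if i >= len(bits):
            raise ValueError("truncated codeword: missing terminating 0")
        i += 1  # skip the '0' that ends the unary prefix
        if count == 0:
            values.append(0)
            continue
        n = 1
        for _ in range(count - 1):
            if i + n > len(bits):
                raise ValueError("truncated codeword")
            chunk = bits[i:i + n]
            i += n
            n = int("1" + chunk, 2)  # restore the implicit leading 1
        values.append(n)
    return values


if __name__ == "__main__":
    for n in (0, 1, 2, 4, 8):
        print(n, levenshtein_encode(n))
    # 0 0
    # 1 10
    # 2 1100
    # 4 1110000
    # 8 11101000
```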


Author(s):  
Hui Yang ◽  
Anand Nayyar

With the rapid development of information technology, the volume of data is growing geometrically, placing higher demands on transmission speed and storage space. To reduce the use of storage space and further improve the transmission efficiency of data, data need to be compressed. In data compression it is very important that no data are lost, which is the role of lossless data compression algorithms. Gradually optimizing the design of an algorithm can often make data compression more energy-efficient. Similarly, energy can be saved by improving the hardware structure of the node. In this paper, a new structure is designed for the sensor node, which adopts hardware acceleration and separates the data compression module from the node microprocessor. On the basis of an ASIC design of the algorithm, introducing hardware acceleration successfully reduced the energy consumed in compressing data: the savings in energy consumption and compression time relative to the general-purpose processor were as high as 98.4% and 95.8%, respectively. This greatly reduces compression time and energy consumption.


2013 ◽  
Vol 41 (2) ◽  
pp. 548-553 ◽  
Author(s):  
Andrew A. Travers ◽  
Georgi Muskhelishvili

How much information is encoded in the DNA sequence of an organism? We argue that the informational, mechanical and topological properties of DNA are interdependent and act together to specify the primary characteristics of genetic organization and chromatin structures. Superhelicity generated in vivo, in part by the action of DNA translocases, can be transmitted to topologically sensitive regions encoded by less stable DNA sequences.

