Design and Analysis of an Improved Nucleotide Sequences Compression Algorithm Using Look up Table (LUT)

Author(s):  
Govind Prasad Arya ◽  
R.K. Bharti ◽  
Devendra Prasad

DNA (deoxyribonucleic acid) is the hereditary material in humans and almost all other organisms. Nearly every cell in a person's body has the same DNA. The information in DNA is stored as a code made up of four chemical bases: adenine (A), guanine (G), cytosine (C), and thymine (T). With continuous technological development, sequencing projects generate large amounts of biological data, which makes DNA sequences difficult to store, analyse, and process; compression is therefore needed to reduce their size. In this paper, we propose an efficient and fast DNA sequence compression algorithm based on differential direct coding and a variable look-up table (LUT).
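The abstract does not give the details of the differential direct coding or the variable LUT, but the core look-up-table idea can be illustrated with a minimal sketch: a fixed table maps each base to 2 bits, so 4 bases pack into 1 byte. Function and table names here are hypothetical.

```python
# Minimal sketch of LUT-based nucleotide packing. The paper's actual
# differential direct coding and variable LUT are not specified in the
# abstract; this only shows the basic 2-bits-per-base table idea.

ENCODE_LUT = {'A': 0b00, 'C': 0b01, 'G': 0b10, 'T': 0b11}
DECODE_LUT = {v: k for k, v in ENCODE_LUT.items()}

def pack(seq: str) -> bytes:
    """Pack an ACGT string into bytes, 4 bases per byte (padded with 'A')."""
    out = bytearray()
    for i in range(0, len(seq), 4):
        chunk = seq[i:i + 4].ljust(4, 'A')
        byte = 0
        for base in chunk:
            byte = (byte << 2) | ENCODE_LUT[base]
        out.append(byte)
    return bytes(out)

def unpack(data: bytes, length: int) -> str:
    """Inverse of pack(); 'length' recovers the unpadded sequence size."""
    bases = []
    for byte in data:
        for shift in (6, 4, 2, 0):
            bases.append(DECODE_LUT[(byte >> shift) & 0b11])
    return ''.join(bases)[:length]

seq = "ACGTACGTTG"
assert unpack(pack(seq), len(seq)) == seq  # 10 bases -> 3 bytes
```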

2016 ◽  
Vol 2016 ◽  
pp. 1-7 ◽  
Author(s):  
Pamela Vinitha Eric ◽  
Gopakumar Gopalakrishnan ◽  
Muralikrishnan Karunakaran

This paper proposes a seed-based lossless compression algorithm for DNA sequences that uses a substitution method similar to the Lempel-Ziv compression scheme. The proposed method exploits the repeat structures inherent in DNA sequences by creating an offline dictionary that contains all such repeats along with the details of mismatches. By ensuring that only promising mismatches are allowed, the method achieves a compression ratio on par with or better than existing lossless DNA sequence compression algorithms.
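The offline repeat dictionary and mismatch handling are not specified in the abstract; the sketch below only illustrates the underlying Lempel-Ziv substitution idea the paper builds on, replacing repeats with (offset, length) references to earlier text. Names and the MIN_MATCH threshold are hypothetical.

```python
# Toy LZ77-style substitution pass over a DNA string: repeats become
# back-references, everything else stays a literal.

MIN_MATCH = 4  # shortest repeat worth replacing with a back-reference

def lz_tokens(seq: str):
    """Yield literal characters or (offset, length) copies from earlier text."""
    i = 0
    while i < len(seq):
        best_len, best_off = 0, 0
        for j in range(i):  # search the already-emitted prefix
            k = 0
            while i + k < len(seq) and seq[j + k] == seq[i + k]:
                k += 1
            if k > best_len:
                best_len, best_off = k, i - j
        if best_len >= MIN_MATCH:
            yield (best_off, best_len)
            i += best_len
        else:
            yield seq[i]
            i += 1

def lz_decode(tokens) -> str:
    """Rebuild the original sequence, proving the pass is lossless."""
    out = []
    for t in tokens:
        if isinstance(t, tuple):
            off, length = t
            for _ in range(length):
                out.append(out[-off])
        else:
            out.append(t)
    return ''.join(out)

seq = "ACGTACGTACGTTT"
tokens = list(lz_tokens(seq))
print(tokens)  # ['A', 'C', 'G', 'T', (4, 8), 'T', 'T']
assert lz_decode(tokens) == seq
```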


2015 ◽  
Vol 5 (4) ◽  
pp. 73-85 ◽  
Author(s):  
Subhankar Roy ◽  
Akash Bhagot ◽  
Kumari Annapurna Sharma ◽  
Sunirmal Khatua

2019 ◽  
Vol 48 (4) ◽  
pp. 1099-1106
Author(s):  
Emre Sevindik ◽  
Zehra Tuğba Murathan ◽  
Sümeyye Filiz ◽  
Kübra Yalçin

Genetic diversity among Turkish apple genotypes in Ardahan province was investigated based on cpDNA trnL-F sequences. The apple genotypes were placed on a phylogenetic tree in which Pyrus x bretschneideri was used as the outgroup. Plant samples were collected from different locations, and genomic DNA was isolated from healthy green leaves. The trnL-F region was sequenced using the trnLe and trnFf primers. The obtained DNA sequences were edited using BioEdit and FinchTV, and the sequencing data were analyzed using MEGA 6.0 software. Neighbor-joining and bootstrap trees were constructed to verify the relationships among the apple genotypes. The phylogenetic tree consisted of two clades. The divergence values of the trnL-F sequences ranged between 0.000 and 0.005. The average nucleotide composition was 38.3% T, 14.9% C, 31.9% A, and 14.9% G. The phylogenetic tree constructed from the trnL-F sequences was broadly consistent with prior phylogenetic studies on apple genotypes.
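The abstract does not state which distance model MEGA was run with; as a hedged illustration, the sketch below computes the simple p-distance (a common choice for such small divergences) and the average nucleotide composition reported above.

```python
# Illustrative sketch of two quantities reported in the abstract: pairwise
# divergence (here the simple p-distance; the exact MEGA model used is not
# stated) and average nucleotide composition across a set of sequences.

def p_distance(a: str, b: str) -> float:
    """Proportion of differing sites between two aligned, equal-length sequences."""
    assert len(a) == len(b)
    diffs = sum(1 for x, y in zip(a, b) if x != y)
    return diffs / len(a)

def composition(seqs):
    """Percentage of each base across a collection of sequences."""
    counts = {'A': 0, 'C': 0, 'G': 0, 'T': 0}
    total = 0
    for s in seqs:
        for base in s:
            if base in counts:
                counts[base] += 1
                total += 1
    return {b: 100 * n / total for b, n in counts.items()}

print(p_distance("ACGTACGT", "ACGTACGA"))  # 0.125
print(composition(["ACGT", "AATT"]))       # {'A': 37.5, 'C': 12.5, ...}
```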


2019 ◽  
Author(s):  
J.M. Lázaro-Guevara ◽  
K.M. Garrido

Abstract: In underdeveloped countries like Guatemala, where access to high-speed internet connections is limited, downloading and sharing biological datasets of thousands of megabits is a major obstacle to establishing and developing bioinformatics. There is therefore an urgent need for a better way to share this biological data, and this is where compression algorithms become relevant. With this in mind, we developed a new algorithm based on redundancy and approximate selection.

Methods: The method uses the probabilities given by the transition matrix of three-base tuples together with their relative frequencies. Relative and total frequencies are calculated using the permutation formula (n^r), and 6 bases of information are compressed into 1 character using the ASCII code table (0…255 characters, 2^8), with clusters of 102 DNA bases compacted into 17-character strings. For decompression, the inverse process is applied, except that the triplets must be selected randomly (or via a matrix dictionary of size 4^102).

Conclusion: The compression algorithm achieves a better compression ratio than LZW and Huffman coding. However, the time needed for decompression makes the algorithm unsuitable for massive data. Its potential use as an MD5sum-like checksum needs more research, but it is a promising tool for DNA verification.
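The abstract leaves the packing details open; as a hedged illustration of the statistics it names, the sketch below computes the relative frequencies of three-base tuples and the transition matrix between consecutive triplets. It does not reproduce the 102-bases-to-17-characters packing.

```python
# Hedged sketch of the statistics the abstract mentions: relative frequencies
# of three-base tuples and a transition matrix between consecutive triplets.

from collections import Counter, defaultdict

def triplet_stats(seq: str):
    """Relative frequencies of non-overlapping triplets and their transitions."""
    triplets = [seq[i:i + 3] for i in range(0, len(seq) - 2, 3)]
    freq = Counter(triplets)
    total = sum(freq.values())
    rel_freq = {t: n / total for t, n in freq.items()}

    transitions = defaultdict(Counter)
    for prev, nxt in zip(triplets, triplets[1:]):
        transitions[prev][nxt] += 1
    # normalize each row to conditional probabilities P(next | prev)
    matrix = {prev: {nxt: c / sum(row.values()) for nxt, c in row.items()}
              for prev, row in transitions.items()}
    return rel_freq, matrix

rel, mat = triplet_stats("ACGACGTTTACG")
print(rel)  # {'ACG': 0.75, 'TTT': 0.25}
print(mat)  # {'ACG': {'ACG': 0.5, 'TTT': 0.5}, 'TTT': {'ACG': 1.0}}
```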


GigaScience ◽  
2020 ◽  
Vol 9 (11) ◽  
Author(s):  
Milton Silva ◽  
Diogo Pratas ◽  
Armando J Pinho

Abstract

Background: The increasing production of genomic data has led to an intensified need for models that can cope efficiently with the lossless compression of DNA sequences. Important applications include long-term storage and compression-based data analysis. In the literature, only a few recent articles propose the use of neural networks for DNA sequence compression; however, they fall short when compared with specific DNA compression tools, such as GeCo2. This limitation is due to the absence of models specifically designed for DNA sequences. In this work, we combine the power of neural networks with specific DNA models. For this purpose, we created GeCo3, a new genomic sequence compressor that uses neural networks for mixing multiple context and substitution-tolerant context models.

Findings: We benchmark GeCo3 as a reference-free DNA compressor on 5 datasets: a balanced and comprehensive dataset of DNA sequences, the Y-chromosome and human mitogenome, 2 compilations of archaeal and virus genomes, 4 whole genomes, and 2 collections of FASTQ data of a human virome and ancient DNA. GeCo3 achieves a solid improvement in compression over the previous version (GeCo2) of 2.4%, 7.1%, 6.1%, 5.8%, and 6.0%, respectively. To test its performance as a reference-based DNA compressor, we benchmark GeCo3 on 4 datasets constituted by the pairwise compression of the chromosomes of the genomes of several primates. GeCo3 improves the compression by 12.4%, 11.7%, 10.8%, and 10.1% over the state of the art. The cost of this compression improvement is some additional computational time (1.7–3 times slower than GeCo2). RAM use is constant, and the tool scales efficiently, independently of the sequence size. Overall, these values outperform the state of the art.

Conclusions: GeCo3 is a genomic sequence compressor with a neural network mixing approach that provides additional gains over top specific genomic compressors. The proposed mixing method is portable, requiring only the probabilities of the models as inputs, which allows easy adaptation to other data compressors or compression-based data analysis tools. GeCo3 is released under GPLv3 and is available for free download at https://github.com/cobilab/geco3.
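GeCo3's mixing method requires only the probabilities of the participating models as inputs. As a hedged illustration of that interface (not GeCo3's actual network architecture or training procedure), the sketch below combines per-model distributions over {A, C, G, T} with softmax-normalized weights; in GeCo3 the weights would come from a small neural network updated as the sequence is processed.

```python
# Minimal sketch of probability mixing in the spirit of GeCo3: several context
# models each predict a distribution over {A,C,G,T}, and mixing weights blend
# them. This is not GeCo3's actual architecture or training procedure.

import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def mix(model_probs, weight_logits):
    """Convex combination of per-model distributions via softmax weights."""
    w = softmax(weight_logits)
    return [sum(w[m] * model_probs[m][i] for m in range(len(model_probs)))
            for i in range(4)]  # result is itself a valid distribution

# two hypothetical context models' predictions for the next base (A, C, G, T);
# in GeCo3 the logits would be produced by a trained network, not fixed here
p1 = [0.70, 0.10, 0.10, 0.10]
p2 = [0.25, 0.25, 0.25, 0.25]
print(mix([p1, p2], weight_logits=[1.0, -1.0]))
```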


2019 ◽  
Vol 2019 ◽  
pp. 1-9 ◽  
Author(s):  
Maleeha Najam ◽  
Raihan Ur Rasool ◽  
Hafiz Farooq Ahmad ◽  
Usman Ashraf ◽  
Asad Waqar Malik

Storing and processing large DNA sequences has always been a major problem due to the increasing volume of DNA sequence data. A number of solutions have been proposed, but they require significant computation and memory; an efficient storage and pattern-matching solution for DNA sequencing data is therefore needed. Bloom filters (BFs) are an efficient data structure, used in bioinformatics mostly for the classification of DNA sequences. In this paper, we explore dimensions beyond classification in which BFs can be used. The proposed solution is based on Multiple Bloom Filters (MBFs) and finds all the locations and the number of repetitions of a specified pattern inside a DNA sequence. Both of these factors are extremely important in determining the type and intensity of any disease. This paper serves as a first effort towards optimizing the search for the location and frequency of substrings in DNA sequences using MBFs. We expect that further optimizations of the proposed solution can bring remarkable results, as this paper presents a proof-of-concept implementation of the proposed MBF technique on a given dataset. Performance evaluation shows improved accuracy and time efficiency of the proposed approach.
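The layout of the Multiple Bloom Filters (for example, one filter per sequence segment so that a hit also localizes the pattern) is not detailed in the abstract; the sketch below shows only the single-filter membership-test building block over k-mers, with hypothetical size and hash-count parameters.

```python
# Minimal Bloom filter over k-mers: the membership-test building block that a
# multiple-filter scheme could replicate per segment to localize pattern hits.

import hashlib

class BloomFilter:
    def __init__(self, size_bits: int = 1 << 16, num_hashes: int = 4):
        self.size = size_bits
        self.k = num_hashes
        self.bits = bytearray(size_bits // 8)

    def _positions(self, item: str):
        # derive k independent bit positions from salted SHA-256 digests
        for i in range(self.k):
            h = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.size

    def add(self, item: str):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item: str) -> bool:
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))

K = 8
bf = BloomFilter()
seq = "ACGTACGTGGCCAATT"
for i in range(len(seq) - K + 1):   # index every k-mer of the sequence
    bf.add(seq[i:i + K])
print("ACGTACGT" in bf)  # True
print("TTTTTTTT" in bf)  # False with high probability; false positives possible
```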


Algorithms ◽  
2020 ◽  
Vol 13 (6) ◽  
pp. 151
Author(s):  
Bruno Carpentieri

The memory and network traffic consumed by newly sequenced biological data have grown rapidly in recent years. Genomic projects such as HapMap and 1000 Genomes have contributed to the sharp rise in databases and network traffic related to genomic data and to the development of new efficient technologies. Large-scale sequencing of DNA samples has attracted new attention and produced new research, and the interest of the scientific community in genomic data has greatly increased. In a very short time, researchers have developed hardware tools, analysis software, algorithms, private databases, and infrastructures to support research in genomics. In this paper, we analyze different approaches for compressing digital files generated by Next-Generation Sequencing tools containing nucleotide sequences, and we evaluate the compression performance of generic compression algorithms by comparing them with Quip, a system designed by Jones et al. specifically for genomic file compression. Moreover, we present a simple but effective technique for the compression of DNA sequences in which we consider only the relevant DNA data, and we experimentally evaluate its performance.
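As a hedged illustration of the kind of baseline comparison described (Quip itself is an external tool and is not reproduced here), the sketch below strips a FASTA stream down to the nucleotide data alone and measures how Python's generic general-purpose compressors fare on it.

```python
# Sketch of a generic-compressor baseline: compress only the nucleotide stream,
# with headers and line breaks stripped, using standard-library compressors.

import bz2, lzma, zlib

def nucleotides_only(fasta_text: str) -> bytes:
    """Keep only A/C/G/T characters, dropping FASTA headers and line breaks."""
    lines = (l for l in fasta_text.splitlines() if not l.startswith(">"))
    return "".join(c for l in lines for c in l if c in "ACGT").encode()

# tiny hypothetical input, repeated so the compressors have structure to model
fasta = ">toy record\nACGTACGTACGTACGT\nACGTACGTACGTACGT\n"
data = nucleotides_only(fasta) * 1000

for name, fn in [("zlib", zlib.compress), ("bz2", bz2.compress),
                 ("lzma", lzma.compress)]:
    print(name, len(fn(data)), "bytes from", len(data))
```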


2008 ◽  
Vol 06 (02) ◽  
pp. 403-413 ◽  
Author(s):  
RAFAL POKRZYWA

The explosive growth in biological data in recent years has led to the development of new methods for identifying DNA sequences, and many recently developed algorithms search DNA data for unique sequences. This paper considers the application of the Burrows–Wheeler transform (BWT) to the problem of unique DNA sequence identification. The BWT transforms a block of data into a format that is extremely well suited for compression. This paper presents a time-efficient algorithm to search for unique DNA sequences in a set of genes, applicable to the identification of yeast species and other DNA sequence sets.
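The paper's unique-sequence search on top of the BWT is not detailed in the abstract; the sketch below shows only the transform itself, built naively by sorting all rotations, which is fine for illustration even though real tools use suffix-array construction.

```python
# Naive Burrows-Wheeler transform: sort all rotations of the (sentinel-
# terminated) string and take the last column. O(n^2 log n), illustration only.

def bwt(text: str, sentinel: str = "$") -> str:
    assert sentinel not in text
    s = text + sentinel
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(rot[-1] for rot in rotations)

# runs of identical characters in the output are what make BWT blocks
# compress so well
print(bwt("ACGTACGT"))
```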

