Genomic Sequence Data Compression using Lempel-Ziv-Welch Algorithm with Indexed Multiple Dictionary

With advances in technology and the development of high-throughput sequencing (HTS) systems, the amount of genomic data generated per day per laboratory across the globe is outpacing Moore's law. This data deluge poses problems for biologists, both for storage and for transmission across locations for further analysis. Compression of genomic data is a sensible way to address these problems. This paper discusses existing algorithms for compressing genomic data, as well as a few general-purpose algorithms, and proposes an LZW-based compression algorithm that uses indexed multiple dictionaries. The proposed method achieves an average compression ratio of 0.41 bits per base and an average compression time of 6.45 seconds for DNA sequences of average size 105.9 KB.
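For readers unfamiliar with the underlying technique, the sketch below is a minimal single-dictionary LZW encoder for a DNA string in Python; it illustrates the phrase-dictionary idea only and is not the indexed multiple-dictionary scheme proposed in the paper.

    def lzw_encode(sequence, alphabet="ACGT"):
        """Minimal LZW encoder: returns a list of integer codes for `sequence`."""
        dictionary = {symbol: i for i, symbol in enumerate(alphabet)}
        next_code = len(dictionary)
        phrase = ""
        codes = []
        for symbol in sequence:
            candidate = phrase + symbol
            if candidate in dictionary:
                phrase = candidate                 # keep extending the current phrase
            else:
                codes.append(dictionary[phrase])   # emit code for the longest known phrase
                dictionary[candidate] = next_code  # learn the new phrase
                next_code += 1
                phrase = symbol
        if phrase:
            codes.append(dictionary[phrase])
        return codes

    print(lzw_encode("ACGTACGTACGT"))  # [0, 1, 2, 3, 4, 6, 8, 3]

Repeated phrases such as "AC", "GT" and "ACG" are emitted as single codes once they enter the dictionary, which is where the compression comes from.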

2021 · Author(s): Ziheng Yang, Thomas Flouris

The multispecies coalescent with introgression (MSci) model accommodates both the coalescent process and cross-species introgression/hybridization events, two major processes that create genealogical fluctuations across the genome and gene-tree-species-tree discordance. Full likelihood implementations of the MSci model take such fluctuations as a major source of information about the history of species divergence and gene flow, and provide a powerful tool for estimating the direction, timing and strength of cross-species introgression using multilocus sequence data. However, introgression models, in particular those that accommodate bidirectional introgression (BDI), are known to cause unidentifiability issues of the label-switching type, whereby different models or parameters make the same predictions about the genomic data and thus cannot be distinguished by the data. Nevertheless, there has been no systematic study of unidentifiability when full likelihood methods are applied. Here we characterize the unidentifiability of arbitrary BDI models and derive simple rules for its identification. In general, an MSci model with k BDI events has 2^k unidentifiable towers in the posterior, with each BDI event between sister species creating within-model unidentifiability and each BDI event between non-sister species creating cross-model unidentifiability. We develop novel algorithms for processing Markov chain Monte Carlo (MCMC) samples to remove label switching and implement them in the BPP program. We analyze genomic sequence data from Heliconius butterflies as well as synthetic data to illustrate the utility of the BDI models and the new algorithms.
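As a rough illustration of the post-processing step (not the algorithm implemented in BPP), the sketch below relabels MCMC samples for a single BDI event by forcing every draw into one canonical tower via an ordering constraint on two exchangeable introgression-probability parameters; all parameter names and values are hypothetical.

    import numpy as np

    def remove_label_switching(samples, pair, swap_groups):
        """Map each MCMC draw to a canonical tower for one BDI event.

        samples     : dict of {parameter_name: 1-D numpy array of posterior draws}
        pair        : two parameter names whose roles are exchangeable
                      (e.g. the two introgression probabilities of a BDI event)
        swap_groups : (name_a, name_b) tuples that must be swapped together with
                      `pair` (e.g. the matching population-size parameters)
        """
        a, b = pair
        flip = samples[a] < samples[b]          # canonical rule: keep pair[0] >= pair[1]
        for x, y in [pair] + list(swap_groups):
            vx, vy = samples[x].copy(), samples[y].copy()
            samples[x][flip], samples[y][flip] = vy[flip], vx[flip]
        return samples

    # Hypothetical example: two exchangeable introgression probabilities and the
    # corresponding population-size parameters from four posterior draws.
    draws = {
        "phi_X":   np.array([0.1, 0.8, 0.2, 0.9]),
        "phi_Y":   np.array([0.9, 0.2, 0.7, 0.1]),
        "theta_X": np.array([1.0, 2.0, 3.0, 4.0]),
        "theta_Y": np.array([5.0, 6.0, 7.0, 8.0]),
    }
    remove_label_switching(draws, ("phi_X", "phi_Y"), [("theta_X", "theta_Y")])
    print(draws["phi_X"])   # [0.9 0.8 0.7 0.9] -- every draw now sits in the same tower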


2020 · Author(s): Tomasz Kowalski, Szymon Grabowski

Abstract
Motivation: FASTQ remains among the widely used formats for high-throughput sequencing data. Despite advances in specialized FASTQ compressors, they are still imperfect in terms of practical performance trade-offs.
Results: We present a multi-threaded version of Pseudogenome-based Read Compressor (PgRC), an in-memory algorithm for compressing the DNA stream, based on the idea of building an approximation of the shortest common superstring over high-quality reads. The current version, v1.2, practically preserves the compression ratio and decompression speed of the previous one, reducing the compression time by a factor of about 4–5 on a 6-core/12-thread machine.
Availability: PgRC 1.2 can be downloaded from https://github.com/kowallus/
Contact: [email protected]
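The "approximation of the shortest common superstring" mentioned above can be illustrated with the textbook greedy heuristic; the sketch below is not PgRC's construction, only a minimal demonstration of the idea.

    from itertools import permutations

    def overlap(a, b):
        """Length of the longest suffix of a that is also a prefix of b."""
        for k in range(min(len(a), len(b)), 0, -1):
            if a.endswith(b[:k]):
                return k
        return 0

    def greedy_superstring(reads):
        """Greedy approximation of the shortest common superstring of `reads`."""
        reads = list(dict.fromkeys(reads))            # drop duplicate reads
        while len(reads) > 1:
            best, best_len = None, -1
            for a, b in permutations(reads, 2):       # try every ordered pair
                k = overlap(a, b)
                if k > best_len:
                    best, best_len = (a, b), k
            a, b = best
            reads.remove(a)
            reads.remove(b)
            reads.append(a + b[best_len:])            # merge the best-overlapping pair
        return reads[0]

    print(greedy_superstring(["ACGTAC", "GTACGG", "ACGGTT"]))  # ACGTACGGTT

The resulting superstring acts as a "pseudogenome" against which the original reads can be stored compactly as positions.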


2020 · Vol 7 (9) · pp. 201206 · Author(s): Margarita Hernandez, Mary K. Shenk, George H. Perry

Scholars have noted major disparities in the extent of scientific research conducted among taxonomic groups. Such trends may cascade if future scientists gravitate towards study species with more data and resources already available. As new technologies emerge, do research studies employing these technologies continue these disparities? Here, using non-human primates as a case study, we identified disparities in massively parallel genomic sequencing data and conducted interviews with scientists who produced these data to learn their motivations when selecting study species. We tested whether variables including publication history and conservation status were significantly correlated with publicly available sequence data in the NCBI Sequence Read Archive (SRA). Of the 179.6 terabases (Tb) of sequence data in SRA for 519 non-human primate species, 135 Tb (approx. 75%) were from only five species: rhesus macaques, olive baboons, green monkeys, chimpanzees and crab-eating macaques. The strongest predictors of the amount of genomic data were the total number of non-medical publications (linear regression; r² = 0.37; p = 6.15 × 10⁻¹²) and the number of medical publications (r² = 0.27; p = 9.27 × 10⁻⁹). In a generalized linear model, the number of non-medical publications (p = 0.00064) and closer phylogenetic distance to humans (p = 0.024) were the most predictive of the amount of genomic sequence data. We interviewed 33 authors of genomic data-producing publications and analysed their responses using grounded theory. Consistent with our quantitative results, authors mentioned their choice of species was motivated by sample accessibility, prior published work and relevance to human medicine. Our mixed-methods approach helped identify and contextualize some of the driving factors behind species-uneven patterns of scientific research, which can now be considered by funding agencies, scientific societies and research teams aiming to align their broader goals with future data generation efforts.
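A minimal, hypothetical sketch of the kind of regression reported above, using made-up per-species values rather than the study's data:

    import numpy as np
    from scipy import stats

    # Hypothetical per-species values: non-medical publication counts and total
    # genomic sequence data in terabases (NOT the values from the study).
    publications = np.array([12, 150, 4300, 900, 35, 2600])
    terabases    = np.array([0.02, 0.9, 48.0, 6.5, 0.1, 21.0])

    # Ordinary least-squares fit of log-transformed data deposits against
    # log-transformed publication counts.
    result = stats.linregress(np.log10(publications), np.log10(terabases))
    print(f"r^2 = {result.rvalue**2:.2f}, p = {result.pvalue:.3g}")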


2019 · Vol 35 (19) · pp. 3826-3828 · Author(s): Kirill Kryukov, Mahoko Takahashi Ueda, So Nakagawa, Tadashi Imanishi

Abstract
Summary: DNA sequence databases use compression such as gzip to reduce the required storage space and network transmission time. We describe Nucleotide Archival Format (NAF), a new file format for lossless reference-free compression of FASTA- and FASTQ-formatted nucleotide sequences. NAF's compression ratio is comparable to the best DNA compressors, while providing dramatically faster decompression. We compared our format with the DNA compressors DELIMINATE and MFCompress, and with the general-purpose compressors gzip, bzip2, xz, brotli and zstd.
Availability and implementation: The NAF compressor and decompressor, as well as the format specification, are available at https://github.com/KirillKryukov/naf. The format specification is in the public domain. The compressor and decompressor are open source under the zlib/libpng license, free for nearly any use.
Supplementary information: Supplementary data are available at Bioinformatics online.
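NAF itself is a standalone command-line tool (see the repository above). As a rough illustration of how the general-purpose baselines in the comparison (gzip, bzip2, xz) behave, the Python sketch below measures their compression ratios on a toy FASTA record using only the standard library; the record is made up and real data would give different numbers.

    import gzip, bz2, lzma

    # A toy FASTA record; replace with real data to get meaningful numbers.
    fasta = (">toy_record\n" + "ACGT" * 2500 + "\n").encode()

    for name, compress in [("gzip", gzip.compress),
                           ("bzip2", bz2.compress),
                           ("xz", lzma.compress)]:
        compressed = compress(fasta)
        ratio = len(fasta) / len(compressed)
        print(f"{name:>5}: {len(compressed)} bytes, ratio {ratio:.1f}x")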


2016 · Vol 78 (6-4) · Author(s): Atabak Kheirkhah, Salwani Mohd Daud, Noor Azurati Ahmad @ Salleh, Suriani Mohd Sam, Hafiza Abas, ...

This paper reviews computational methods and high-throughput automated tools for precisely predicting the various functions of uncharacterized proteins from their DNA sequence information alone, and then proposes a hybrid weighted-network and genetic-algorithm approach to improve prediction. The main advantage of the method is that protein function and DNA sequence prediction can be computed precisely using the best-fitness parent in the genetic algorithm. With the completion of human genome sequencing, the number of sequence-known proteins has increased exponentially, while the pace of determining their biological attributes is much slower. The gap between DNA sequence variants and their functions has become increasingly large. Nevertheless, detection of sequences against the Protein Data Bank has become a benchmark for many researchers. As the amount of DNA sequence data continues to increase, this fundamental problem stays at the forefront of genome analysis. In developing these methods, the following matters often need to be considered: benchmark dataset construction, gene sequence prediction, operating algorithm, anticipated accuracy, gene recommender and functional integrations. In this review, we discuss each of them, with a particular focus on operational algorithms and how to increase the accuracy of DNA sequence variant prediction.
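The "best-fitness parent" idea can be illustrated with a bare-bones genetic algorithm; the fitness function below is a placeholder (closeness of GC content to a target), not the hybrid weighted-network score of the proposed method, and all parameters are illustrative.

    import random

    def fitness(seq, target_gc=0.6):
        """Placeholder fitness: closeness of GC content to a target value."""
        gc = (seq.count("G") + seq.count("C")) / len(seq)
        return 1.0 - abs(gc - target_gc)

    def evolve(pop_size=20, length=30, generations=50):
        pop = ["".join(random.choice("ACGT") for _ in range(length))
               for _ in range(pop_size)]
        for _ in range(generations):
            pop.sort(key=fitness, reverse=True)
            best = pop[0]                              # best-fitness parent of this generation
            children = []
            for _ in range(pop_size - 1):
                other = random.choice(pop[:10])        # mate with another good individual
                cut = random.randrange(1, length)
                child = list(best[:cut] + other[cut:]) # single-point crossover
                i = random.randrange(length)
                child[i] = random.choice("ACGT")       # point mutation
                children.append("".join(child))
            pop = [best] + children                    # elitism: carry the best parent over
        return max(pop, key=fitness)

    print(evolve())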


2013 · Vol 16 (1) · pp. 1-15 · Author(s): Z. Zhu, Y. Zhang, Z. Ji, S. He, X. Yang

2016 · Vol 14 (03) · pp. 1630002 · Author(s): Muhammad Sardaraz, Muhammad Tahir, Ataul Aziz Ikram

Advances in high-throughput sequencing technologies and reductions in the cost of sequencing have led to exponential growth in high-throughput DNA sequence data. This growth has posed challenges such as storage, retrieval, and transmission of sequencing data. Data compression is used to cope with these challenges. Various methods have been developed to compress genomic and sequencing data. In this article, we present a comprehensive review of compression methods for genomes and sequencing reads. Algorithms are categorized as referential or reference-free. Experimental results and a comparative analysis of various methods for data compression are presented. Finally, key challenges and research directions in DNA sequence data compression are highlighted.
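To make the referential versus reference-free distinction concrete, here is a toy referential encoder that stores a target sequence only as substitutions against a reference; it is a simplified illustration, not any specific published algorithm.

    def referential_encode(reference, target):
        """Toy referential encoding: store only positions where target differs
        from the reference (assumes equal lengths and substitutions only)."""
        assert len(reference) == len(target)
        return [(i, t) for i, (r, t) in enumerate(zip(reference, target)) if r != t]

    def referential_decode(reference, edits):
        """Rebuild the target by applying the stored substitutions to the reference."""
        seq = list(reference)
        for i, base in edits:
            seq[i] = base
        return "".join(seq)

    ref    = "ACGTACGTACGTACGT"
    sample = "ACGTACCTACGTACGA"
    edits = referential_encode(ref, sample)
    print(edits)                                   # [(6, 'C'), (15, 'A')]
    assert referential_decode(ref, edits) == sample

A reference-free compressor, by contrast, must exploit redundancy within the sequence itself, as the LZW and NAF entries above do.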


2018 · Author(s): Kirill Kryukov, Mahoko Takahashi Ueda, So Nakagawa, Tadashi Imanishi

Abstract
Summary: DNA sequence databases use compression such as gzip to reduce the required storage space and network transmission time. We describe Nucleotide Archival Format (NAF), a new file format for lossless reference-free compression of FASTA- and FASTQ-formatted nucleotide sequences. NAF's compression ratio is comparable to the best DNA compressors, while providing dramatically faster decompression. We compared our format with the DNA compressors DELIMINATE and MFCompress, and with the general-purpose compressors gzip, bzip2, xz, brotli, and zstd.
Availability: The NAF compressor and decompressor, as well as the format specification, are available at https://github.com/KirillKryukov/naf. The format specification is in the public domain. The compressor and decompressor are open source under the zlib/libpng license, free for nearly any use.
Contact: [email protected]

