Genomic Sequence Data Compression using Lempel-Ziv-Welch Algorithm with Indexed Multiple Dictionary

With advances in technology and the development of high-throughput sequencing (HTS) systems, the amount of genomic data generated per day per laboratory across the globe is outpacing Moore's law. This data deluge poses problems for biologists, both for storage and for transmission across locations for further analysis. Compression of genomic data is a sensible way to address these problems. This paper discusses existing algorithms for compressing genomic data, as well as a few general-purpose algorithms, and proposes an LZW-based compression algorithm that uses indexed multiple dictionaries. The proposed method achieves an average compression ratio of 0.41 bits per base and an average compression time of 6.45 seconds for DNA sequences of average size 105.9 KB.
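For readers unfamiliar with the underlying technique, the sketch below is a minimal single-dictionary LZW encoder for a DNA string in Python; it illustrates the phrase-dictionary idea only and is not the indexed multiple-dictionary scheme proposed in the paper.

    def lzw_encode(sequence, alphabet="ACGT"):
        """Minimal LZW encoder: returns a list of integer codes for `sequence`."""
        dictionary = {symbol: i for i, symbol in enumerate(alphabet)}
        next_code = len(dictionary)
        phrase = ""
        codes = []
        for symbol in sequence:
            candidate = phrase + symbol
            if candidate in dictionary:
                phrase = candidate                 # keep extending the current phrase
            else:
                codes.append(dictionary[phrase])   # emit code for the longest known phrase
                dictionary[candidate] = next_code  # learn the new phrase
                next_code += 1
                phrase = symbol
        if phrase:
            codes.append(dictionary[phrase])
        return codes

    print(lzw_encode("ACGTACGTACGT"))  # [0, 1, 2, 3, 4, 6, 8, 3]

Repeated phrases such as "AC", "GT" and "ACG" are emitted as single codes once they enter the dictionary, which is where the compression comes from.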

2021 · Author(s): Ziheng Yang, Thomas Flouris

The multispecies coalescent with introgression (MSci) model accommodates both the coalescent process and cross-species introgression/hybridization events, two major processes that create genealogical fluctuations across the genome and gene-tree-species-tree discordance. Full likelihood implementations of the MSci model take such fluctuations as a major source of information about the history of species divergence and gene flow, and provide a powerful tool for estimating the direction, timing and strength of cross-species introgression using multilocus sequence data. However, introgression models, in particular those that accommodate bidirectional introgression (BDI), are known to cause unidentifiability issues of the label-switching type, whereby different models or parameters make the same predictions about the genomic data and thus cannot be distinguished by the data. Nevertheless, there has been no systematic study of unidentifiability when full likelihood methods are applied. Here we characterize the unidentifiability of arbitrary BDI models and derive simple rules for its identification. In general, an MSci model with k BDI events has 2^k unidentifiable towers in the posterior, with each BDI event between sister species creating within-model unidentifiability and each BDI event between non-sister species creating cross-model unidentifiability. We develop novel algorithms for processing Markov chain Monte Carlo (MCMC) samples to remove label switching and implement them in the BPP program. We analyze genomic sequence data from Heliconius butterflies as well as synthetic data to illustrate the utility of the BDI models and the new algorithms.
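As a rough illustration of the post-processing step (not the algorithm implemented in BPP), the sketch below relabels MCMC samples for a single BDI event by forcing every draw into one canonical tower via an ordering constraint on two exchangeable introgression-probability parameters; all parameter names and values are hypothetical.

    import numpy as np

    def remove_label_switching(samples, pair, swap_groups):
        """Map each MCMC draw to a canonical tower for one BDI event.

        samples     : dict of {parameter_name: 1-D numpy array of posterior draws}
        pair        : two parameter names whose roles are exchangeable
                      (e.g. the two introgression probabilities of a BDI event)
        swap_groups : (name_a, name_b) tuples that must be swapped together with
                      `pair` (e.g. the matching population-size parameters)
        """
        a, b = pair
        flip = samples[a] < samples[b]          # canonical rule: keep pair[0] >= pair[1]
        for x, y in [pair] + list(swap_groups):
            vx, vy = samples[x].copy(), samples[y].copy()
            samples[x][flip], samples[y][flip] = vy[flip], vx[flip]
        return samples

    # Hypothetical example: two exchangeable introgression probabilities and the
    # corresponding population-size parameters from four posterior draws.
    draws = {
        "phi_X":   np.array([0.1, 0.8, 0.2, 0.9]),
        "phi_Y":   np.array([0.9, 0.2, 0.7, 0.1]),
        "theta_X": np.array([1.0, 2.0, 3.0, 4.0]),
        "theta_Y": np.array([5.0, 6.0, 7.0, 8.0]),
    }
    remove_label_switching(draws, ("phi_X", "phi_Y"), [("theta_X", "theta_Y")])
    print(draws["phi_X"])   # [0.9 0.8 0.7 0.9] -- every draw now sits in the same tower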


2020 · Author(s): Tomasz Kowalski, Szymon Grabowski

Abstract
Motivation: FASTQ remains among the widely used formats for high-throughput sequencing data. Despite advances in specialized FASTQ compressors, they are still imperfect in terms of practical performance trade-offs.
Results: We present a multi-threaded version of Pseudogenome-based Read Compressor (PgRC), an in-memory algorithm for compressing the DNA stream, based on the idea of building an approximation of the shortest common superstring over high-quality reads. The current version, v1.2, practically preserves the compression ratio and decompression speed of the previous one, reducing the compression time by a factor of about 4–5 on a 6-core/12-thread machine.
Availability: PgRC 1.2 can be downloaded from https://github.com/kowallus/
Contact: [email protected]
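The "approximation of the shortest common superstring" mentioned above can be illustrated with the textbook greedy heuristic; the sketch below is not PgRC's construction, only a minimal demonstration of the idea.

    from itertools import permutations

    def overlap(a, b):
        """Length of the longest suffix of a that is also a prefix of b."""
        for k in range(min(len(a), len(b)), 0, -1):
            if a.endswith(b[:k]):
                return k
        return 0

    def greedy_superstring(reads):
        """Greedy approximation of the shortest common superstring of `reads`."""
        reads = list(dict.fromkeys(reads))            # drop duplicate reads
        while len(reads) > 1:
            best, best_len = None, -1
            for a, b in permutations(reads, 2):       # try every ordered pair
                k = overlap(a, b)
                if k > best_len:
                    best, best_len = (a, b), k
            a, b = best
            reads.remove(a)
            reads.remove(b)
            reads.append(a + b[best_len:])            # merge the best-overlapping pair
        return reads[0]

    print(greedy_superstring(["ACGTAC", "GTACGG", "ACGGTT"]))  # ACGTACGGTT

The resulting superstring acts as a "pseudogenome" against which the original reads can be stored compactly as positions.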


2020 · Vol 7 (9) · pp. 201206 · Author(s): Margarita Hernandez, Mary K. Shenk, George H. Perry

Scholars have noted major disparities in the extent of scientific research conducted among taxonomic groups. Such trends may cascade if future scientists gravitate towards study species with more data and resources already available. As new technologies emerge, do research studies employing these technologies continue these disparities? Here, using non-human primates as a case study, we identified disparities in massively parallel genomic sequencing data and conducted interviews with scientists who produced these data to learn their motivations when selecting study species. We tested whether variables including publication history and conservation status were significantly correlated with publicly available sequence data in the NCBI Sequence Read Archive (SRA). Of the 179.6 terabases (Tb) of sequence data in SRA for 519 non-human primate species, 135 Tb (approx. 75%) were from only five species: rhesus macaques, olive baboons, green monkeys, chimpanzees and crab-eating macaques. The strongest predictors of the amount of genomic data were the total number of non-medical publications (linear regression; r² = 0.37; p = 6.15 × 10⁻¹²) and the number of medical publications (r² = 0.27; p = 9.27 × 10⁻⁹). In a generalized linear model, the number of non-medical publications (p = 0.00064) and closer phylogenetic distance to humans (p = 0.024) were the most predictive of the amount of genomic sequence data. We interviewed 33 authors of genomic data-producing publications and analysed their responses using grounded theory. Consistent with our quantitative results, authors mentioned their choice of species was motivated by sample accessibility, prior published work and relevance to human medicine. Our mixed-methods approach helped identify and contextualize some of the driving factors behind species-uneven patterns of scientific research, which can now be considered by funding agencies, scientific societies and research teams aiming to align their broader goals with future data generation efforts.
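A minimal, hypothetical sketch of the kind of regression reported above, using made-up per-species values rather than the study's data:

    import numpy as np
    from scipy import stats

    # Hypothetical per-species values: non-medical publication counts and total
    # genomic sequence data in terabases (NOT the values from the study).
    publications = np.array([12, 150, 4300, 900, 35, 2600])
    terabases    = np.array([0.02, 0.9, 48.0, 6.5, 0.1, 21.0])

    # Ordinary least-squares fit of log-transformed data deposits against
    # log-transformed publication counts.
    result = stats.linregress(np.log10(publications), np.log10(terabases))
    print(f"r^2 = {result.rvalue**2:.2f}, p = {result.pvalue:.3g}")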


2019 · Vol 35 (19) · pp. 3826-3828 · Author(s): Kirill Kryukov, Mahoko Takahashi Ueda, So Nakagawa, Tadashi Imanishi

Abstract
Summary: DNA sequence databases use compression such as gzip to reduce the required storage space and network transmission time. We describe Nucleotide Archival Format (NAF), a new file format for lossless reference-free compression of FASTA- and FASTQ-formatted nucleotide sequences. NAF's compression ratio is comparable to the best DNA compressors, while providing dramatically faster decompression. We compared our format with the DNA compressors DELIMINATE and MFCompress, and with the general-purpose compressors gzip, bzip2, xz, brotli and zstd.
Availability and implementation: The NAF compressor and decompressor, as well as the format specification, are available at https://github.com/KirillKryukov/naf. The format specification is in the public domain. The compressor and decompressor are open source under the zlib/libpng license, free for nearly any use.
Supplementary information: Supplementary data are available at Bioinformatics online.
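NAF itself is a standalone command-line tool (see the repository above). As a rough illustration of how the general-purpose baselines in the comparison (gzip, bzip2, xz) behave, the Python sketch below measures their compression ratios on a toy FASTA record using only the standard library; the record is made up and real data would give different numbers.

    import gzip, bz2, lzma

    # A toy FASTA record; replace with real data to get meaningful numbers.
    fasta = (">toy_record\n" + "ACGT" * 2500 + "\n").encode()

    for name, compress in [("gzip", gzip.compress),
                           ("bzip2", bz2.compress),
                           ("xz", lzma.compress)]:
        compressed = compress(fasta)
        ratio = len(fasta) / len(compressed)
        print(f"{name:>5}: {len(compressed)} bytes, ratio {ratio:.1f}x")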


2016 · Vol 78 (6-4) · Author(s): Atabak Kheirkhah, Salwani Mohd Daud, Noor Azurati Ahmad @ Salleh, Suriani Mohd Sam, Hafiza Abas, ...

This paper reviews computational methods and high-throughput automated tools for precisely predicting the various functions of uncharacterized proteins from their DNA sequence information alone, and then proposes a hybrid weighted-network and genetic-algorithm approach to improve prediction. The main advantage of the method is that protein function and DNA sequence prediction can be computed precisely using the best-fitness parent in the genetic algorithm. With the completion of human genome sequencing, the number of sequence-known proteins has increased exponentially, while the pace of determining their biological attributes is much slower. The gap between DNA sequence variants and their functions has become increasingly large. Nevertheless, detection of sequences against the Protein Data Bank has become a benchmark for many researchers. As the amount of DNA sequence data continues to increase, this fundamental problem stays at the forefront of genome analysis. In developing these methods, the following matters often need to be considered: benchmark dataset construction, gene sequence prediction, operating algorithm, anticipated accuracy, gene recommender and functional integrations. In this review, we discuss each of them, with a particular focus on operational algorithms and how to increase the accuracy of DNA sequence variant prediction.
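The "best-fitness parent" idea can be illustrated with a bare-bones genetic algorithm; the fitness function below is a placeholder (closeness of GC content to a target), not the hybrid weighted-network score of the proposed method, and all parameters are illustrative.

    import random

    def fitness(seq, target_gc=0.6):
        """Placeholder fitness: closeness of GC content to a target value."""
        gc = (seq.count("G") + seq.count("C")) / len(seq)
        return 1.0 - abs(gc - target_gc)

    def evolve(pop_size=20, length=30, generations=50):
        pop = ["".join(random.choice("ACGT") for _ in range(length))
               for _ in range(pop_size)]
        for _ in range(generations):
            pop.sort(key=fitness, reverse=True)
            best = pop[0]                              # best-fitness parent of this generation
            children = []
            for _ in range(pop_size - 1):
                other = random.choice(pop[:10])        # mate with another good individual
                cut = random.randrange(1, length)
                child = list(best[:cut] + other[cut:]) # single-point crossover
                i = random.randrange(length)
                child[i] = random.choice("ACGT")       # point mutation
                children.append("".join(child))
            pop = [best] + children                    # elitism: carry the best parent over
        return max(pop, key=fitness)

    print(evolve())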


2013 · Vol 16 (1) · pp. 1-15 · Author(s): Z. Zhu, Y. Zhang, Z. Ji, S. He, X. Yang

2016 · Vol 14 (03) · pp. 1630002 · Author(s): Muhammad Sardaraz, Muhammad Tahir, Ataul Aziz Ikram

Advances in high-throughput sequencing technologies and reductions in the cost of sequencing have led to exponential growth in high-throughput DNA sequence data. This growth has posed challenges such as storage, retrieval, and transmission of sequencing data. Data compression is used to cope with these challenges. Various methods have been developed to compress genomic and sequencing data. In this article, we present a comprehensive review of compression methods for genomes and sequencing reads. Algorithms are categorized as referential or reference-free. Experimental results and a comparative analysis of various methods for data compression are presented. Finally, key challenges and research directions in DNA sequence data compression are highlighted.
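To make the referential versus reference-free distinction concrete, here is a toy referential encoder that stores a target sequence only as substitutions against a reference; it is a simplified illustration, not any specific published algorithm.

    def referential_encode(reference, target):
        """Toy referential encoding: store only positions where target differs
        from the reference (assumes equal lengths and substitutions only)."""
        assert len(reference) == len(target)
        return [(i, t) for i, (r, t) in enumerate(zip(reference, target)) if r != t]

    def referential_decode(reference, edits):
        """Rebuild the target by applying the stored substitutions to the reference."""
        seq = list(reference)
        for i, base in edits:
            seq[i] = base
        return "".join(seq)

    ref    = "ACGTACGTACGTACGT"
    sample = "ACGTACCTACGTACGA"
    edits = referential_encode(ref, sample)
    print(edits)                                   # [(6, 'C'), (15, 'A')]
    assert referential_decode(ref, edits) == sample

A reference-free compressor, by contrast, must exploit redundancy within the sequence itself, as the LZW and NAF entries above do.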


2018 · Author(s): Kirill Kryukov, Mahoko Takahashi Ueda, So Nakagawa, Tadashi Imanishi

Abstract
Summary: DNA sequence databases use compression such as gzip to reduce the required storage space and network transmission time. We describe Nucleotide Archival Format (NAF), a new file format for lossless reference-free compression of FASTA- and FASTQ-formatted nucleotide sequences. NAF's compression ratio is comparable to the best DNA compressors, while providing dramatically faster decompression. We compared our format with the DNA compressors DELIMINATE and MFCompress, and with the general-purpose compressors gzip, bzip2, xz, brotli, and zstd.
Availability: The NAF compressor and decompressor, as well as the format specification, are available at https://github.com/KirillKryukov/naf. The format specification is in the public domain. The compressor and decompressor are open source under the zlib/libpng license, free for nearly any use.
Contact: [email protected]

