CROMqs: An infinitesimal successive refinement lossy compressor for the quality scores

2020 ◽  
Vol 18 (06) ◽  
pp. 2050031
Author(s):  
Albert No ◽  
Mikel Hernaez ◽  
Idoia Ochoa

The amount of sequencing data is growing at a fast pace due to rapid advances in sequencing technologies. Quality scores, which indicate the reliability of each called nucleotide, take up a significant portion of the sequencing data. In addition, quality scores are more challenging to compress than nucleotides, and they are often noisy. Hence, a natural solution for further decreasing the size of the sequencing data is to apply lossy compression to the quality scores. Lossy compression results in a loss of precision; however, it has been shown that, when operating at certain rates, lossy compression can achieve variant calling performance similar to that achieved with the losslessly compressed data (i.e. the original data). We propose Coding with Random Orthogonal Matrices for quality scores (CROMqs), the first lossy compressor for quality scores designed with the “infinitesimal successive refinability” property. With this property, the encoder needs to compress the data only once, at a high rate, while the decoder can decompress it iteratively, reconstructing the set of quality scores at each step with lower distortion each time. This characteristic is particularly useful in sequencing data compression, since the encoder generally does not know the most appropriate compression rate, e.g. one that does not degrade variant calling accuracy. CROMqs thus avoids the need to compress the data at multiple rates, saving time. In addition to this property, we show that CROMqs obtains rate-distortion performance comparable to the state-of-the-art lossy compressors. Moreover, we show that it achieves variant calling performance comparable to that of the losslessly compressed data while reducing the size by more than 50%.
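The successive refinement idea can be illustrated with a toy sketch (not the authors' implementation): the encoder projects the quality-score vector onto a random orthogonal basis once, and the decoder reconstructs after each received coefficient, so distortion shrinks monotonically as more of the stream is consumed. Sizes and values below are illustrative, and the quantization step used by a real CROM-style coder is omitted.

```python
import numpy as np

rng = np.random.default_rng(0)

def random_orthogonal(n):
    """Random orthogonal matrix via QR decomposition of a Gaussian matrix."""
    q, _ = np.linalg.qr(rng.normal(size=(n, n)))
    return q

# Toy quality-score vector (Phred-like values 0..40).
q_scores = rng.integers(0, 41, size=64).astype(float)

U = random_orthogonal(len(q_scores))
coeffs = U.T @ q_scores            # encoder: project once, at "full rate"

# Decoder: successively refine by using more coefficients each time.
# A real coder would quantize the coefficients; skipped here for clarity.
for k in range(16, len(coeffs) + 1, 16):
    recon = U[:, :k] @ coeffs[:k]
    mse = np.mean((q_scores - recon) ** 2)
    print(f"coefficients used: {k:3d}  MSE: {mse:8.3f}")
```

The printed MSE is non-increasing in the number of coefficients, which is exactly the property that lets the decoder stop at whatever rate turns out to be sufficient.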

2015 ◽  
Author(s):  
Idoia Ochoa ◽  
Mikel Hernaez ◽  
Rachel Goldfeder ◽  
Tsachy Weissman ◽  
Euan Ashley

Recent advances in sequencing technology have led to a drastic reduction in the cost of genome sequencing. This development has generated an unprecedented amount of genomic data that must be stored, processed, and communicated. To facilitate this effort, compression of genomic files has been proposed. Specifically, lossy compression of quality scores is emerging as a natural candidate for reducing the growing costs of storage. A main goal of performing DNA sequencing in population studies and clinical settings is to identify genetic variation. Though the field agrees that smaller files are advantageous, the cost of lossy compression, in terms of variant discovery, is unclear. Bioinformatic algorithms that identify SNPs and INDELs from next-generation DNA sequencing data use base quality score information; here, we evaluate the effect of lossy compression of quality scores on SNP and INDEL detection. We analyze several lossy compressors recently introduced in the literature. Specifically, we investigate how the output of the variant caller when using the original (uncompressed) data differs from that obtained when quality scores are replaced by those generated by a lossy compressor. Using gold-standard genomic datasets, such as the GIAB (Genome In A Bottle) consensus sequence for NA12878, and simulated data, we are able to analyze how accurate the variant calling output is, both for the original data and for the previously lossily compressed data. We show that lossy compression can significantly alleviate storage requirements while maintaining variant calling performance comparable to that with the uncompressed data. Further, in some cases lossy compression can lead to variant calling performance that is superior to that obtained with the uncompressed file. We envisage our findings and framework serving as a benchmark in the future development and analysis of lossy genomic data compressors. The Supplementary Data can be found at http://web.stanford.edu/~iochoa/supplementEffectLossy.zip.
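The kind of comparison described can be sketched as follows, assuming hypothetical file names and a deliberately simple position-based matcher (real benchmarks such as hap.py handle representation differences far more carefully): callsets produced from the original and from the lossily compressed quality scores are each scored against a truth set.

```python
# Illustrative comparison of variant callsets against a truth set.
# Matching is by (chrom, pos, ref, alt) only; file names are placeholders.

def load_variants(vcf_path):
    """Very small VCF reader: returns a set of (chrom, pos, ref, alt)."""
    calls = set()
    with open(vcf_path) as fh:
        for line in fh:
            if line.startswith("#"):
                continue
            chrom, pos, _, ref, alt = line.split("\t")[:5]
            calls.add((chrom, int(pos), ref, alt))
    return calls

def sensitivity_precision(calls, truth):
    tp = len(calls & truth)
    sens = tp / len(truth) if truth else 0.0
    prec = tp / len(calls) if calls else 0.0
    return sens, prec

truth = load_variants("giab_truth.vcf")               # hypothetical paths
for label, vcf in [("original", "calls_original.vcf"),
                   ("lossy", "calls_lossy_q.vcf")]:
    sens, prec = sensitivity_precision(load_variants(vcf), truth)
    print(f"{label:9s}  sensitivity={sens:.4f}  precision={prec:.4f}")
```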


2021 ◽  
Vol 12 ◽  
Author(s):  
Davide Bolognini ◽  
Alberto Magi

Structural variants (SVs) are genomic rearrangements that involve at least 50 nucleotides and are known to have a serious impact on human health. While prior short-read sequencing technologies have often proved inadequate for a comprehensive assessment of structural variation, more recent long reads from Oxford Nanopore Technologies have already been proven invaluable for the discovery of large SVs and hold the potential to facilitate the resolution of the full SV spectrum. With many long-read sequencing studies to follow, it is crucial to assess factors affecting current SV calling pipelines for nanopore sequencing data. In this brief research report, we evaluate and compare the performances of five long-read SV callers across four long-read aligners using both real and synthetic nanopore datasets. In particular, we focus on the effects of read alignment, sequencing coverage, and variant allele depth on the detection and genotyping of SVs of different types and size ranges and provide insights into precision and recall of SV callsets generated by integrating the various long-read aligners and SV callers. The computational pipeline we propose is publicly available at https://github.com/davidebolo1993/EViNCe and can be adjusted to further evaluate future nanopore sequencing datasets.
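As an illustration of how precision and recall of an SV callset can be scored against a truth set (the EViNCe pipeline relies on dedicated benchmarking tools; the toy matcher below, which compares type, breakpoint position and length within fixed tolerances, is an assumption made here for clarity):

```python
# Toy SV benchmarking: a call matches a truth SV if the type agrees and both
# breakpoint position and length are within illustrative tolerances.

POS_TOL, LEN_RATIO = 500, 0.7   # assumed thresholds

def matches(call, truth_sv):
    same_type = call["type"] == truth_sv["type"]
    close = abs(call["pos"] - truth_sv["pos"]) <= POS_TOL
    ratio = min(call["len"], truth_sv["len"]) / max(call["len"], truth_sv["len"])
    return same_type and close and ratio >= LEN_RATIO

def precision_recall(calls, truth):
    tp = sum(any(matches(c, t) for t in truth) for c in calls)
    recalled = sum(any(matches(c, t) for c in calls) for t in truth)
    precision = tp / len(calls) if calls else 0.0
    recall = recalled / len(truth) if truth else 0.0
    return precision, recall

calls = [{"type": "DEL", "pos": 10_500, "len": 320}]          # toy data
truth = [{"type": "DEL", "pos": 10_450, "len": 300},
         {"type": "INS", "pos": 55_000, "len": 1_200}]
print(precision_recall(calls, truth))  # -> (1.0, 0.5)
```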


2017 ◽  
Author(s):  
Merly Escalona ◽  
Sara Rocha ◽  
David Posada

Motivation: Advances in sequencing technologies have made it feasible to obtain massive datasets for phylogenomic inference, often consisting of large numbers of loci from multiple species and individuals. The phylogenomic analysis of next-generation sequencing (NGS) data implies a complex computational pipeline in which multiple technical and methodological decisions can influence the final tree obtained, such as those related to coverage, assembly, mapping, variant calling and/or phasing.
Results: To assess the influence of these variables we introduce NGSphy, an open-source tool for the simulation of Illumina reads/read counts obtained from haploid/diploid individual genomes with thousands of independent gene families evolving under a common species tree. In order to resemble real NGS experiments, NGSphy includes multiple options to model sequencing coverage (depth) heterogeneity across species, individuals and loci, including off-target or uncaptured loci. For comprehensive simulations covering multiple evolutionary scenarios, parameter values for the different replicates can be sampled from user-defined statistical distributions.
Availability: Source code, full documentation and tutorials including a quick start guide are available at http://github.com/merlyescalona/ngsphy.
Contact: [email protected], [email protected]
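The coverage-heterogeneity idea can be sketched as follows; this is not NGSphy's configuration syntax or API, just an illustration of drawing a per-individual, per-locus depth from user-defined distributions, with some loci declared off-target. All distribution parameters below are assumptions.

```python
import numpy as np

rng = np.random.default_rng(42)

def simulate_depths(n_individuals, n_loci, off_target_prob=0.1):
    """Experiment-wide depth scaled by per-individual and per-locus factors."""
    experiment_depth = rng.gamma(shape=20, scale=5)              # ~100x on average
    ind_factor = rng.lognormal(mean=0.0, sigma=0.25, size=n_individuals)
    locus_factor = rng.lognormal(mean=0.0, sigma=0.5, size=n_loci)
    depths = experiment_depth * np.outer(ind_factor, locus_factor)
    off_target = rng.random(n_loci) < off_target_prob
    depths[:, off_target] = 0.0                                  # uncaptured loci
    return depths.round().astype(int)

print(simulate_depths(n_individuals=4, n_loci=8))
```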


2015 ◽  
Author(s):  
Justin M Zook ◽  
David Catoe ◽  
Jennifer McDaniel ◽  
Lindsay Vang ◽  
Noah Spies ◽  
...  

The Genome in a Bottle Consortium, hosted by the National Institute of Standards and Technology (NIST), is creating reference materials and data for human genome sequencing, as well as methods for genome comparison and benchmarking. Here, we describe a large, diverse set of sequencing data for seven human genomes; five are current or candidate NIST Reference Materials. The pilot genome, NA12878, has been released as NIST RM 8398. We also describe data from two Personal Genome Project trios, one of Ashkenazim Jewish ancestry and one of Chinese ancestry. The data come from 12 technologies: BioNano Genomics, Complete Genomics paired-end and LFR, Ion Proton exome, Oxford Nanopore, Pacific Biosciences, SOLiD, 10X Genomics GemCode™ WGS, and Illumina exome and WGS paired-end, mate-pair, and synthetic long reads. Cell lines, DNA, and data from these individuals are publicly available. Therefore, we expect these data to be useful for revealing novel information about the human genome and improving sequencing technologies, SNP, indel, and structural variant calling, and de novo assembly.


Author(s):  
I. Manga ◽  
E. J. Garba ◽  
A. S. Ahmadu

Data compression refers to the process of representing data using fewer bits. Data compression can be lossless or lossy, and many schemes have been developed to perform either kind. Lossless data compression allows the original data to be reconstructed exactly from the compressed data, while lossy compression allows only an approximation of the original data to be reconstructed. The data to be compressed can be classified as image data, textual data, audio data, or even video content. Considerable research is being carried out in the area of image compression. This paper reviews the literature in the field of data compression and the techniques used to compress images losslessly. In conclusion, the paper reviews schemes used to compress an image using a single scheme or a combination of two or more schemes.
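As a concrete example of the lossless schemes discussed, run-length encoding (one of the simplest techniques, effective on image rows with long runs of identical pixel values) can be sketched as follows; decoding recovers the data exactly, which is what distinguishes lossless from lossy compression.

```python
def rle_encode(pixels):
    """Run-length encode a sequence of pixel values into (value, count) pairs."""
    encoded = []
    for value in pixels:
        if encoded and encoded[-1][0] == value:
            encoded[-1][1] += 1
        else:
            encoded.append([value, 1])
    return encoded

def rle_decode(encoded):
    """Reconstruct the original sequence exactly (lossless)."""
    return [value for value, count in encoded for _ in range(count)]

row = [255, 255, 255, 0, 0, 255, 255, 255, 255]
packed = rle_encode(row)
assert rle_decode(packed) == row
print(packed)   # [[255, 3], [0, 2], [255, 4]]
```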


2019 ◽  
Author(s):  
Ruibang Luo ◽  
Chak-Lim Wong ◽  
Yat-Sing Wong ◽  
Chi-Ian Tang ◽  
Chi-Man Liu ◽  
...  

Single-molecule sequencing technologies have emerged in recent years and revolutionized structural variant calling, complex genome assembly, and epigenetic mark detection. However, the lack of a highly accurate small variant caller has limited these new technologies from being more widely used. In this study, we present Clair, the successor to Clairvoyante, a program for fast and accurate germline small variant calling using single-molecule sequencing data. For ONT data, Clair achieves the best precision, recall and speed compared to several competing programs, including Clairvoyante, Longshot and Medaka. Through studying the missed variants and benchmarking intentionally overfitted models, we found that Clair may be approaching the limit of possible accuracy for germline small variant calling using pileup data and deep neural networks. Clair requires only a conventional CPU for variant calling and is an open-source project available at https://github.com/HKU-BAL/Clair.
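A pileup-based caller of this kind summarizes, for each genomic position, the base support observed across reads before feeding those summaries to a neural network. The sketch below shows only that counting step, on toy reads assumed to be aligned at the same offset; it is not Clair's actual feature encoding, which adds further channels (strand, qualities, indels, allele fractions).

```python
import numpy as np

BASES = {"A": 0, "C": 1, "G": 2, "T": 3}

def pileup_counts(reads, region_len):
    """Count A/C/G/T support per position from reads aligned at offset 0.

    Returns an array of shape (region_len, 4); a real caller would add
    strand, mapping-quality, indel and allele-fraction channels.
    """
    counts = np.zeros((region_len, 4), dtype=np.int32)
    for read in reads:
        for pos, base in enumerate(read[:region_len]):
            if base in BASES:
                counts[pos, BASES[base]] += 1
    return counts

reads = ["ACGTAC", "ACGAAC", "ACGTAC"]   # toy reads, no real alignment
print(pileup_counts(reads, region_len=6))
```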


2021 ◽  
Author(s):  
Hana Rozhoñová ◽  
Daniel Danciu ◽  
Stefan Stark ◽  
Gunnar Rätsch ◽  
André Kahles ◽  
...  

Recently developed single-cell DNA sequencing technologies enable whole-genome, amplification-free sequencing of thousands of cells at the cost of ultra-low coverage of the sequenced data (<0.05x per cell), which mostly limits their usage to the identification of copy number alterations (CNAs) in multi-megabase segments. Aside from CNA-based subclone detection, single-nucleotide variant (SNV)-based subclone detection may contribute to a more comprehensive view on intra-tumor heterogeneity. Due to the low coverage of the data, the identification of SNVs is only possible when superimposing the sequenced genomes of hundreds of genetically similar cells. Here we present Single-Cell Data Tumor Clusterer (SECEDO, Latin for 'to separate'), a new method to cluster tumor cells based solely on SNVs, inferred on ultra-low coverage single-cell DNA sequencing data. The core aspects of the method are an efficient Bayesian filtering of relevant loci and the exploitation of read overlaps and phasing information. We applied SECEDO to a synthetic dataset simulating 7,250 cells and eight tumor subclones from a single patient, and were able to accurately reconstruct the clonal composition, detecting 92.11% of the somatic SNVs, with the smallest clusters representing only 6.9% of the total population. When applied to four real single-cell sequencing datasets from a breast cancer patient, SECEDO was able to recover the major clonal composition in each dataset at the original sequencing depth of 0.03x per cell, an 8-fold improvement relative to the state of the art. Variant calling on the resulting clusters recovered more than twice as many SNVs with double the allelic ratio compared to calling on all cells together, demonstrating the utility of SECEDO. SECEDO is implemented in C++ and is publicly available at https://github.com/ratschlab/secedo.
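The locus-filtering step can be illustrated with a toy likelihood-ratio test on pooled base counts, a deliberate simplification of SECEDO's Bayesian filter; the error rate and threshold below are assumptions. Loci whose pooled reads are much better explained by a mixture of two alleles than by sequencing error alone are kept for clustering.

```python
import math

ERROR_RATE = 0.01      # assumed per-base sequencing error
THRESHOLD = 10.0       # assumed log-likelihood-ratio cutoff

def keep_locus(ref_count, alt_count):
    """Keep a locus if a 50/50 allele mixture explains the pooled counts
    much better than pure sequencing error does."""
    n = ref_count + alt_count
    if n == 0:
        return False
    # log-likelihood of the alt reads under "errors only"
    ll_error = alt_count * math.log(ERROR_RATE) + ref_count * math.log(1 - ERROR_RATE)
    # log-likelihood under a heterozygous-like 50/50 mixture
    ll_het = n * math.log(0.5)
    return (ll_het - ll_error) > THRESHOLD

print(keep_locus(ref_count=95, alt_count=5))    # likely errors  -> False
print(keep_locus(ref_count=60, alt_count=40))   # informative locus -> True
```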


2021 ◽  
Author(s):  
Jochen Bathke ◽  
Gesine Lühken

Background: Next-generation sequencing technologies are opening new doors to researchers. One application is the direct discovery of sequence variants that are causative for a phenotypic trait or a disease. The detection of an organism's alterations relative to a reference genome is known as variant calling, a computational task involving a complex chain of software applications. One key player in the field is the Genome Analysis Toolkit (GATK). The GATK Best Practices are a commonly referenced recipe for variant calling on human sequencing data. Still, the fact that the Best Practices are highly specialized for human sequencing data and are constantly evolving is often ignored. This hampers reproducibility and leads to the continuous reinvention of purported GATK Best Practice workflows. Results: Here we present an automated variant calling workflow, for the detection of SNPs and indels, that is broadly applicable to model as well as non-model diploid organisms. It is derived from the GATK Best Practice workflow for "Germline short variant discovery", without being focused on human sequencing data. The workflow has been highly optimized to parallelize data evaluation and to maximize the performance of individual applications in order to shorten the overall analysis time. Optimized Java garbage collection and heap size settings for the GATK applications SortSam, MarkDuplicates, HaplotypeCaller and GatherVcfs were determined by thorough benchmarking. In doing so, runtimes of an example data evaluation could be reduced from 67 h to less than 35 h. Conclusions: The demand for standardized variant calling workflows is growing in proportion to the dropping costs of next-generation sequencing methods. Our workflow fits well into this niche, offering automation, reproducibility and documentation of the variant calling process. Moreover, resource usage is reduced to a minimum. Variant calling projects should thereby become more standardized, further lowering the barrier for smaller institutions or groups.
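The kind of JVM benchmarking described can be sketched as follows. The GATK command line uses the documented `--java-options` mechanism and standard HaplotypeCaller arguments, but the specific heap sizes, GC thread counts and file names are assumptions for illustration, not the authors' benchmark setup.

```python
import subprocess
import time

# Sweep a few JVM settings for HaplotypeCaller and record wall-clock time.
JAVA_OPTION_SETS = [
    "-Xmx4g -XX:ParallelGCThreads=2",
    "-Xmx8g -XX:ParallelGCThreads=4",
    "-Xmx16g -XX:ParallelGCThreads=4",
]

for java_opts in JAVA_OPTION_SETS:
    heap = java_opts.split()[0][4:]          # e.g. "4g", used to tag the output
    cmd = [
        "gatk", "--java-options", java_opts,
        "HaplotypeCaller",
        "-R", "reference.fasta",             # hypothetical input files
        "-I", "sample.dedup.bam",
        "-O", f"sample.{heap}.g.vcf.gz",
        "-ERC", "GVCF",
    ]
    start = time.monotonic()
    subprocess.run(cmd, check=True)
    print(f"{java_opts:35s} {time.monotonic() - start:8.1f} s")
```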


Author(s):  
Dingwen Tao ◽  
Sheng Di ◽  
Hanqi Guo ◽  
Zizhong Chen ◽  
Franck Cappello

Because of the vast volume of data being produced by today's scientific simulations and experiments, lossy data compressors that allow user-controlled loss of accuracy during compression are a relevant solution for significantly reducing the data size. However, lossy compressor developers and users lack a tool to explore the features of scientific data sets and understand the data alteration after compression in a systematic and reliable way. To address this gap, we have designed and implemented a generic framework called Z-checker. On the one hand, Z-checker combines a battery of data analysis components for data compression. On the other hand, Z-checker is implemented as an open-source community tool to which users and developers can contribute and add new analysis components based on their additional analysis demands. In this article, we present a survey of existing lossy compressors. Then, we describe the design framework of Z-checker, in which we integrated evaluation metrics proposed in prior work as well as other analysis tools. Specifically, for lossy compressor developers, Z-checker can be used to characterize critical properties (such as entropy, distribution, power spectrum, principal component analysis, and autocorrelation) of any data set to improve compression strategies. For lossy compression users, Z-checker can report the compression quality (compression ratio and bit rate) and provide various global distortion analyses comparing the original data with the decompressed data (peak signal-to-noise ratio, normalized mean squared error, rate-distortion, rate-compression error, spectral, distribution, and derivatives) as well as statistical analyses of the compression error (maximum, minimum, and average error; autocorrelation; and distribution of errors). Z-checker can perform the analysis with either coarse granularity (throughout the whole data set) or fine granularity (by user-defined blocks), such that users and developers can select the best-fit, adaptive compressors for different parts of the data set. Z-checker features a visualization interface displaying all analysis results in addition to some basic views of the data sets, such as time series. To the best of our knowledge, Z-checker is the first tool designed to assess lossy compression comprehensively for scientific data sets.
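Several of the statistics listed (maximum error, normalized mean squared error, PSNR, autocorrelation of the error) are straightforward to compute; the sketch below works over NumPy arrays and is not Z-checker's own API.

```python
import numpy as np

def compression_error_report(original, decompressed):
    """Basic distortion statistics comparing original and decompressed data."""
    err = decompressed - original
    value_range = original.max() - original.min()
    mse = np.mean(err ** 2)
    return {
        "max_abs_error": float(np.max(np.abs(err))),
        "mean_error": float(err.mean()),
        "nmse": float(mse / np.mean(original ** 2)),
        "psnr_db": float(20 * np.log10(value_range) - 10 * np.log10(mse)),
        # lag-1 autocorrelation of the (flattened) error field
        "error_autocorr_lag1": float(
            np.corrcoef(err.ravel()[:-1], err.ravel()[1:])[0, 1]
        ),
    }

rng = np.random.default_rng(1)
data = np.sin(np.linspace(0, 20, 10_000))
lossy = data + rng.normal(scale=1e-3, size=data.size)   # stand-in for decompressed data
print(compression_error_report(data, lossy))
```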


Author(s):  
Samir Bandyopadhyay Sr ◽  
Shawni Dutta

BACKGROUND: In recent times, the COVID-19 coronavirus has had an immense impact on social and economic life across the world. The objective of this study is to determine whether it is feasible to use machine learning methods to evaluate how closely prediction results match the original data on confirmed, negative, released, and death cases of COVID-19. For this purpose, a verification method is proposed in this paper that uses deep-learning neural networks. In this framework, long short-term memory (LSTM) and gated recurrent unit (GRU) networks are combined for training on the dataset, and the prediction results are compared with the results reported by clinical doctors. The prediction results are validated against the original data based on predefined metrics. The experimental results show that the proposed approach is useful in generating suitable results for this critical disease outbreak, and it helps doctors recheck and further verify cases of the virus. The outbreak of the coronavirus grows exponentially, so it is difficult to control with limited clinical personnel handling a huge number of patients within a reasonable time. It is therefore necessary to build an automated model, based on a machine learning approach, as a corrective measure following the decisions of clinical doctors. It could be a promising supplementary confirmation method for frontline clinical doctors. The proposed method has a high prediction rate and works quickly toward accurate identification of the disease. The performance analysis shows that a high rate of accuracy is obtained by the proposed method. OBJECTIVE: Validation of COVID-19 disease. METHODS: Machine learning. RESULTS: 90%. CONCLUSIONS: The combined LSTM-GRU-based RNN model provides comparatively better results in terms of predicting confirmed, released, negative, and death cases on the data. This paper presents a novel method that can automatically recheck recorded cases of COVID-19. The data-driven RNN-based model provides an automated tool for confirming and estimating the current status of this pandemic, assessing its severity, and assisting government and health workers in making good policy decisions. It could be a promising supplementary rechecking method for frontline clinical doctors and is essential for improving the accuracy of the detection process. CLINICALTRIAL: 2020-04-03 3:22:36 PM
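A minimal Keras sketch of a combined LSTM-GRU recurrent model of the kind described follows; the layer sizes, the 14-day window, and the four-output head for confirmed/negative/released/death counts are assumptions, not the authors' exact architecture.

```python
import tensorflow as tf

WINDOW, FEATURES = 14, 4   # assumed: 14-day history of the four case counts

model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(WINDOW, FEATURES)),
    tf.keras.layers.LSTM(64, return_sequences=True),   # LSTM stage
    tf.keras.layers.GRU(32),                            # GRU stage
    tf.keras.layers.Dense(FEATURES),                    # next-day case counts
])
model.compile(optimizer="adam", loss="mse")
model.summary()

# model.fit(x_train, y_train, epochs=100, batch_size=16)  # with prepared sliding windows
```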

