ScaleQC: A Scalable Lossy to Lossless Solution for NGS Sequencing Data Compression


2020
Vol 36 (17)
pp. 4551-4559
Author(s):  
Rongshan Yu ◽  
Wenxian Yang

Abstract Motivation Per-base quality values in Next Generation Sequencing data take a significant portion of storage even after compression. Lossy compression technologies could further reduce the space used by quality values. However, in many applications, lossless compression is still desired. Hence, sequencing data in multiple file formats have to be prepared for different applications. Results We developed a scalable lossy to lossless compression solution for quality values named ScaleQC (Scalable Quality value Compression). ScaleQC provides bit-stream level scalability: the losslessly compressed bit-stream can be truncated to lower data rates without an expensive transcoding operation. Despite its scalability, ScaleQC achieves compression performance comparable to existing lossless and lossy compressors at both lossless and lossy data rates. Availability and implementation ScaleQC has been integrated with SAMtools as a special quality value encoding mode for CRAM. Its source code can be obtained from our integrated SAMtools (https://github.com/xmuyulab/samtools), which depends on our integrated HTSlib (https://github.com/xmuyulab/htslib). Supplementary information Supplementary data are available at Bioinformatics online.
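
To make the scalability claim concrete, here is a minimal bit-plane sketch of how a single bit-stream can serve both lossless and lossy decoding: planes are emitted most-significant first, and a prefix of the stream decodes to a coarser approximation. The function names and the midpoint reconstruction are illustrative only; this is the general principle, not ScaleQC's actual codec, which layers context modeling and entropy coding on top.

```python
# Minimal sketch of bit-plane scalable coding (NOT ScaleQC's codec).

def encode_bitplanes(qualities, nbits=6):
    """Split quality values (0..2^nbits-1) into bit-planes, MSB first."""
    return [[(q >> b) & 1 for q in qualities] for b in reversed(range(nbits))]

def decode_bitplanes(planes, nbits=6):
    """Decode a possibly truncated list of planes; missing low-order
    bits are reconstructed as the midpoint of the remaining interval."""
    values = [0] * len(planes[0])
    for i, plane in enumerate(planes):
        b = nbits - 1 - i
        for j, bit in enumerate(plane):
            values[j] |= bit << b
    missing = nbits - len(planes)
    if missing > 0:                       # lossy prefix: add the midpoint
        half = 1 << (missing - 1)
        values = [v + half for v in values]
    return values

quals = [40, 37, 2, 11, 40, 38]
planes = encode_bitplanes(quals)
assert decode_bitplanes(planes) == quals   # full stream decodes losslessly
print(decode_bitplanes(planes[:3]))        # truncated stream: coarse values
```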


2016
Author(s):  
Hirak Sarkar ◽  
Rob Patro

Abstract Motivation The past decade has seen an exponential increase in biological sequencing capacity, and there has been a simultaneous effort to help organize and archive some of the vast quantities of sequencing data that are being generated. While these developments are tremendous from the perspective of maximizing the scientific utility of available data, they come with heavy costs. The storage and transmission of such vast amounts of sequencing data is expensive. Results We present Quark, a semi-reference-based compression tool designed for RNA-seq data. Quark makes use of a reference sequence when encoding reads, but produces a representation that can be decoded independently, without the need for a reference. This allows Quark to achieve markedly better compression rates than existing reference-free schemes, while still relieving the burden of assuming a specific, shared reference sequence between the encoder and decoder. We demonstrate that Quark achieves state-of-the-art compression rates, and that, typically, only a small fraction of the reference sequence must be encoded along with the reads to allow reference-free decompression. Availability Quark is implemented in C++11 and is available under a GPLv3 license at www.github.com/COMBINE-lab/quark. Contact: [email protected]
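
The semi-reference idea admits a compact illustration. The toy sketch below stores mapped reads as pointers into the reference and ships only the reference intervals they actually touch, so decoding needs no external reference. Exact substring matching and the chunk format are simplifying assumptions; Quark's real encoder builds on quasi-mapping and far more compact encodings.

```python
# Toy semi-reference encoder/decoder (a simplification, not Quark's format).

def encode(reads, reference):
    tokens, intervals = [], []
    for r in reads:
        pos = reference.find(r)
        if pos >= 0:
            tokens.append(("ref", pos, len(r)))    # pointer into the reference
            intervals.append([pos, pos + len(r)])
        else:
            tokens.append(("lit", r))              # unmappable read kept verbatim
    intervals.sort()
    merged = []                                    # union of touched intervals
    for s, e in intervals:
        if merged and s <= merged[-1][1]:
            merged[-1][1] = max(merged[-1][1], e)
        else:
            merged.append([s, e])
    chunks = {s: reference[s:e] for s, e in merged}
    return tokens, chunks                          # chunks travel with the reads

def decode(tokens, chunks):
    out = []
    for t in tokens:
        if t[0] == "lit":
            out.append(t[1])
            continue
        _, pos, length = t
        for s, seq in chunks.items():              # find the covering chunk
            if s <= pos and pos + length <= s + len(seq):
                out.append(seq[pos - s:pos - s + length])
                break
    return out

ref = "ACGTACGTTTGACCAGTACGGGTA"
reads = ["ACGTTT", "GACCAG", "NNNN"]
tokens, chunks = encode(reads, ref)
assert decode(tokens, chunks) == reads             # no external reference used
```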


2018
Author(s):  
Mikhail Karasikov ◽  
Harun Mustafa ◽  
Amir Joudaki ◽  
Sara Javadzadeh-No ◽  
Gunnar Rätsch ◽  
...  

Abstract High-throughput DNA sequencing data is accumulating in public repositories, and efficient approaches for storing and indexing such data are in high demand. In recent research, several graph data structures have been proposed to represent large sets of sequencing data and to allow for efficient querying of sequences. In particular, the concept of labeled de Bruijn graphs has been explored by several groups. While there has been good progress towards representing the sequence graph in small space, methods for storing a set of labels on top of such graphs are still not sufficiently explored. It is also currently not clear how characteristics of the input data, such as the sparsity and correlations of labels, can help to inform the choice of method to compress the graph labeling. In this work, we present a new compression approach, Multi-BRWT, which is adaptive to different kinds of input data. We show an up to 29% improvement in compression performance over the basic BRWT method, and up to a 68% improvement over the current state-of-the-art for de Bruijn graph label compression. To put our results into perspective, we present a systematic analysis of five different state-of-the-art annotation compression schemes, evaluate key metrics on both artificial and real-world data and discuss how different data characteristics influence the compression performance. We show that the improvements of our new method can be robustly reproduced for different representative real-world datasets.
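
For readers unfamiliar with BRWT, the following minimal sketch shows the plain binary variant on a toy annotation matrix (rows as k-mers, columns as labels): each tree node stores the row-wise OR of its columns, and children store bits only for the rows set in the parent, which is where the compression comes from on sparse data. Multi-BRWT generalizes this with larger node arities and data-adaptive column groupings, which this sketch does not attempt.

```python
# Toy binary BRWT over the columns of a sparse 0/1 annotation matrix.

def brwt_build(columns):
    """columns: list of equal-length 0/1 column vectors (one per label)."""
    nrows = len(columns[0])
    bits = [1 if any(col[i] for col in columns) else 0 for i in range(nrows)]
    node = {"bits": bits, "ncols": len(columns)}
    if len(columns) > 1:
        keep = [i for i in range(nrows) if bits[i]]   # only set rows descend
        mid = len(columns) // 2
        left = [[col[i] for i in keep] for col in columns[:mid]]
        right = [[col[i] for i in keep] for col in columns[mid:]]
        node["children"] = (brwt_build(left), brwt_build(right))
    return node

def brwt_query(node, row):
    """Return the label ids set for one row (k-mer)."""
    if not node["bits"][row]:
        return []
    if "children" not in node:                 # leaf = a single column
        return [0]
    child_row = sum(node["bits"][:row])        # rank1: row id inside children
    left, right = node["children"]
    return (brwt_query(left, child_row)
            + [left["ncols"] + c for c in brwt_query(right, child_row)])

matrix_columns = [[1, 0, 0, 0], [1, 0, 1, 0], [0, 0, 1, 0], [0, 0, 0, 0]]
tree = brwt_build(matrix_columns)
print(brwt_query(tree, 0), brwt_query(tree, 2))   # -> [0, 1] [1, 2]
```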


2020
Vol 12 (18)
pp. 2955
Author(s):  
Miguel Hernández-Cabronero ◽  
Jordi Portell ◽  
Ian Blanes ◽  
Joan Serra-Sagristà

The capacity of the downlink channel is a major bottleneck for applications based on remote sensing hyperspectral imagery (HSI). Data compression is an essential tool to maximize the number of HSI scenes that can be retrieved on the ground. At the same time, energy and hardware constraints of spaceborne devices impose limitations on the complexity of practical compression algorithms. To avoid any distortion in the analysis of the HSI data, only lossless compression is considered in this study. This work aims at finding the most advantageous compression–complexity trade-off within the state of the art in HSI compression. To do so, a novel comparison of the most competitive spectral decorrelation approaches combined with the best-performing low-complexity compressors of the state of the art is presented. Compression performance and execution time results are obtained for a set of 47 HSI scenes produced by 14 different sensors in real remote sensing missions. Assuming only a limited amount of energy is available, the obtained data suggest that the FAPEC algorithm yields the best trade-off. When compared to the CCSDS 123.0-B-2 standard, FAPEC is 5.0 times faster and its compressed data rates are on average within 16% of the CCSDS standard. In scenarios where energy constraints can be relaxed, CCSDS 123.0-B-2 yields the best average compression results of all evaluated methods.
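
To give a feel for why spectral decorrelation matters, the sketch below implements the simplest possible decorrelator, a previous-band difference, and compares first-order entropies on synthetic data. It is a toy baseline under assumed synthetic inputs, not FAPEC or CCSDS 123.0-B-2, both of which use far stronger predictors and real entropy coders.

```python
# Toy previous-band spectral decorrelation and an entropy lower bound.
import numpy as np

def spectral_residuals(cube):
    """cube: (bands, rows, cols) integer array -> same-shape residuals."""
    res = cube.copy()
    res[1:] -= cube[:-1]          # band-to-band difference
    return res

def entropy_bits_per_sample(res):
    """First-order entropy: a lower bound on an ideal entropy coder's rate."""
    _, counts = np.unique(res, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

rng = rng = np.random.default_rng(0)
base = rng.integers(0, 1024, size=(1, 64, 64))            # spatial content
cube = (base + np.arange(50)[:, None, None] * 3           # band-to-band drift
        + rng.integers(-2, 3, size=(50, 64, 64))).astype(np.int32)
print(entropy_bits_per_sample(cube))                      # raw samples
print(entropy_bits_per_sample(spectral_residuals(cube)))  # much lower rate
```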


2017
Author(s):  
Daniel K. Putnam ◽  
Xiaotu Ma ◽ 
Stephen V. Rice ◽  
Yu Liu ◽  
Jinghui Zhang ◽  
...  

Abstract VCF2CNA is a web interface tool for copy-number alteration (CNA) analysis of VCF and other variant file formats. We applied it to 46 adult glioblastoma and 146 pediatric neuroblastoma samples sequenced on Illumina and Complete Genomics (CGI) platforms, respectively. VCF2CNA was highly consistent with a state-of-the-art algorithm using raw sequencing data (mean F1-score = 0.994) in high-quality glioblastoma samples and was robust to uneven coverage introduced by library artifacts. In the neuroblastoma set, VCF2CNA identified MYCN high-level amplifications in 31 of 32 clinically validated samples, compared to 15 found by CGI's HMM-based CNA model. These findings suggest that VCF2CNA is an accurate, efficient and platform-independent tool for CNA analyses that does not require access to raw sequence data.
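
The underlying principle, calling copy number from depth information already recorded in a variant file rather than from raw reads, can be sketched in a few lines. Everything below is a toy: the record layout, bin size, thresholds and the MYCN-like example are placeholders, and VCF2CNA's actual statistical model and normalization are more involved.

```python
# Toy copy-number calls from paired tumor/normal depths found in a VCF.
import math
from collections import defaultdict

def cna_from_depths(records, bin_size=1_000_000, gain=0.3, loss=-0.3):
    """records: (chrom, pos, normal_depth, tumor_depth) tuples, e.g. pulled
    from DP fields. Returns per-bin log2(tumor/normal) and a naive call.
    (Total-depth normalization is deliberately omitted in this sketch.)"""
    bins = defaultdict(lambda: [0, 0])
    for chrom, pos, ndp, tdp in records:
        b = bins[(chrom, pos // bin_size)]
        b[0] += ndp
        b[1] += tdp
    calls = {}
    for key, (ndp, tdp) in bins.items():
        if ndp == 0 or tdp == 0:
            continue
        lr = math.log2(tdp / ndp)
        calls[key] = (lr, "gain" if lr > gain else
                          "loss" if lr < loss else "neutral")
    return calls

# An amplified region vs a neutral one (illustrative coordinates only).
records = [("chr2", 16_080_000 + i * 1000, 30, 150) for i in range(50)]
records += [("chr1", 1_000_000 + i * 1000, 30, 31) for i in range(50)]
for key, (lr, call) in sorted(cna_from_depths(records).items()):
    print(key, round(lr, 2), call)
```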


2018
Vol 4 (12)
pp. 142
Author(s):  
Hongda Shen ◽  
Zhuocheng Jiang ◽  
W. Pan

Hyperspectral imaging (HSI) technology has been used for various remote sensing applications due to its excellent capability of monitoring regions of interest over a period of time. However, the large data volume of four-dimensional multitemporal hyperspectral imagery demands effective compression techniques. While conventional 3D hyperspectral data compression methods exploit only spatial and spectral correlations, we propose a simple yet effective predictive lossless compression algorithm that achieves significant gains in compression efficiency by also taking into account the temporal correlations inherent in the multitemporal data. We present an information-theoretic analysis to estimate the potential compression performance gain with varying configurations of context vectors. Extensive simulation results demonstrate the effectiveness of the proposed algorithm. We also provide in-depth discussions on how to construct the context vectors in the prediction model for both multitemporal HSI and conventional 3D HSI data.
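
As a rough illustration of a context vector that mixes all three correlation types, the sketch below predicts each sample as the average of its causal spatial, spectral and temporal neighbours and measures how much the residual spread shrinks. The averaging predictor and the synthetic scene are assumptions for illustration; the paper's predictor and context construction are more sophisticated.

```python
# Toy spatial+spectral+temporal prediction for a (time, bands, rows, cols)
# cube; residuals are what an entropy coder would actually compress.
import numpy as np

def residuals(data):
    pred = np.zeros_like(data)
    pred[:, :, :, 1:] += data[:, :, :, :-1]      # spatial neighbour (left)
    pred[:, 1:, :, :] += data[:, :-1, :, :]      # spectral neighbour (prev band)
    pred[1:, :, :, :] += data[:-1, :, :, :]      # temporal neighbour (prev time)
    # count how many causal neighbours each sample actually has
    weight = (np.zeros(data.shape, dtype=int)
              + (np.arange(data.shape[3]) > 0)
              + (np.arange(data.shape[1]) > 0)[:, None, None]
              + (np.arange(data.shape[0]) > 0)[:, None, None, None])
    return data - pred // np.maximum(weight, 1)  # average-of-neighbours predictor

rng = np.random.default_rng(1)
base = np.arange(32)[:, None] * 8 + np.arange(32)[None, :] * 8   # smooth scene
scene = (base
         + np.arange(20)[None, :, None, None] * 2                # spectral drift
         + np.arange(4)[:, None, None, None] * 5                 # temporal drift
         + rng.integers(-1, 2, size=(4, 20, 32, 32)))            # sensor noise
print(round(float(scene.std()), 1), round(float(residuals(scene).std()), 1))
```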


2021
Vol 7 (6)
pp. 96
Author(s):  
Alessandro Rossi ◽  
Marco Barbiero ◽  
Paolo Scremin ◽  
Ruggero Carli

Industrial 3D models are usually characterized by a large number of hidden faces, so simplifying them is very important. Visible-surface determination methods provide one of the most common solutions to the visibility problem. This study presents a robust technique to address the global visibility problem in object space that guarantees theoretical convergence to the optimal result. More specifically, we propose a strategy that, in a finite number of steps, determines whether each face of the mesh is globally visible or not. The proposed method is based on Plücker coordinates, which provide an efficient way to determine the intersection between a ray and a triangle. The algorithm does not require pre-calculations such as estimating the normal at each face, which makes it resilient to inconsistent normal orientation. We compared the performance of the proposed algorithm against a state-of-the-art technique. Results showed that our approach is more robust in terms of convergence to the maximum lossless compression.
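
The Plücker-coordinate ray/triangle test at the core of such methods is compact enough to sketch. In the standard formulation below (the paper's full algorithm adds ray sampling and convergence machinery around it), a line stabs a triangle when it winds the same way around all three directed edges; no face normal is ever needed, which is where the resilience to normal orientation comes from.

```python
# Plücker-coordinate line/triangle intersection test (standard formulation).

def cross(a, b):
    return (a[1]*b[2] - a[2]*b[1],
            a[2]*b[0] - a[0]*b[2],
            a[0]*b[1] - a[1]*b[0])

def sub(a, b):
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])

def dot(a, b):
    return a[0]*b[0] + a[1]*b[1] + a[2]*b[2]

def plucker_line(p, q):
    """Plücker coordinates (d, m) of the directed line through p then q."""
    return (sub(q, p), cross(p, q))

def side(l1, l2):
    """Permuted inner product; its sign tells how two lines wind."""
    (d1, m1), (d2, m2) = l1, l2
    return dot(d1, m2) + dot(m1, d2)

def line_stabs_triangle(origin, direction, tri, eps=1e-12):
    """True if the ray's supporting line passes through triangle tri.
    (A full ray test also checks the hit lies in front of the origin;
    degenerate edge-touching cases are left to the caller.)"""
    ray = (direction, cross(origin, direction))   # m = origin x direction
    a, b, c = tri
    s = (side(ray, plucker_line(a, b)),
         side(ray, plucker_line(b, c)),
         side(ray, plucker_line(c, a)))
    return all(v > eps for v in s) or all(v < -eps for v in s)

tri = ((0.0, 0.0, 1.0), (1.0, 0.0, 1.0), (0.0, 1.0, 1.0))
print(line_stabs_triangle((0.2, 0.2, 0.0), (0.0, 0.0, 1.0), tri))  # True
print(line_stabs_triangle((2.0, 2.0, 0.0), (0.0, 0.0, 1.0), tri))  # False
```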


PeerJ
2019
Vol 7
pp. e7853
Author(s):  
Yuchen Yan ◽  
Gengyun Niu ◽  
Yaoyao Zhang ◽  
Qianying Ren ◽  
Shiyu Du ◽  
...  

Labriocimbex sinicus Yan & Wei gen. et sp. nov. of Cimbicidae is described. The new genus is similar to Praia Andre and Trichiosoma Leach. A key to extant Holarctic genera of Cimbicinae is provided. To identify the phylogenetic placement of Cimbicidae, the mitochondrial genome of L. sinicus was annotated and characterized using high-throughput sequencing data. The complete mitochondrial genome of L. sinicus was obtained, with a length of 15,405 bp (GenBank: MH136623; SRA: SRR8270383) and a typical set of 37 genes (22 tRNAs, 13 PCGs and two rRNAs). The results demonstrated that all PCGs are initiated by an ATN codon and end with a TAA or truncated T stop codon. The study reveals that all tRNA genes have a typical clover-leaf secondary structure, except for trnS1. Remarkably, the secondary structures of the rrnS and rrnL of L. sinicus differ considerably from those of Corynis lateralis. Phylogenetic analyses verified the monophyly and positions of the three Cimbicidae species within the superfamily Tenthredinoidea and demonstrated the relationship (Tenthredinidae + Cimbicidae) + (Argidae + Pergidae) with strong nodal support. Furthermore, we found that the generic relationships of Cimbicidae revealed by the phylogenetic analyses based on COI genes agree closely with the systematic arrangement of the genera based on morphological characters. Phylogenetic trees based on two methods show that L. sinicus is the sister group of Praia with high support values. We suggest that Labriocimbex belongs to the tribe Trichiosomini of Cimbicinae based on adult morphology and molecular data. We also suggest promoting the subgenus Asitrichiosoma to a valid genus.
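
Checks like the start/stop-codon summary above are easy to reproduce from the deposited record. Below is a small Biopython sketch, assuming the MH136623 GenBank record has been saved locally as MH136623.gb; the file name and the exact verdict logic are illustrative, not the authors' annotation pipeline.

```python
# Verify start/stop codons of each annotated CDS in a local GenBank file
# (assumed downloaded as MH136623.gb; illustrative check, not the
# authors' pipeline).
from Bio import SeqIO

record = SeqIO.read("MH136623.gb", "genbank")
for feat in record.features:
    if feat.type != "CDS":
        continue
    gene = feat.qualifiers.get("gene", ["?"])[0]
    seq = str(feat.extract(record.seq))            # strand-aware extraction
    start = seq[:3]
    tail = seq[len(seq) - len(seq) % 3:] or seq[-3:]   # full or truncated stop
    ok = start.startswith("AT") and tail in ("TAA", "TA", "T")
    print(f"{gene}: start={start} stop={tail} {'ok' if ok else 'check manually'}")
```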


2017
Vol 3
pp. e137
Author(s):  
Mona Alshahrani ◽  
Othman Soufan ◽  
Arturo Magana-Mora ◽  
Vladimir B. Bajic

Background Artificial neural networks (ANNs) are a robust class of machine learning models and are a frequent choice for solving classification problems. However, determining the structure of an ANN is not trivial, as a large number of weights (connection links) may lead to overfitting the training data. Although several ANN pruning algorithms have been proposed for the simplification of ANNs, these algorithms are not able to efficiently cope with the intricate ANN structures required for complex classification problems. Methods We developed DANNP, a web-based tool that implements parallelized versions of several ANN pruning algorithms. The DANNP tool uses a modified version of the Fast Compressed Neural Network software implemented in C++ to considerably reduce the running time of the ANN pruning algorithms we implemented. In addition to the performance evaluation of the pruned ANNs, we systematically compared the set of features that remained in the pruned ANN with those obtained by different state-of-the-art feature selection (FS) methods. Results Although the ANN pruning algorithms are not entirely parallelizable, DANNP was able to speed up ANN pruning by up to eight times on a 32-core machine, compared to the serial implementations. To assess the impact of ANN pruning by the DANNP tool, we used 16 datasets from different domains. In eight of the 16 datasets, DANNP significantly reduced the number of weights by 70%–99% while maintaining competitive or better model performance compared to the unpruned ANN. Finally, we used a naïve Bayes classifier trained on the features selected as a byproduct of the ANN pruning and demonstrated that its accuracy is comparable to that obtained by classifiers trained with features selected by several state-of-the-art FS methods. The FS ranking methodology proposed in this study allows users to identify the most discriminant features of the problem at hand. To the best of our knowledge, DANNP (publicly available at www.cbrc.kaust.edu.sa/dannp) is the only available, online-accessible tool that provides multiple parallelized ANN pruning options. Datasets and DANNP code can be obtained at www.cbrc.kaust.edu.sa/dannp/data.php and https://doi.org/10.5281/zenodo.1001086.
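
To fix ideas about how pruning doubles as feature selection, here is a toy magnitude-pruning sketch for a single dense layer: inputs whose every outgoing weight is removed are effectively deselected. This simple rule is an assumption for illustration only; DANNP implements several classical pruning algorithms with a parallelized C++ backend, none of which reduce to it.

```python
# Toy magnitude pruning of one weight matrix (NOT DANNP's algorithms).
import numpy as np

def prune_by_magnitude(W, fraction):
    """Zero out the given fraction of smallest-magnitude weights and
    return the pruned matrix plus a boolean mask of surviving links."""
    flat = np.abs(W).ravel()
    k = int(fraction * flat.size)
    threshold = np.partition(flat, k)[k] if k > 0 else -np.inf
    mask = np.abs(W) >= threshold
    return W * mask, mask

rng = np.random.default_rng(0)
W = rng.normal(size=(64, 32))            # 64 input features, 32 hidden units
Wp, mask = prune_by_magnitude(W, 0.9)
# inputs with no surviving outgoing weight are deselected features,
# the byproduct used for feature ranking in the abstract above
dropped = np.where(~mask.any(axis=1))[0]
print(f"{mask.mean():.2f} of weights kept, {dropped.size} inputs dropped")
```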


2020
Author(s):  
Xun Zhu ◽  
Ti-Cheng Chang ◽  
Richard Webby ◽  
Gang Wu

Abstract idCOV is a phylogenetic pipeline for quickly identifying the clades of SARS-CoV-2 virus isolates from raw sequencing data based on a selected clade-defining marker list. Using a public dataset, we show that idCOV makes calls equivalent to those annotated by Nextstrain.org on all three common clade systems, using user-uploaded FastQ files directly. Web and equivalent command-line interfaces are available. It can be deployed on any Linux environment, including a personal computer, HPC and the cloud. The source code is available at https://github.com/xz-stjude/idcov. Installation documentation can be found at https://github.com/xz-stjude/idcov/blob/master/README.md.
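
The marker-matching step at the heart of such a pipeline is simple to sketch: score a sample's variant calls against each clade's defining markers. The clade names, positions and bases below are placeholders rather than real SARS-CoV-2 clade definitions, and idCOV also handles the upstream alignment and variant calling from FastQ, which this sketch omits.

```python
# Toy clade call from a variant set and a clade-defining marker table
# (placeholder markers; not idCOV's actual tables or scoring).

CLADE_MARKERS = {
    "cladeA": {241: "T", 3037: "T", 23403: "G"},
    "cladeB": {241: "T", 1059: "T", 25563: "T"},
}

def call_clade(variants, markers=CLADE_MARKERS, min_fraction=1.0):
    """variants: {position: alt_base} observed in the sample."""
    best, best_score = None, 0.0
    for clade, defs in markers.items():
        hits = sum(1 for pos, base in defs.items() if variants.get(pos) == base)
        score = hits / len(defs)
        if score > best_score:
            best, best_score = clade, score
    return best if best_score >= min_fraction else None

sample = {241: "T", 3037: "T", 23403: "G", 14408: "T"}
print(call_clade(sample))   # -> cladeA
```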

