VCFShark: how to squeeze a VCF file

Author(s):  
Sebastian Deorowicz ◽  
Agnieszka Danek ◽  
Marek Kokot

Abstract Summary Variant Call Format (VCF) files with the results of sequencing projects take up a lot of space. We propose VCFShark, which compresses VCF files up to an order of magnitude better than the de facto standards (gzipped VCF and BCF). The advantage over competitors is greatest when compressing VCF files containing large amounts of genotype data. Processing speeds of up to 100 MB/s and main-memory requirements below 30 GB allow our tool to be used on typical workstations, even for large datasets. Availability and implementation https://github.com/refresh-bio/vcfshark. Supplementary information Supplementary data are available at Bioinformatics online.

2020 ◽  
Author(s):  
Sebastian Deorowicz ◽  
Agnieszka Danek

Abstract Summary The VCF files with the results of sequencing projects take up a lot of space. We propose VCFShark, which squeezes them up to an order of magnitude better than the de facto standards (gzipped VCF and BCF). Availability and implementation https://github.com/refresh-bio/vcfshark. Contact [email protected] Supplementary information Supplementary data are available at the publisher's Web site.


2020 ◽  
Vol 36 (13) ◽  
pp. 4091-4092
Author(s):  
Divon Lan ◽  
Raymond Tobler ◽  
Yassine Souilmi ◽  
Bastien Llamas

Abstract Motivation genozip is a new lossless compression tool for Variant Call Format (VCF) files. By applying field-specific algorithms and fully utilizing the available computational hardware, genozip achieves the highest compression ratios amongst existing lossless compression tools known to the authors, at speeds comparable with the fastest multi-threaded compressors. Availability and implementation genozip is freely available to non-commercial users. It can be installed via conda-forge, Docker Hub, or downloaded from github.com/divonlan/genozip. Supplementary information Supplementary data are available at Bioinformatics online.
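Field-specific compression can be illustrated with a minimal sketch (the codec, data, and column layout below are illustrative assumptions, not genozip's actual algorithms): splitting a VCF into one stream per column groups self-similar data together before a general-purpose codec is applied.

```python
import zlib

# Three toy VCF data lines (CHROM POS ID REF ALT QUAL FILTER).
vcf_lines = [
    "1\t10177\trs367896724\tA\tAC\t100\tPASS",
    "1\t10235\trs540431307\tT\tTA\t100\tPASS",
    "1\t10352\trs555500075\tT\tTA\t100\tPASS",
]

# Transpose the records into one stream per column; each stream is
# highly self-similar, which a general-purpose codec models far better
# than the interleaved text.
columns = list(zip(*(line.split("\t") for line in vcf_lines)))
streams = [zlib.compress("\n".join(col).encode()) for col in columns]

# Lossless round trip: rebuild the original lines from the streams.
decoded = [zlib.decompress(s).decode().split("\n") for s in streams]
restored = ["\t".join(fields) for fields in zip(*decoded)]
assert restored == vcf_lines
```

On realistic inputs with many records, the per-column streams compress far better than the row-wise text; with only three toy lines the fixed zlib overhead dominates, so no size comparison is claimed here.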


Author(s):  
Shixu He ◽  
Zhibo Huang ◽  
Xiaohan Wang ◽  
Lin Fang ◽  
Shengkang Li ◽  
...  

Abstract Summary The rapid increase in data size in metagenome research has raised the demand for new tools that process large datasets efficiently. To accelerate metagenome profiling in the big-data scenario, we developed SOAPMetaS, a marker-gene-based multiple-sample metagenome profiling tool built on Apache Spark. SOAPMetaS demonstrates high performance and scalability on large datasets: it can process 80 samples of FASTQ data, totalling 416 GiB, in around half an hour, and the accuracy of its species profiling results is similar to that of MetaPhlAn2. SOAPMetaS can handle large volumes of metagenome data more efficiently than commonly used single-machine tools. Availability and implementation Source code is implemented in Java and freely available at https://github.com/BGI-flexlab/SOAPMetaS. Supplementary information Supplementary data are available at Bioinformatics online.
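The marker-gene counting that Spark parallelizes can be sketched in plain Python as a map step over read partitions followed by a reduce step that merges per-partition counts (the marker index, reads, and substring-matching rule below are toy assumptions, not SOAPMetaS's actual pipeline):

```python
from collections import Counter
from functools import reduce

# Toy marker-gene index: k-mer -> species (hypothetical markers).
markers = {"ACGT": "E. coli", "TTGC": "B. subtilis", "GGAA": "E. coli"}

# Reads partitioned across "workers", as Spark would shard an RDD.
partitions = [
    ["ACGTACGT", "TTGCAAAA"],
    ["GGAATTTT", "ACGTTTTT", "CCCCCCCC"],
]

def map_partition(reads):
    """Map step: count species whose marker k-mer occurs in each read."""
    counts = Counter()
    for read in reads:
        for kmer, species in markers.items():
            if kmer in read:
                counts[species] += 1
    return counts

# Reduce step: merge per-partition counters, like Spark's reduceByKey.
profile = reduce(lambda a, b: a + b, map(map_partition, partitions))
print(dict(profile))  # -> {'E. coli': 3, 'B. subtilis': 1}
```

Because each partition is counted independently and counters merge associatively, the same structure scales out across Spark executors.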


2017 ◽  
Author(s):  
Robert J. Vickerstaff ◽  
Richard J. Harrison

Abstract Summary Crosslink is genetic mapping software for outcrossing species, designed to run efficiently on large datasets by combining the best of existing tools with novel approaches. Tests show it runs much faster than several comparable programs while retaining similar accuracy. Availability and implementation Available under the GNU General Public License version 2 from https://github.com/eastmallingresearch/crosslink. Contact [email protected] Supplementary information Supplementary data are available at Bioinformatics online and from https://github.com/eastmallingresearch/crosslink/releases/tag/v0.5.


2020 ◽  
Vol 36 (16) ◽  
pp. 4519-4520
Author(s):  
Ying Zhou ◽  
Sharon R Browning ◽  
Brian L Browning

Abstract Motivation Estimation of pairwise kinship coefficients in large datasets is computationally challenging because the number of related individuals increases quadratically with sample size. Results We present IBDkin, a software package written in C for estimating kinship coefficients from identity by descent (IBD) segments. We use IBDkin to estimate kinship coefficients for 7.95 billion pairs of individuals in the UK Biobank who share at least one detected IBD segment with length ≥ 4 cM. Availability and implementation https://github.com/YingZhou001/IBDkin. Supplementary information Supplementary data are available at Bioinformatics online.
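A common way to turn IBD segment totals into a kinship coefficient is the textbook relation φ = (¼·IBD1 + ½·IBD2) / genome length. The sketch below uses that standard estimator with an illustrative genome length; IBDkin's actual thresholds and corrections differ.

```python
def kinship(ibd1_cm, ibd2_cm, genome_cm=3545.0):
    """Kinship estimate from total IBD1/IBD2 segment lengths (in cM).

    Uses the standard relation phi = (0.25*IBD1 + 0.5*IBD2) / L;
    the genome length L here is illustrative, not IBDkin's value.
    """
    return (0.25 * ibd1_cm + 0.5 * ibd2_cm) / genome_cm

# Parent-offspring: IBD1 covers the whole genome, no IBD2 -> phi = 0.25.
print(round(kinship(3545.0, 0.0), 3))
# Full siblings: ~half the genome IBD1, a quarter IBD2 -> phi = 0.25.
print(round(kinship(0.5 * 3545.0, 0.25 * 3545.0), 3))
```

The quadratic growth mentioned in the abstract comes from evaluating this per pair; IBDkin's contribution is doing so at biobank scale by iterating only over pairs that share detected segments.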


2020 ◽  
Vol 36 (14) ◽  
pp. 4163-4170
Author(s):  
Francisco Guil ◽  
José F Hidalgo ◽  
José M García

Abstract Motivation Elementary flux modes (EFMs) are a key tool for analyzing genome-scale metabolic networks, and several methods have been proposed to compute them. Among them, those based on solving linear programming (LP) problems are known to be very efficient if the main interest lies in computing large enough sets of EFMs. Results Here, we propose a new method called EFM-Ta that boosts the efficiency rate by analyzing the information provided by the LP solver. We base our method on a further study of the final tableau of the simplex method. By performing additional elementary steps and avoiding trivial solutions consisting of two cycles, we obtain many more EFMs for each LP problem posed, improving the efficiency rate of previously proposed methods by more than one order of magnitude. Availability and implementation Software is freely available at https://github.com/biogacop/Boost_LP_EFM. Contact [email protected] Supplementary information Supplementary data are available at Bioinformatics online.
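For reference, an elementary flux mode of a network with stoichiometric matrix S is usually defined as a support-minimal steady-state flux vector (a paraphrase of the standard textbook definition, not notation from this paper):

```latex
% v is an elementary flux mode if it satisfies steady state and
% irreversibility, and its support (set of active reactions) is minimal:
\begin{align*}
  S\,v &= 0, \\
  v_i &\ge 0 \quad \text{for every irreversible reaction } i, \\
  \nexists\, v' \ne 0 &: \; S v' = 0,\ \operatorname{supp}(v') \subsetneq \operatorname{supp}(v).
\end{align*}
```

Each LP problem in the methods discussed above yields one such vector; EFM-Ta extracts additional EFMs from the same solve by inspecting the final simplex tableau.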


2019 ◽  
Author(s):  
Sebastian Deorowicz

Abstract Motivation The amount of genomic data that needs to be stored is huge. Therefore, it is not surprising that a lot of work has been done in the field of specialized data compression of FASTQ files. The existing algorithms are, however, still imperfect and the best tools produce quite large archives. Results We present FQSqueezer, a novel compression algorithm for sequencing data able to process single- and paired-end reads of variable lengths. It is based on ideas from the famous prediction by partial matching (PPM) and dynamic Markov coder (DMC) algorithms known from the world of general-purpose compressors. The compression ratios are often tens of percent better than those offered by the state-of-the-art tools. Availability and implementation https://github.com/refresh-bio/FQSqueezer. Contact [email protected] Supplementary information Supplementary data are available at the publisher's Web site.
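The prediction-by-partial-matching idea behind the compressor can be sketched with a toy order-k context model over DNA bases (far simpler than FQSqueezer's coder; the class and data below are illustrative assumptions):

```python
from collections import defaultdict

class ContextModel:
    """Toy order-k context model in the spirit of PPM/DMC: count which
    symbol follows each k-length context, then predict the most frequent."""

    def __init__(self, k=2):
        self.k = k
        self.counts = defaultdict(lambda: defaultdict(int))

    def update(self, seq):
        """Record, for every position, the symbol seen after its context."""
        for i in range(self.k, len(seq)):
            self.counts[seq[i - self.k:i]][seq[i]] += 1

    def predict(self, context):
        """Most likely next base after `context`, or None if unseen."""
        seen = self.counts.get(context)
        if not seen:
            return None
        return max(seen, key=seen.get)

model = ContextModel(k=2)
model.update("ACGACGACGACG")
print(model.predict("AC"))  # -> 'G' (the only base ever seen after AC)
```

In a real compressor the predicted probabilities would drive an arithmetic coder, so accurate predictions translate directly into fewer output bits.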


2017 ◽  
Author(s):  
Max Salm ◽  
Sven-Eric Schelhorn ◽  
Lee Lancashire ◽  
Thomas Grombacher

Summary Patient-derived tumor xenograft (PDX) samples typically represent a mixture of mouse and human tissue. Variant call sets derived from sequencing such samples are commonly contaminated with false-positive variants that arise when mouse-derived reads are mapped to the human genome. pdxBlacklist is a novel approach designed to rapidly identify these false-positive variants and thus significantly improve variant call set quality. Availability pdxBlacklist is freely available on GitHub: https://github.com/MaxSalm/pdxBlacklist. Contact [email protected] Supplementary information Supplementary data are available.
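The core filtering step can be sketched as a set lookup (the variant keys and blacklist contents below are made up for illustration; this is not pdxBlacklist's implementation):

```python
# Precomputed blacklist of sites that attract mouse-derived reads,
# keyed by (chrom, pos, ref, alt). Contents are hypothetical.
blacklist = {("chr1", 12345, "A", "G"), ("chr7", 55242, "C", "T")}

calls = [
    ("chr1", 12345, "A", "G"),   # known mouse-read artifact: dropped
    ("chr2", 99887, "G", "T"),   # genuine candidate: kept
]

# Keep only calls absent from the blacklist.
filtered = [v for v in calls if v not in blacklist]
print(filtered)  # -> [('chr2', 99887, 'G', 'T')]
```

Because the lookup is O(1) per call, a blacklist scales to millions of sites without slowing down variant filtering.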


2019 ◽  
Vol 35 (22) ◽  
pp. 4791-4793 ◽  
Author(s):  
Sebastian Deorowicz ◽  
Agnieszka Danek

Abstract Summary Nowadays, large sequencing projects handle tens of thousands of individuals. The huge files summarizing the findings definitely require compression. We propose a tool able to compress large collections of genotypes almost 30% better than the best tool to date, i.e. squeezing a human genotype to less than 62 KB. It can also compress single samples in reference to an existing database, achieving comparable results. Availability and implementation https://github.com/refresh-bio/GTShark. Supplementary information Supplementary data are available at Bioinformatics online.
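Compressing a sample in reference to an existing database can be illustrated by storing only the sites where the sample differs from a panel consensus (a toy sketch; GTShark's actual coder is far more sophisticated):

```python
# Genotypes coded as allele counts (0/1/2). The consensus summarizes
# an existing panel; the values here are made up for illustration.
consensus = [0, 0, 1, 0, 2, 1, 0, 0]
sample    = [0, 0, 1, 1, 2, 1, 0, 0]

# Encode only the mismatching sites as (position, value) pairs;
# similar samples yield very few pairs, hence tiny archives.
diffs = [(i, g) for i, (g, c) in enumerate(zip(sample, consensus)) if g != c]
print(diffs)  # -> [(3, 1)]

# Decoding restores the sample exactly from consensus plus diffs.
decoded = list(consensus)
for i, g in diffs:
    decoded[i] = g
assert decoded == sample
```

The sparser the differences, the better the ratio, which is why reference-based coding shines on large, homogeneous cohorts.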


Author(s):  
Michael F Lin ◽  
Xiaodong Bai ◽  
William J Salerno ◽  
Jeffrey G Reid

Abstract Summary Variant Call Format (VCF), the prevailing representation for germline genotypes in population sequencing, suffers rapid size growth as larger cohorts are sequenced and more rare variants are discovered. We present Sparse Project VCF (spVCF), an evolution of VCF with judicious entropy reduction and run-length encoding, delivering a >10X size reduction for modern studies with minimal information loss. spVCF interoperates with VCF efficiently, including tabix-based random access. We demonstrate its effectiveness with the DiscovEHR and UK Biobank whole-exome sequencing cohorts. Availability and implementation Apache-licensed reference implementation: github.com/mlin/spVCF. Supplementary information Supplementary data are available at Bioinformatics online.
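The run-length-encoding idea can be sketched on a single row of genotype cells (a toy encoder illustrating the principle; the real spVCF format differs in detail):

```python
def rle(cells):
    """Collapse runs of identical cells into (cell, run_length) pairs."""
    out, i = [], 0
    while i < len(cells):
        j = i
        while j < len(cells) and cells[j] == cells[i]:
            j += 1
        out.append((cells[i], j - i))
        i = j
    return out

# A project-VCF row is dominated by identical reference-genotype cells,
# so long runs collapse to a single token.
row = ["0/0"] * 5 + ["0/1"] + ["0/0"] * 3
encoded = rle(row)
print(encoded)  # -> [('0/0', 5), ('0/1', 1), ('0/0', 3)]

# Decoding expands the runs back losslessly.
assert [g for g, n in encoded for _ in range(n)] == row
```

As cohorts grow, the fraction of reference-identical cells per row rises, which is why the savings exceed 10X on large studies.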

