VCFShark: how to squeeze a VCF file

Author(s):  
Sebastian Deorowicz ◽  
Agnieszka Danek ◽  
Marek Kokot

Abstract Summary Variant Call Format (VCF) files with the results of sequencing projects take up a lot of space. We propose VCFShark, which compresses VCF files up to an order of magnitude better than the de facto standards (gzipped VCF and BCF). The advantage over competitors is greatest when compressing VCF files containing large amounts of genotype data. Processing speeds of up to 100 MB/s and main-memory requirements below 30 GB allow our tool to be used on typical workstations, even for large datasets. Availability and implementation https://github.com/refresh-bio/vcfshark. Supplementary information Supplementary data are available at Bioinformatics online.

2020 ◽  
Author(s):  
Sebastian Deorowicz ◽  
Agnieszka Danek

Abstract Summary The VCF files with the results of sequencing projects take up a lot of space. We propose VCFShark, which squeezes them up to an order of magnitude better than the de facto standards (gzipped VCF and BCF). Availability and implementation https://github.com/refresh-bio/vcfshark. Contact [email protected] Supplementary information Supplementary data are available at the publisher's Web site.


2020 ◽  
Vol 36 (13) ◽  
pp. 4091-4092
Author(s):  
Divon Lan ◽  
Raymond Tobler ◽  
Yassine Souilmi ◽  
Bastien Llamas

Abstract Motivation genozip is a new lossless compression tool for Variant Call Format (VCF) files. By applying field-specific algorithms and fully utilizing the available computational hardware, genozip achieves the highest compression ratios amongst existing lossless compression tools known to the authors, at speeds comparable with the fastest multi-threaded compressors. Availability and implementation genozip is freely available to non-commercial users. It can be installed via conda-forge, Docker Hub, or downloaded from github.com/divonlan/genozip. Supplementary information Supplementary data are available at Bioinformatics online.
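Field-specific compression can be illustrated with a minimal sketch (the codec, data, and column layout below are illustrative assumptions, not genozip's actual algorithms): splitting a VCF into one stream per column groups self-similar data together before a general-purpose codec is applied.

```python
import zlib

# Three toy VCF data lines (CHROM POS ID REF ALT QUAL FILTER).
vcf_lines = [
    "1\t10177\trs367896724\tA\tAC\t100\tPASS",
    "1\t10235\trs540431307\tT\tTA\t100\tPASS",
    "1\t10352\trs555500075\tT\tTA\t100\tPASS",
]

# Transpose the records into one stream per column; each stream is
# highly self-similar, which a general-purpose codec models far better
# than the interleaved text.
columns = list(zip(*(line.split("\t") for line in vcf_lines)))
streams = [zlib.compress("\n".join(col).encode()) for col in columns]

# Lossless round trip: rebuild the original lines from the streams.
decoded = [zlib.decompress(s).decode().split("\n") for s in streams]
restored = ["\t".join(fields) for fields in zip(*decoded)]
assert restored == vcf_lines
```

On realistic inputs with many records, the per-column streams compress far better than the row-wise text; with only three toy lines the fixed zlib overhead dominates, so no size comparison is claimed here.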


Author(s):  
Shixu He ◽  
Zhibo Huang ◽  
Xiaohan Wang ◽  
Lin Fang ◽  
Shengkang Li ◽  
...  

Abstract Summary The rapid increase in data size in metagenome research has raised the demand for new tools that process large datasets efficiently. To accelerate metagenome profiling in the big-data scenario, we developed SOAPMetaS, a marker-gene-based multiple-sample metagenome profiling tool built on Apache Spark. SOAPMetaS demonstrates high performance and scalability on large datasets: it can process 80 samples of FASTQ data, totalling 416 GiB, in around half an hour, and the accuracy of its species profiling results is similar to that of MetaPhlAn2. SOAPMetaS can handle large volumes of metagenome data more efficiently than commonly used single-machine tools. Availability and implementation Source code is implemented in Java and freely available at https://github.com/BGI-flexlab/SOAPMetaS. Supplementary information Supplementary data are available at Bioinformatics online.
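The marker-gene counting that Spark parallelizes can be sketched in plain Python as a map step over read partitions followed by a reduce step that merges per-partition counts (the marker index, reads, and substring-matching rule below are toy assumptions, not SOAPMetaS's actual pipeline):

```python
from collections import Counter
from functools import reduce

# Toy marker-gene index: k-mer -> species (hypothetical markers).
markers = {"ACGT": "E. coli", "TTGC": "B. subtilis", "GGAA": "E. coli"}

# Reads partitioned across "workers", as Spark would shard an RDD.
partitions = [
    ["ACGTACGT", "TTGCAAAA"],
    ["GGAATTTT", "ACGTTTTT", "CCCCCCCC"],
]

def map_partition(reads):
    """Map step: count species whose marker k-mer occurs in each read."""
    counts = Counter()
    for read in reads:
        for kmer, species in markers.items():
            if kmer in read:
                counts[species] += 1
    return counts

# Reduce step: merge per-partition counters, like Spark's reduceByKey.
profile = reduce(lambda a, b: a + b, map(map_partition, partitions))
print(dict(profile))  # -> {'E. coli': 3, 'B. subtilis': 1}
```

Because each partition is counted independently and counters merge associatively, the same structure scales out across Spark executors.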


2017 ◽  
Author(s):  
Robert J. Vickerstaff ◽  
Richard J. Harrison

Abstract Summary Crosslink is genetic mapping software for outcrossing species, designed to run efficiently on large datasets by combining the best of existing tools with novel approaches. Tests show it runs much faster than several comparable programs while retaining similar accuracy. Availability and implementation Available under the GNU General Public License version 2 from https://github.com/eastmallingresearch/crosslink. Contact [email protected] Supplementary information Supplementary data are available at Bioinformatics online and from https://github.com/eastmallingresearch/crosslink/releases/tag/v0.5.


2020 ◽  
Vol 36 (16) ◽  
pp. 4519-4520
Author(s):  
Ying Zhou ◽  
Sharon R Browning ◽  
Brian L Browning

Abstract Motivation Estimation of pairwise kinship coefficients in large datasets is computationally challenging because the number of related individuals increases quadratically with sample size. Results We present IBDkin, a software package written in C for estimating kinship coefficients from identity by descent (IBD) segments. We use IBDkin to estimate kinship coefficients for 7.95 billion pairs of individuals in the UK Biobank who share at least one detected IBD segment with length ≥ 4 cM. Availability and implementation https://github.com/YingZhou001/IBDkin. Supplementary information Supplementary data are available at Bioinformatics online.
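A common way to turn IBD segment totals into a kinship coefficient is the textbook relation φ = (¼·IBD1 + ½·IBD2) / genome length. The sketch below uses that standard estimator with an illustrative genome length; IBDkin's actual thresholds and corrections differ.

```python
def kinship(ibd1_cm, ibd2_cm, genome_cm=3545.0):
    """Kinship estimate from total IBD1/IBD2 segment lengths (in cM).

    Uses the standard relation phi = (0.25*IBD1 + 0.5*IBD2) / L;
    the genome length L here is illustrative, not IBDkin's value.
    """
    return (0.25 * ibd1_cm + 0.5 * ibd2_cm) / genome_cm

# Parent-offspring: IBD1 covers the whole genome, no IBD2 -> phi = 0.25.
print(round(kinship(3545.0, 0.0), 3))
# Full siblings: ~half the genome IBD1, a quarter IBD2 -> phi = 0.25.
print(round(kinship(0.5 * 3545.0, 0.25 * 3545.0), 3))
```

The quadratic growth mentioned in the abstract comes from evaluating this per pair; IBDkin's contribution is doing so at biobank scale by iterating only over pairs that share detected segments.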


2020 ◽  
Vol 36 (14) ◽  
pp. 4163-4170
Author(s):  
Francisco Guil ◽  
José F Hidalgo ◽  
José M García

Abstract Motivation Elementary flux modes (EFMs) are a key tool for analyzing genome-scale metabolic networks, and several methods have been proposed to compute them. Among them, those based on solving linear programming (LP) problems are known to be very efficient if the main interest lies in computing large enough sets of EFMs. Results Here, we propose a new method called EFM-Ta that boosts the efficiency rate by analyzing the information provided by the LP solver. We base our method on a further study of the final tableau of the simplex method. By performing additional elementary steps and avoiding trivial solutions consisting of two cycles, we obtain many more EFMs for each LP problem posed, improving the efficiency rate of previously proposed methods by more than one order of magnitude. Availability and implementation Software is freely available at https://github.com/biogacop/Boost_LP_EFM. Contact [email protected] Supplementary information Supplementary data are available at Bioinformatics online.
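For reference, an elementary flux mode of a network with stoichiometric matrix S is usually defined as a support-minimal steady-state flux vector (a paraphrase of the standard textbook definition, not notation from this paper):

```latex
% v is an elementary flux mode if it satisfies steady state and
% irreversibility, and its support (set of active reactions) is minimal:
\begin{align*}
  S\,v &= 0, \\
  v_i &\ge 0 \quad \text{for every irreversible reaction } i, \\
  \nexists\, v' \ne 0 &: \; S v' = 0,\ \operatorname{supp}(v') \subsetneq \operatorname{supp}(v).
\end{align*}
```

Each LP problem in the methods discussed above yields one such vector; EFM-Ta extracts additional EFMs from the same solve by inspecting the final simplex tableau.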


2019 ◽  
Author(s):  
Sebastian Deorowicz

Abstract Motivation The amount of genomic data that needs to be stored is huge. Therefore, it is not surprising that a lot of work has been done in the field of specialized data compression of FASTQ files. The existing algorithms are, however, still imperfect and the best tools produce quite large archives. Results We present FQSqueezer, a novel compression algorithm for sequencing data able to process single- and paired-end reads of variable lengths. It is based on ideas from the famous prediction by partial matching (PPM) and dynamic Markov coder (DMC) algorithms known from the world of general-purpose compressors. The compression ratios are often tens of percent better than those offered by the state-of-the-art tools. Availability and implementation https://github.com/refresh-bio/FQSqueezer. Contact [email protected] Supplementary information Supplementary data are available at the publisher's Web site.
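The prediction-by-partial-matching idea behind the compressor can be sketched with a toy order-k context model over DNA bases (far simpler than FQSqueezer's coder; the class and data below are illustrative assumptions):

```python
from collections import defaultdict

class ContextModel:
    """Toy order-k context model in the spirit of PPM/DMC: count which
    symbol follows each k-length context, then predict the most frequent."""

    def __init__(self, k=2):
        self.k = k
        self.counts = defaultdict(lambda: defaultdict(int))

    def update(self, seq):
        """Record, for every position, the symbol seen after its context."""
        for i in range(self.k, len(seq)):
            self.counts[seq[i - self.k:i]][seq[i]] += 1

    def predict(self, context):
        """Most likely next base after `context`, or None if unseen."""
        seen = self.counts.get(context)
        if not seen:
            return None
        return max(seen, key=seen.get)

model = ContextModel(k=2)
model.update("ACGACGACGACG")
print(model.predict("AC"))  # -> 'G' (the only base ever seen after AC)
```

In a real compressor the predicted probabilities would drive an arithmetic coder, so accurate predictions translate directly into fewer output bits.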


2017 ◽  
Author(s):  
Max Salm ◽  
Sven-Eric Schelhorn ◽  
Lee Lancashire ◽  
Thomas Grombacher

Summary Patient-derived tumor xenograft (PDX) samples typically represent a mixture of mouse and human tissue. Variant call sets derived from sequencing such samples are commonly contaminated with false-positive variants that arise when mouse-derived reads are mapped to the human genome. pdxBlacklist is a novel approach designed to rapidly identify these false-positive variants and thus significantly improve variant call set quality. Availability pdxBlacklist is freely available on GitHub: https://github.com/MaxSalm/pdxBlacklist. Contact [email protected] Supplementary information Supplementary data are available.
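The core filtering step can be sketched as a set lookup (the variant keys and blacklist contents below are made up for illustration; this is not pdxBlacklist's implementation):

```python
# Precomputed blacklist of sites that attract mouse-derived reads,
# keyed by (chrom, pos, ref, alt). Contents are hypothetical.
blacklist = {("chr1", 12345, "A", "G"), ("chr7", 55242, "C", "T")}

calls = [
    ("chr1", 12345, "A", "G"),   # known mouse-read artifact: dropped
    ("chr2", 99887, "G", "T"),   # genuine candidate: kept
]

# Keep only calls absent from the blacklist.
filtered = [v for v in calls if v not in blacklist]
print(filtered)  # -> [('chr2', 99887, 'G', 'T')]
```

Because the lookup is O(1) per call, a blacklist scales to millions of sites without slowing down variant filtering.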


2019 ◽  
Vol 35 (22) ◽  
pp. 4791-4793 ◽  
Author(s):  
Sebastian Deorowicz ◽  
Agnieszka Danek

Abstract Summary Nowadays, large sequencing projects handle tens of thousands of individuals. The huge files summarizing the findings definitely require compression. We propose a tool able to compress large collections of genotypes almost 30% better than the best tool to date, i.e. squeezing a human genotype to less than 62 KB. It can also compress single samples in reference to an existing database, achieving comparable results. Availability and implementation https://github.com/refresh-bio/GTShark. Supplementary information Supplementary data are available at Bioinformatics online.
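Compressing a sample in reference to an existing database can be illustrated by storing only the sites where the sample differs from a panel consensus (a toy sketch; GTShark's actual coder is far more sophisticated):

```python
# Genotypes coded as allele counts (0/1/2). The consensus summarizes
# an existing panel; the values here are made up for illustration.
consensus = [0, 0, 1, 0, 2, 1, 0, 0]
sample    = [0, 0, 1, 1, 2, 1, 0, 0]

# Encode only the mismatching sites as (position, value) pairs;
# similar samples yield very few pairs, hence tiny archives.
diffs = [(i, g) for i, (g, c) in enumerate(zip(sample, consensus)) if g != c]
print(diffs)  # -> [(3, 1)]

# Decoding restores the sample exactly from consensus plus diffs.
decoded = list(consensus)
for i, g in diffs:
    decoded[i] = g
assert decoded == sample
```

The sparser the differences, the better the ratio, which is why reference-based coding shines on large, homogeneous cohorts.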


Author(s):  
Michael F Lin ◽  
Xiaodong Bai ◽  
William J Salerno ◽  
Jeffrey G Reid

Abstract Summary Variant Call Format (VCF), the prevailing representation for germline genotypes in population sequencing, suffers rapid size growth as larger cohorts are sequenced and more rare variants are discovered. We present Sparse Project VCF (spVCF), an evolution of VCF with judicious entropy reduction and run-length encoding, delivering a >10X size reduction for modern studies with minimal information loss. spVCF interoperates with VCF efficiently, including tabix-based random access. We demonstrate its effectiveness with the DiscovEHR and UK Biobank whole-exome sequencing cohorts. Availability and implementation Apache-licensed reference implementation: github.com/mlin/spVCF. Supplementary information Supplementary data are available at Bioinformatics online.
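The run-length-encoding idea can be sketched on a single row of genotype cells (a toy encoder illustrating the principle; the real spVCF format differs in detail):

```python
def rle(cells):
    """Collapse runs of identical cells into (cell, run_length) pairs."""
    out, i = [], 0
    while i < len(cells):
        j = i
        while j < len(cells) and cells[j] == cells[i]:
            j += 1
        out.append((cells[i], j - i))
        i = j
    return out

# A project-VCF row is dominated by identical reference-genotype cells,
# so long runs collapse to a single token.
row = ["0/0"] * 5 + ["0/1"] + ["0/0"] * 3
encoded = rle(row)
print(encoded)  # -> [('0/0', 5), ('0/1', 1), ('0/0', 3)]

# Decoding expands the runs back losslessly.
assert [g for g, n in encoded for _ in range(n)] == row
```

As cohorts grow, the fraction of reference-identical cells per row rises, which is why the savings exceed 10X on large studies.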

