variant call format
Recently Published Documents


TOTAL DOCUMENTS

28
(FIVE YEARS 18)

H-INDEX

6
(FIVE YEARS 2)

2021 ◽  
Author(s):  
Frank David Vogt ◽  
Gautam Shirsekar ◽  
Detlef Weigel

We present a new software package, vcf2gwas, for performing reproducible genome-wide association studies (GWAS). vcf2gwas is a Python API for bcftools, PLINK and GEMMA. Before the analysis can run, a traditional GWAS workflow requires the user to edit and format the genotype information from the commonly used Variant Call Format (VCF) file, along with the phenotype information. Post-processing steps involve summarizing and visualizing the analysis results. This workflow requires the user to work on the command line, edit text manually and know one or more programming or scripting languages, which can be time-consuming, especially when analyzing multiple phenotypes. Our package provides a convenient pipeline that performs all of these steps, reducing the GWAS workflow to a single command-line input without the need to edit or format the VCF file beforehand or to install any additional software. It also implements features such as reducing the dimensionality of the phenotypic space and analyzing the reduced dimensions, and comparing the significant variants in the results to specific genes or regions of interest. By integrating different tools for GWAS under one workflow, the package ensures reproducible GWAS while significantly reducing user effort.


2021 ◽  
Author(s):  
Erik Garrison ◽  
Zev N Kronenberg ◽  
Eric T Dawson ◽  
Brent S Pedersen ◽  
Pjotr Prins

Since its introduction in 2011, the variant call format (VCF) has been widely adopted for processing DNA and RNA variants in practically all population studies, as well as in somatic and germline mutation studies. VCF can represent single nucleotide variants, multi-nucleotide variants, insertions and deletions, and simple structural variants called against a reference genome. Here we present over 125 useful and widely used free and open source software tools and libraries that are part of the vcflib tools and bio-vcf. We also highlight the cyvcf2, hts-nim and slivar tools. They are typically applied to the comparison, filtering, normalisation, smoothing, annotation, statistics, visualisation and exporting of variants. Our tools run daily and invisibly in pipelines and countless shell scripts. They are part of a wider bioinformatics ecosystem, and we consider it very important to make them available as free and open source software to all bioinformaticians so that they can be deployed through software distributions such as Debian, GNU Guix and Bioconda. For example, by December 2020 vcflib had been installed over 40,000 times and bio-vcf over 15,000 times through Bioconda. We briefly discuss the design of VCF, the lessons learnt, and how we can address more complex variation that cannot easily be represented in VCF. All source code is published under free and open source software licenses and can be downloaded and installed from https://github.com/vcflib.
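The variant classes listed in the abstract above can be told apart from the REF and ALT columns of a VCF data line. The following is a minimal Python sketch, not code from vcflib or bio-vcf: the `VcfRecord` class and the `parse_vcf_line` and `classify` helpers are hypothetical names introduced here, the parser reads only the first five fixed columns, and the length-based classification ignores symbolic alleles such as `<DEL>`.

```python
from dataclasses import dataclass


@dataclass
class VcfRecord:
    chrom: str
    pos: int    # 1-based position on the reference
    ref: str
    alts: list  # one entry per comma-separated ALT allele


def parse_vcf_line(line):
    """Parse the first five tab-separated columns of one VCF data line."""
    fields = line.rstrip("\n").split("\t")
    chrom, pos, _variant_id, ref, alt = fields[:5]
    return VcfRecord(chrom, int(pos), ref, alt.split(","))


def classify(ref, alt):
    """Rough variant class from allele lengths (assumes literal base alleles)."""
    if len(ref) == 1 and len(alt) == 1:
        return "SNV"
    if len(ref) == len(alt):
        return "MNV"
    return "insertion" if len(alt) > len(ref) else "deletion"


# Illustrative data line; the values are examples, not taken from the papers above.
record = parse_vcf_line("20\t14370\trs6054257\tG\tA\t29\tPASS\tNS=3\n")
print(record.chrom, record.pos, classify(record.ref, record.alts[0]))
```

For production work, a maintained parser such as cyvcf2 (mentioned in the abstract) is the safer choice, since real VCF files also carry headers, INFO fields and per-sample columns that this sketch skips.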


Author(s):  
Sebastian Deorowicz ◽  
Agnieszka Danek ◽  
Marek Kokot

Summary: Variant Call Format (VCF) files holding the results of sequencing projects take up a lot of space. We propose VCFShark, which compresses VCF files up to an order of magnitude better than the de facto standards (gzipped VCF and BCF). Its advantage over competitors is greatest when compressing VCF files that contain large amounts of genotype data. Processing speeds of up to 100 MB/s and main memory requirements below 30 GB allow the tool to be used on typical workstations, even for large datasets. Availability and implementation: https://github.com/refresh-bio/vcfshark. Supplementary information: Supplementary data are available at Bioinformatics online.


2021 ◽  
Vol 22 (1) ◽  
Author(s):  
Matthew S. Lyon ◽  
Shea J. Andrews ◽  
Ben Elsworth ◽  
Tom R. Gaunt ◽  
Gibran Hemani ◽  
...  

GWAS summary statistics are fundamental for a variety of research applications, yet no common storage format has been widely adopted. Existing tabular formats store information about genetic variants and associations ambiguously or incompletely, lack essential metadata and are typically not indexed, which yields poor query performance and increases the possibility of errors in data interpretation and post-GWAS analyses. To address these issues, we adapted the variant call format to store GWAS summary statistics (GWAS-VCF) and developed open-source tools to use this format in downstream analyses. We provide open access to over 10,000 complete GWAS summary datasets converted to this format (https://gwas.mrcieu.ac.uk).


Author(s):  
Michael F Lin ◽  
Xiaodong Bai ◽  
William J Salerno ◽  
Jeffrey G Reid

Summary: Variant Call Format (VCF), the prevailing representation for germline genotypes in population sequencing, grows rapidly in size as larger cohorts are sequenced and more rare variants are discovered. We present Sparse Project VCF (spVCF), an evolution of VCF with judicious entropy reduction and run-length encoding, delivering >10x size reduction for modern studies with practically minimal information loss. spVCF interoperates with VCF efficiently, including tabix-based random access. We demonstrate its effectiveness with the DiscovEHR and UK Biobank whole-exome sequencing cohorts. Availability and implementation: Apache-licensed reference implementation at github.com/mlin/spVCF. Supplementary information: Supplementary data are available at Bioinformatics online.
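To illustrate why run-length encoding pays off on genotype data, here is a toy Python sketch. It is not spVCF's actual encoding, which is defined by its reference implementation; `rle_encode` and `rle_decode` are hypothetical helpers, and the genotype row is made up. The point is only that a row dominated by one repeated cell value (as happens at rare variants) collapses into a few pairs.

```python
from itertools import groupby


def rle_encode(cells):
    """Collapse runs of identical cells into (cell, run_length) pairs."""
    return [(cell, sum(1 for _ in run)) for cell, run in groupby(cells)]


def rle_decode(pairs):
    """Expand (cell, run_length) pairs back into the original list of cells."""
    cells = []
    for cell, count in pairs:
        cells.extend([cell] * count)
    return cells


# Made-up genotype row: one heterozygous carrier among reference-homozygous samples.
genotypes = ["0/0"] * 6 + ["0/1"] + ["0/0"] * 3
encoded = rle_encode(genotypes)
print(encoded)  # [('0/0', 6), ('0/1', 1), ('0/0', 3)]

# The encoding is lossless: decoding restores the original row.
assert rle_decode(encoded) == genotypes
```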


Author(s):  
Shan-Shan Dong ◽  
Wei-Ming He ◽  
Jing-Jing Ji ◽  
Chi Zhang ◽  
Yan Guo ◽  
...  

The triangular correlation heatmap, which visualizes the linkage disequilibrium (LD) pattern and haplotype block structure of SNPs, is a ubiquitous component of population-based genetic studies. However, current tools are time- and memory-consuming. Here, we developed LDBlockShow, an open-source tool for visualizing LD and haplotype blocks from variant call format files. It saves both time and memory: in a test dataset with 100 SNPs from 60 000 subjects, it was at least 10.60 times faster than other tools and used only 0.03–13.33% of the physical memory they required. In addition, it can generate figures that simultaneously display additional statistical context (e.g. association P-values) and genomic region annotations. It can also compress SVG files containing a large number of SNPs and supports subgroup analysis. This fast and convenient tool will facilitate the visualization of LD and haplotype blocks for geneticists.
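Each cell of such a heatmap holds a pairwise LD statistic between two SNPs. As a rough illustration, the Python sketch below computes r² as the squared Pearson correlation of genotype dosages (0, 1 or 2 copies of the alternate allele). This is a common approximation and an assumption on our part: the abstract does not state which LD measure or algorithm LDBlockShow uses, and `r_squared` is a hypothetical helper.

```python
def r_squared(x, y):
    """Squared Pearson correlation between two SNPs' genotype dosages."""
    if len(x) != len(y) or not x:
        raise ValueError("x and y must be non-empty and of equal length")
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return float("nan")  # monomorphic SNP: correlation is undefined
    return (cov * cov) / (var_x * var_y)


# Made-up dosages for four samples at three SNPs.
snp_a = [0, 1, 2, 1]
snp_b = [0, 1, 2, 1]   # identical to snp_a: complete LD
snp_c = [0, 2, 2, 0]

print(r_squared(snp_a, snp_b), r_squared(snp_a, snp_c))
```

A full heatmap would evaluate this for every SNP pair in the region, which is why runtime and memory grow quickly with the number of SNPs and samples.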


2020 ◽  
Author(s):  
Shan-Shan Dong ◽  
Wei-Ming He ◽  
Jing-Jing Ji ◽  
Chi Zhang ◽  
Yan Guo ◽  
...  

The triangular correlation heatmap, which visualizes the linkage disequilibrium (LD) pattern and haplotype block structure of SNPs, is a ubiquitous component of population-based genetic studies. However, current tools are time- and memory-consuming, and they do not support direct calculation from variant call format (VCF) files. Here we developed LDBlockShow, an open-source tool for visualizing LD and haplotype blocks from VCF files. It saves both time and memory: in a test dataset with 100 SNPs from 60,000 subjects, it was at least 429.03 times faster than other tools and used only 0.04–20.00% of the physical memory they required. In addition, it can generate figures that simultaneously display additional statistical context (e.g., association P values) and genomic region annotations. It can also compress SVG files containing a large number of SNPs and supports subgroup analysis. This fast and convenient tool should facilitate the visualization of LD and haplotype blocks for geneticists.


Author(s):  
Matthew Lyon ◽  
Shea J Andrews ◽  
Ben Elsworth ◽  
Tom R Gaunt ◽  
Gibran Hemani ◽  
...  

Genome-wide association study (GWAS) summary statistics are a fundamental resource for a variety of research applications [1–6]. Yet despite their widespread utility, no common storage format has been widely adopted, which hinders tool development and data sharing, analysis and integration. Existing tabular formats [7,8] often store information about genetic variants and their associations ambiguously or incompletely, and they lack essential metadata, increasing the possibility of errors in data interpretation and post-GWAS analyses. Additionally, data in these formats are typically not indexed, so the whole file must be read, which is computationally inefficient. To address these issues, we propose an adaptation of the variant call format [9] (GWAS-VCF) and have produced a suite of open-source tools for using this format in downstream analyses. Simulation studies show that GWAS-VCF is 9-46x faster than tabular alternatives when extracting variant(s) by genomic position. Our results demonstrate that GWAS-VCF provides a robust and performant solution for sharing, analysis and integration of GWAS data. We provide open access to over 10,000 complete GWAS summary datasets converted to this format (available from: https://gwas.mrcieu.ac.uk).
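The cost of reading a whole unindexed file, compared with a positional lookup, can be sketched in a few lines of Python. This is an in-memory analogy only, under stated assumptions: real GWAS-VCF files are queried through an on-disk index, the records below are made-up values, and `query_scan` and `query_indexed` are hypothetical helpers.

```python
from bisect import bisect_left, bisect_right

# Made-up summary statistics for one chromosome, sorted by position:
# (position, effect size, p-value).
records = [
    (1050, 0.02, 0.40),
    (2300, -0.11, 3e-9),
    (2310, 0.07, 1e-4),
    (9800, 0.01, 0.75),
]
positions = [position for position, _beta, _p in records]


def query_scan(start, end):
    """Unindexed lookup: every record is inspected, as with a plain table."""
    return [rec for rec in records if start <= rec[0] <= end]


def query_indexed(start, end):
    """Indexed lookup: binary search on sorted positions, then one slice."""
    lo = bisect_left(positions, start)
    hi = bisect_right(positions, end)
    return records[lo:hi]


# Both strategies return the same records; only the work done differs.
assert query_scan(2000, 3000) == query_indexed(2000, 3000)
print(query_indexed(2000, 3000))
```

The scan touches all records regardless of the region size, whereas the indexed version needs a logarithmic number of comparisons plus the size of the result.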


Author(s):  
Andrea Binatti ◽  
Silvia Bresolin ◽  
Stefania Bortoluzzi ◽  
Alessandro Coppe

Whole exome sequencing (WES) is a powerful approach for discovering sequence variants in cancer cells, but its time efficiency is limited by the complexity of WES data analysis and the issues it raises. Here we present iWhale, a customizable pipeline based on Docker and SCons that reliably detects somatic variants using three complementary callers (MuTect2, Strelka2 and VarScan2). The results are combined into a single variant call format file for each sample, and variants are annotated by integrating a wide range of information extracted from several reference databases, ultimately allowing variant and gene prioritization according to different criteria. iWhale allows users to conduct a complex series of WES analyses with a powerful yet customizable and easy-to-use tool that runs on most operating systems (macOS, GNU/Linux and Windows). The iWhale code is freely available at https://github.com/alexcoppe/iWhale and the Docker image can be downloaded from https://hub.docker.com/r/alexcoppe/iwhale.
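One simple way to combine calls from several callers is to take their union and record which callers support each variant. The Python sketch below illustrates that idea only; the abstract does not describe how iWhale merges its results, so the strategy, the `merge_calls` helper and the variant coordinates are all assumptions made for illustration.

```python
from collections import defaultdict


def merge_calls(calls_by_caller):
    """Union variant calls across callers, listing the callers behind each one."""
    support = defaultdict(set)
    for caller, variants in calls_by_caller.items():
        for variant in variants:
            support[variant].add(caller)
    # Sort variants and caller names so the output is deterministic.
    return {variant: sorted(callers) for variant, callers in sorted(support.items())}


# Made-up calls keyed by (chromosome, position, REF, ALT).
calls = {
    "MuTect2": [("chr1", 1000, "A", "T"), ("chr2", 500, "G", "C")],
    "Strelka2": [("chr1", 1000, "A", "T")],
    "VarScan2": [("chr1", 1000, "A", "T"), ("chr3", 42, "C", "CT")],
}
merged = merge_calls(calls)
for variant, callers in merged.items():
    print(variant, callers)
```

Keeping the list of supporting callers alongside each variant lets a later step prioritize calls made by two or three callers over calls made by only one.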

