Variant calling on the GRCh38 assembly with the data from phase three of the 1000 Genomes Project

2019 ◽  
Vol 4 ◽  
pp. 50 ◽  
Author(s):  
Ernesto Lowy-Gallego ◽  
Susan Fairley ◽  
Xiangqun Zheng-Bradley ◽  
Magali Ruffier ◽  
Laura Clarke ◽  
...  

We present biallelic SNVs called from 2,548 samples across 26 populations from the 1000 Genomes Project, called directly on GRCh38. We believe this will be a useful reference resource for those using GRCh38, representing an improvement over the “lift-overs” of the 1000 Genomes Project data that have been available to date and providing a resource necessary for the full adoption of GRCh38 by the community. Here, we describe how the call set was created and provide benchmarking data describing how our call set compares to that produced by the final phase of the 1000 Genomes Project on GRCh37.

2019 ◽  
Vol 4 ◽  
pp. 50 ◽  
Author(s):  
Ernesto Lowy-Gallego ◽  
Susan Fairley ◽  
Xiangqun Zheng-Bradley ◽  
Magali Ruffier ◽  
Laura Clarke ◽  
...  

We present a set of biallelic SNVs and INDELs, from 2,548 samples spanning 26 populations from the 1000 Genomes Project, called de novo on GRCh38. We believe this will be a useful reference resource for those using GRCh38. It represents an improvement over the “lift-overs” of the 1000 Genomes Project data that have been available to date by encompassing all of the GRCh38 primary assembly autosomes and pseudo-autosomal regions, including novel, medically relevant loci. Here, we describe how the data set was created and benchmark our call set against that produced by the final phase of the 1000 Genomes Project on GRCh37 and the lift-over of that data to GRCh38.


PLoS ONE ◽  
2021 ◽  
Vol 16 (7) ◽  
pp. e0254363
Author(s):  
Aji John ◽  
Kathleen Muenzen ◽  
Kristiina Ausmees

Advances in whole-genome sequencing have greatly reduced the cost and time of obtaining raw genetic information, but the computational requirements of analysis remain a challenge. Serverless computing has emerged as an alternative to using dedicated compute resources, but its utility has not been widely evaluated for standardized genomic workflows. In this study, we define and execute a best-practice joint variant calling workflow using the SWEEP workflow management system. We present an analysis of performance and scalability, and discuss the utility of the serverless paradigm for executing workflows in the field of genomics research. The GATK best-practice short germline joint variant calling pipeline was implemented as a SWEEP workflow comprising 18 tasks. The workflow was executed on Illumina paired-end read samples from the European and African super populations of the 1000 Genomes project phase III. Cost and runtime increased linearly with increasing sample size, although runtime was driven primarily by a single task for larger problem sizes. Execution took a minimum of around 3 hours for 2 samples, up to nearly 13 hours for 62 samples, with costs ranging from $2 to $70.


2015 ◽  
Vol 5 (1) ◽  
Author(s):  
Steven Gazal ◽  
Mourad Sahbatou ◽  
Marie-Claude Babron ◽  
Emmanuelle Génin ◽  
Anne-Louise Leutenegger

Author(s):  
Taedong Yun ◽  
Helen Li ◽  
Pi-Chuan Chang ◽  
Michael F. Lin ◽  
Andrew Carroll ◽  
...  

Population-scale sequenced cohorts are foundational resources for genetic analyses, but processing raw reads into analysis-ready variants remains challenging. Here we introduce an open-source cohort variant-calling method using the highly accurate caller DeepVariant and scalable merging tool GLnexus. We optimized callset quality based on benchmark samples and Mendelian consistency across many sample sizes and sequencing specifications, resulting in substantial quality improvements and cost savings over existing best practices. We further evaluated our pipeline in the 1000 Genomes Project (1KGP) samples, showing superior quality metrics and imputation performance. We publicly release the 1KGP callset to foster development of broad studies of genetic variation.


2012 ◽  
Vol 9 (5) ◽  
pp. 459-462 ◽  
Author(s):  
Laura Clarke ◽  
Xiangqun Zheng-Bradley ◽  
Richard Smith ◽  
Eugene Kulesha ◽  
...  

2020 ◽  
Author(s):  
Peter Pfaffelhuber ◽  
Elisabeth Sester-Huss ◽  
Franz Baumdicker ◽  
Jana Naue ◽  
Sabine Lutz-Bonengel ◽  
...  

The inference of biogeographic ancestry (BGA) has become a focus of forensic genetics. Mis-inference of BGA can have profound unwanted consequences for investigations and society. We show that recent admixture can lead to misclassification and erroneous inference of ancestry proportions, using state-of-the-art analysis tools with (i) simulations, (ii) 1000 Genomes Project data, and (iii) two individuals analyzed using the ForenSeq DNA Signature Prep Kit. Subsequently, we extend existing tools for estimation of individual ancestry (IA) by allowing for different IA in both parents, leading to estimates of parental individual ancestry (PIA), and a statistical test for recent admixture. Estimation of PIA outperforms IA in most scenarios of recent admixture. Furthermore, additional information about parental ancestry can be acquired with PIA that may guide casework.


PLoS ONE ◽  
2014 ◽  
Vol 9 (1) ◽  
pp. e85899 ◽  
Author(s):  
Giuseppe Indolfi ◽  
Giusi Mangone ◽  
Elisa Bartolini ◽  
Gabriella Nebbia ◽  
Pier Luigi Calvo ◽  
...  

2019 ◽  
Vol 48 (D1) ◽  
pp. D941-D947 ◽  
Author(s):  
Susan Fairley ◽  
Ernesto Lowy-Gallego ◽  
Emily Perry ◽  
Paul Flicek

To sustain and develop the largest fully open human genomic resources, the International Genome Sample Resource (IGSR) (https://www.internationalgenome.org) was established. It is built on the foundation of the 1000 Genomes Project, which created the largest openly accessible catalogue of human genomic variation developed from samples spanning five continents. IGSR (i) maintains access to 1000 Genomes Project resources, (ii) updates 1000 Genomes Project resources to the GRCh38 human reference assembly, (iii) adds new data generated on 1000 Genomes Project cell lines, (iv) shares data from samples with a similarly open consent to increase the number of samples and populations represented in the resources and (v) provides support to users of these resources. Among recent updates are the release of variation calls from 1000 Genomes Project data calculated directly on GRCh38 and the addition of high coverage sequence data for the 2504 samples in the 1000 Genomes Project phase three panel. The data portal, which facilitates web-based exploration of the IGSR resources, has been updated to include samples which were not part of the 1000 Genomes Project and now presents a unified view of data and samples across almost 5000 samples from multiple studies. All data is fully open and publicly accessible.

