Scaling accurate genetic variant discovery to tens of thousands of samples

Author(s):  
Ryan Poplin ◽  
Valentin Ruano-Rubio ◽  
Mark A. DePristo ◽  
Tim J. Fennell ◽  
Mauricio O. Carneiro ◽  
...  

Abstract Comprehensive disease gene discovery in both common and rare diseases will require the efficient and accurate detection of all classes of genetic variation across tens to hundreds of thousands of human samples. We describe here a novel assembly-based approach to variant calling, the GATK HaplotypeCaller (HC) and Reference Confidence Model (RCM), that determines genotype likelihoods independently per sample but performs joint calling across all samples within a project simultaneously. We show by calling over 90,000 samples from the Exome Aggregation Consortium (ExAC) that, in contrast to other algorithms, the HC-RCM scales efficiently to very large sample sizes without loss in accuracy, and that the accuracy of indel variant calling is superior in comparison to other algorithms. More importantly, the HC-RCM produces a fully squared-off matrix of genotypes across all samples at every genomic position being investigated. The HC-RCM is a novel, scalable, assembly-based algorithm with abundant applications for population genetics and clinical studies.
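The idea of a "squared-off" genotype matrix can be illustrated with a minimal sketch. This is not GATK's implementation: the data layout (a per-sample dictionary of variant calls plus a list of reference-confidence intervals) and the names `genotype_at` and `square_off` are assumptions made for illustration only. Each sample is processed independently, and the joint step fills in a genotype for every sample at every site where any sample carries a variant.

```python
from typing import Dict, List, Tuple

# Hypothetical per-sample record: (variant calls, reference-confidence blocks).
# Variant calls map position -> genotype string; blocks are inclusive intervals
# where the sample is confidently homozygous reference.
SampleRecord = Tuple[Dict[int, str], List[Tuple[int, int]]]


def genotype_at(pos: int, variants: Dict[int, str],
                ref_blocks: List[Tuple[int, int]]) -> str:
    """Return one sample's genotype at a position."""
    if pos in variants:
        return variants[pos]
    for start, end in ref_blocks:
        if start <= pos <= end:
            return "0/0"  # covered by a reference-confidence block
    return "./."  # no evidence at this position: no-call


def square_off(samples: Dict[str, SampleRecord]):
    """Build a site-by-sample genotype matrix with no missing cells."""
    # A site is any position where at least one sample has a variant call.
    sites = sorted({pos for variants, _ in samples.values() for pos in variants})
    matrix = {
        pos: {name: genotype_at(pos, variants, blocks)
              for name, (variants, blocks) in samples.items()}
        for pos in sites
    }
    return sites, matrix


if __name__ == "__main__":
    cohort = {
        "s1": ({100: "0/1"}, [(1, 99), (101, 200)]),
        "s2": ({150: "1/1"}, [(1, 120)]),
        "s3": ({}, []),
    }
    sites, matrix = square_off(cohort)
    for pos in sites:
        print(pos, matrix[pos])
```

In this toy layout, a sample without a variant at a site is distinguished as either confidently reference (`0/0`) or uncalled (`./.`), which is the distinction a reference-confidence record makes possible.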

2016 ◽  
Vol 48 (9) ◽  
pp. 1071-1076 ◽  
Author(s):  
Eric M Scott ◽  
Anason Halees ◽  
Yuval Itan ◽  
Emily G Spencer ◽  
...  

2019 ◽  
Vol 36 (8) ◽  
pp. 2569-2571 ◽  
Author(s):  
Cinta Pegueroles ◽  
Verónica Mixão ◽  
Laia Carreté ◽  
Manu Molina ◽  
Toni Gabaldón

Abstract Summary: An increasing number of phased (i.e. with resolved haplotypes) reference genomes are available. However, most genetic variant calling tools do not explicitly account for haplotype structure. Here, we present HaploTypo, a pipeline tailored to resolve haplotypes in genetic variation analyses. HaploTypo infers the haplotype correspondence for each heterozygous variant called on a phased reference genome. Availability and implementation: HaploTypo is implemented in Python 2.7 and Python 3.5, and is freely available at https://github.com/gabaldonlab/haplotypo, and as a Docker image. Supplementary information: Supplementary data are available at Bioinformatics online.


2020 ◽  
Vol 28 (8) ◽  
pp. 1034-1043 ◽  
Author(s):  
Holger Hengel ◽  
Rebecca Buchert ◽  
Marc Sturm ◽  
Tobias B. Haack ◽  
Yvonne Schelling ◽  
...  

2021 ◽  
Vol 22 (7) ◽  
pp. 3311
Author(s):  
Satish Kumar ◽  
Joanne E. Curran ◽  
Kashish Kumar ◽  
Erica DeLeon ◽  
Ana C. Leandro ◽  
...  

The in vitro modeling of cardiac development and cardiomyopathies in human induced pluripotent stem cell (iPSC)-derived cardiomyocytes (CMs) provides opportunities to aid the discovery of genetic, molecular, and developmental changes that are causal to, or influence, cardiomyopathies and related diseases. To better understand the functional and disease modeling potential of iPSC-differentiated CMs and to provide a proof of principle for large, epidemiological-scale disease gene discovery approaches into cardiomyopathies, well-characterized CMs, generated from validated iPSCs of 12 individuals who belong to four sibships, and one of whom reported a major adverse cardiac event (MACE), were analyzed by genome-wide mRNA sequencing. The generated CMs expressed CM-specific genes and were highly concordant in their total expressed transcriptome across the 12 samples (correlation coefficient at 95% CI = 0.92 ± 0.02). The functional annotation and enrichment analysis of the 2116 genes that were significantly upregulated in CMs suggest that generated CMs have a transcriptomic and functional profile of immature atrial-like CMs; however, the CMs-upregulated transcriptome also showed high overlap and significant enrichment in primary cardiomyocyte (p-value = 4.36 × 10⁻⁹), primary heart tissue (p-value = 1.37 × 10⁻⁴¹) and cardiomyopathy (p-value = 1.13 × 10⁻²¹) associated gene sets. Modeling the effect of MACE in the generated CMs-upregulated transcriptome identified gene expression phenotypes consistent with the predisposition of the MACE-affected sibship to arrhythmia, prothrombotic, and atherosclerosis risk.


2020 ◽  
Author(s):  
Viacheslav Mylka ◽  
Jeroen Aerts ◽  
Irina Matetovici ◽  
Suresh Poovathingal ◽  
Niels Vandamme ◽  
...  

Abstract Multiplexing of samples in single-cell RNA-seq studies allows significant reduction of experimental costs, straightforward identification of doublets, increased cell throughput, and reduction of sample-specific batch effects. Recently published multiplexing techniques using oligo-conjugated antibodies or lipids allow barcoding sample-specific cells, a process called ‘hashing’. Here, we compare the hashing performance of TotalSeq-A and -C antibodies, custom synthesized lipids and MULTI-seq lipid hashes in four cell lines, both for single-cell RNA-seq and single-nucleus RNA-seq. Hashing efficiency was evaluated using the intrinsic genetic variation of the cell lines. Benchmarking of different hashing strategies and computational pipelines indicates that correct demultiplexing can be achieved with both lipid- and antibody-hashed human cells and nuclei, with MULTISeqDemux as the preferred demultiplexing function and antibody-based hashing as the most efficient protocol on cells. Antibody hashing was further evaluated on clinical samples using PBMCs from healthy and SARS-CoV-2 infected patients, where we demonstrate a more affordable approach for large single-cell sequencing clinical studies, while simultaneously reducing batch effects.


2021 ◽  
Author(s):  
Brice Letcher ◽  
Martin Hunt ◽  
Zamin Iqbal

Abstract Background: Standard approaches to characterising genetic variation revolve around mapping reads to a reference genome and describing variants in terms of differences from the reference; this is based on the assumption that these differences will be small and provides a simple coordinate system. However, this fails, and the coordinates break down, when there are diverged haplotypes at a locus (e.g. one haplotype contains a multi-kilobase deletion, a second contains a few SNPs, and a third is highly diverged with hundreds of SNPs). To handle these, we need to model genetic variation that occurs at different length-scales (SNPs to large structural variants) and that occurs on alternate backgrounds. We refer to these together as multiscale variation. Results: We model the genome as a directed acyclic graph consisting of successive hierarchical subgraphs (“sites”) that naturally incorporate multiscale variation, and introduce an algorithm for genotyping, implemented in the software gramtools. This enables variant calling on different sequence backgrounds. In addition to producing regular VCF files, we introduce a JSON file format based on VCF, which records variant site relationships and alternate sequence backgrounds. We show two applications. First, we benchmark gramtools against existing state-of-the-art methods in joint-genotyping 17 M. tuberculosis samples at long deletions and the overlapping small variants that segregate in a cohort of 1,017 genomes. Second, in 706 African and SE Asian P. falciparum genomes, we analyse a dimorphic surface antigen gene which possesses variation on two diverged backgrounds which appeared to not recombine. This generates the first map of variation on both backgrounds, revealing patterns of recombination that were previously unknown. Conclusions: We need new approaches to be able to jointly analyse SNP and structural variation in cohorts, and even more to handle variants on different genetic backgrounds. We have demonstrated that by modelling with a directed, acyclic and locally hierarchical genome graph, we can apply new algorithms to accurately genotype dense variation at multiple scales. We also propose a generalisation of VCF for accessing multiscale variation in genome graphs, which we hope will be of wide utility.
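A toy model may help make "successive hierarchical sites" concrete. The sketch below is not the gramtools data structure or its genotyping algorithm; the `Site` class and `haplotypes` function are hypothetical names used only to show how a site whose alleles contain further sites lets a small variant (a SNP) sit on one background of a larger variant (a deletion versus an insertion-like allele).

```python
from dataclasses import dataclass
from itertools import product
from typing import List, Union


@dataclass
class Site:
    """A variant site: each allele is a path of strings and nested sites."""
    alleles: List[List[Union[str, "Site"]]]


def haplotypes(path: List[Union[str, Site]]) -> List[str]:
    """Enumerate every linear sequence spelled by a path through the graph."""
    options = []
    for element in path:
        if isinstance(element, Site):
            # A site contributes the haplotypes of each of its alleles in turn.
            choices = []
            for allele in element.alleles:
                choices.extend(haplotypes(allele))
            options.append(choices)
        else:
            options.append([element])  # invariant sequence between sites
    # An empty path yields a single empty haplotype, which models a deletion.
    return ["".join(combo) for combo in product(*options)]


if __name__ == "__main__":
    snp = Site(alleles=[["G"], ["T"]])
    # Outer site: a deletion allele, or a longer allele carrying the nested SNP.
    outer = Site(alleles=[[], ["AC", snp, "A"]])
    genome = ["TT", outer, "CC"]
    print(haplotypes(genome))
```

The nested SNP only exists on the non-deleted background, so it has no meaningful coordinate on the deletion allele; recording that parent-child relationship is the kind of information a flat reference-based description cannot express directly.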


Genetics ◽  
1975 ◽  
Vol 80 (2) ◽  
pp. 375-394
Author(s):  
C F Wehrhahn

Abstract Most of the models of population genetics are not realistic when applied to data on electrophoretic variants of proteins because the same net charge may result from any of several amino acid combinations. In the absence of realistic models they have, however, been widely used to test competing hypotheses about the origin and maintenance of genetic variation in populations. In this paper I present a general method for determining probability generating functions for electrophoretic state differences. Then I use the method to find allelic state difference distributions for selectively similar electrophoretically detectable alleles in finite natural populations. Predicted patterns of genetic variation, both within and among species, are in reasonable accord with those found in the Drosophila willistoni group by Ayala et al. (1972) and by Ayala and Tracey (1974).

