Scaling accurate genetic variant discovery to tens of thousands of samples

Mapping Intimacies ◽

10.1101/201178 ◽

2017 ◽

Cited By ~ 218

Author(s):

Ryan Poplin ◽

Valentin Ruano-Rubio ◽

Mark A. DePristo ◽

Tim J. Fennell ◽

Mauricio O. Carneiro ◽

...

Keyword(s):

Population Genetics ◽

Genetic Variation ◽

Rare Diseases ◽

Clinical Studies ◽

Genetic Variant ◽

Disease Gene ◽

Variant Calling ◽

Variant Discovery ◽

Disease Gene Discovery ◽

Exome Aggregation Consortium

Abstract: Comprehensive disease gene discovery in both common and rare diseases will require the efficient and accurate detection of all classes of genetic variation across tens to hundreds of thousands of human samples. We describe here a novel assembly-based approach to variant calling, the GATK HaplotypeCaller (HC) and Reference Confidence Model (RCM), that determines genotype likelihoods independently per-sample but performs joint calling across all samples within a project simultaneously. We show by calling over 90,000 samples from the Exome Aggregation Consortium (ExAC) that, in contrast to other algorithms, the HC-RCM scales efficiently to very large sample sizes without loss in accuracy; and that the accuracy of indel variant calling is superior in comparison to other algorithms. More importantly, the HC-RCM produces a fully squared-off matrix of genotypes across all samples at every genomic position being investigated. The HC-RCM is a novel, scalable, assembly-based algorithm with abundant applications for population genetics and clinical studies.
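
As a rough illustration of what a "squared-off" genotype matrix means, the Python sketch below fills in a genotype for every sample at every site where any sample varies. It is not the HC-RCM algorithm. The input layout, the names `genotype_at` and `square_off`, and the use of plain reference-confidence intervals in place of genotype likelihoods are all assumptions made for illustration.

```python
from typing import Dict, List, Tuple

# Hypothetical per-sample input (an assumption, not the GATK data model):
#   calls      - variant genotypes keyed by position, e.g. {100: "0/1"}
#   ref_blocks - inclusive (start, end) intervals where the sample was
#                confidently homozygous reference
SampleCalls = Dict[int, str]
RefBlocks = List[Tuple[int, int]]


def genotype_at(pos: int, calls: SampleCalls, ref_blocks: RefBlocks) -> str:
    """Return a genotype for one sample at one position."""
    if pos in calls:
        return calls[pos]              # the sample has a variant call here
    for start, end in ref_blocks:
        if start <= pos <= end:
            return "0/0"               # covered by a reference-confidence block
    return "./."                       # no evidence either way: missing


def square_off(
    samples: Dict[str, Tuple[SampleCalls, RefBlocks]]
) -> Tuple[List[int], Dict[str, List[str]]]:
    """Build a samples-by-sites matrix with a genotype in every cell."""
    # A site is any position at which at least one sample has a variant call.
    sites = sorted({pos for calls, _ in samples.values() for pos in calls})
    matrix = {
        name: [genotype_at(pos, calls, blocks) for pos in sites]
        for name, (calls, blocks) in samples.items()
    }
    return sites, matrix


samples = {
    "A": ({100: "0/1"}, [(1, 99), (101, 200)]),
    "B": ({150: "1/1"}, [(1, 120)]),
}
sites, matrix = square_off(samples)
# sites is [100, 150]; sample B has no evidence at 150 outside its own call,
# and sample A is reference at 150 because of its (101, 200) block.
```

The point of the sketch is only that a reference-confidence record lets a non-variant sample contribute an explicit homozygous-reference genotype instead of a missing value.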

Characterization of Greater Middle Eastern genetic variation for enhanced disease gene discovery

Nature Genetics ◽

10.1038/ng.3592 ◽

2016 ◽

Vol 48 (9) ◽

pp. 1071-1076 ◽

Cited By ~ 142

Author(s):

Eric M Scott ◽

Anason Halees ◽

Yuval Itan ◽

Emily G Spencer ◽

...

Keyword(s):

Genetic Variation ◽

Disease Gene ◽

Middle Eastern ◽

Gene Discovery ◽

Disease Gene Discovery

HaploTypo: a variant-calling pipeline for phased genomes

Bioinformatics ◽

10.1093/bioinformatics/btz933 ◽

2019 ◽

Vol 36 (8) ◽

pp. 2569-2571 ◽

Cited By ~ 3

Author(s):

Cinta Pegueroles ◽

Verónica Mixão ◽

Laia Carreté ◽

Manu Molina ◽

Toni Gabaldón

Keyword(s):

Genetic Variation ◽

Genetic Variant ◽

Reference Genome ◽

Variant Calling ◽

Supplementary Information ◽

Haplotype Structure ◽

Supplementary Data ◽

Heterozygous Variant ◽

Reference Genomes

Abstract

Summary: An increasing number of phased (i.e. with resolved haplotypes) reference genomes are available. However, most genetic variant calling tools do not explicitly account for haplotype structure. Here, we present HaploTypo, a pipeline tailored to resolve haplotypes in genetic variation analyses. HaploTypo infers the haplotype correspondence for each heterozygous variant called on a phased reference genome.

Availability and implementation: HaploTypo is implemented in Python 2.7 and Python 3.5, and is freely available at https://github.com/gabaldonlab/haplotypo, and as a Docker image.

Supplementary information: Supplementary data are available at Bioinformatics online.

Faculty Opinions recommendation of Computational tools for prioritizing candidate genes: boosting disease gene discovery.

Faculty Opinions – Post-Publication Peer Review of the Biomedical Literature ◽

10.3410/f.722303568.793515871 ◽

2016 ◽

Author(s):

Pietro Chiurazzi

Keyword(s):

Candidate Genes ◽

Disease Gene ◽

Gene Discovery ◽

Computational Tools ◽

Disease Gene Discovery

First-line exome sequencing in Palestinian and Israeli Arabs with neurological disorders is efficient and facilitates disease gene discovery

European Journal of Human Genetics ◽

10.1038/s41431-020-0609-9 ◽

2020 ◽

Vol 28 (8) ◽

pp. 1034-1043 ◽

Cited By ~ 1

Author(s):

Holger Hengel ◽

Rebecca Buchert ◽

Marc Sturm ◽

Tobias B. Haack ◽

Yvonne Schelling ◽

...

Keyword(s):

Exome Sequencing ◽

Neurological Disorders ◽

Disease Gene ◽

Gene Discovery ◽

First Line ◽

Disease Gene Discovery ◽

Israeli Arabs

Correction to: First-line exome sequencing in Palestinian and Israeli Arabs with neurological disorders is efficient and facilitates disease gene discovery

European Journal of Human Genetics ◽

10.1038/s41431-021-00909-7 ◽

2021 ◽

Author(s):

Holger Hengel ◽

Rebecca Buchert ◽

Marc Sturm ◽

Tobias B. Haack ◽

Yvonne Schelling ◽

...

Keyword(s):

Exome Sequencing ◽

Neurological Disorders ◽

Disease Gene ◽

Gene Discovery ◽

First Line ◽

Disease Gene Discovery ◽

Israeli Arabs

The global population genetics of Dengue viruses revealed through temporal and spatial mapping of viral genetic variation

International Journal of Infectious Diseases ◽

10.1016/j.ijid.2016.11.377 ◽

2016 ◽

Vol 53 ◽

pp. 154

Author(s):

K.D. Kawashima ◽

H. Akashi

Keyword(s):

Population Genetics ◽

Genetic Variation ◽

Spatial Mapping ◽

Dengue Viruses ◽

Temporal And Spatial ◽

Global Population

Disease Modeling and Disease Gene Discovery in Cardiomyopathies: A Molecular Study of Induced Pluripotent Stem Cell Generated Cardiomyocytes

International Journal of Molecular Sciences ◽

10.3390/ijms22073311 ◽

2021 ◽

Vol 22 (7) ◽

pp. 3311

Author(s):

Satish Kumar ◽

Joanne E. Curran ◽

Kashish Kumar ◽

Erica DeLeon ◽

Ana C. Leandro ◽

...

Keyword(s):

Stem Cell ◽

Pluripotent Stem Cell ◽

Disease Gene ◽

Disease Modeling ◽

Gene Discovery ◽

Induced Pluripotent Stem Cell ◽

P Value ◽

Disease Gene Discovery ◽

Induced Pluripotent

The in vitro modeling of cardiac development and cardiomyopathies in human induced pluripotent stem cell (iPSC)-derived cardiomyocytes (CMs) provides opportunities to aid the discovery of genetic, molecular, and developmental changes that are causal to, or influence, cardiomyopathies and related diseases. To better understand the functional and disease modeling potential of iPSC-differentiated CMs and to provide a proof of principle for large, epidemiological-scale disease gene discovery approaches into cardiomyopathies, well-characterized CMs, generated from validated iPSCs of 12 individuals who belong to four sibships, and one of whom reported a major adverse cardiac event (MACE), were analyzed by genome-wide mRNA sequencing. The generated CMs expressed CM-specific genes and were highly concordant in their total expressed transcriptome across the 12 samples (correlation coefficient at 95% CI = 0.92 ± 0.02). The functional annotation and enrichment analysis of the 2116 genes that were significantly upregulated in CMs suggest that generated CMs have a transcriptomic and functional profile of immature atrial-like CMs; however, the CMs-upregulated transcriptome also showed high overlap and significant enrichment in primary cardiomyocyte (p-value = 4.36 × 10⁻⁹), primary heart tissue (p-value = 1.37 × 10⁻⁴¹) and cardiomyopathy (p-value = 1.13 × 10⁻²¹) associated gene sets. Modeling the effect of MACE in the generated CMs-upregulated transcriptome identified gene expression phenotypes consistent with the predisposition of the MACE-affected sibship to arrhythmia, prothrombotic, and atherosclerosis risk.

Comparative analysis of antibody- and lipid-based multiplexing methods for single-cell RNA-seq

10.1101/2020.11.16.384222 ◽

2020 ◽

Author(s):

Viacheslav Mylka ◽

Jeroen Aerts ◽

Irina Matetovici ◽

Suresh Poovathingal ◽

Niels Vandamme ◽

...

Keyword(s):

Genetic Variation ◽

Comparative Analysis ◽

Single Cell ◽

Cell Lines ◽

Clinical Studies ◽

Clinical Samples ◽

Rna Seq ◽

Batch Effects ◽

Single Cell Sequencing ◽

Single Nucleus

Abstract: Multiplexing of samples in single-cell RNA-seq studies allows significant reduction of experimental costs, straightforward identification of doublets, increased cell throughput, and reduction of sample-specific batch effects. Recently published multiplexing techniques using oligo-conjugated antibodies or lipids allow barcoding sample-specific cells, a process called ‘hashing’. Here, we compare the hashing performance of TotalSeq-A and -C antibodies, custom synthesized lipids and MULTI-seq lipid hashes in four cell lines, both for single-cell RNA-seq and single-nucleus RNA-seq. Hashing efficiency was evaluated using the intrinsic genetic variation of the cell lines. Benchmarking of different hashing strategies and computational pipelines indicates that correct demultiplexing can be achieved with both lipid- and antibody-hashed human cells and nuclei, with MULTISeqDemux as the preferred demultiplexing function and antibody-based hashing as the most efficient protocol on cells. Antibody hashing was further evaluated on clinical samples using PBMCs from healthy and SARS-CoV-2 infected patients, where we demonstrate a more affordable approach for large single-cell sequencing clinical studies, while simultaneously reducing batch effects.

Enabling multiscale variation analysis with genome graphs

10.1101/2021.02.03.429603 ◽

2021 ◽

Author(s):

Brice Letcher ◽

Martin Hunt ◽

Zamin Iqbal

Keyword(s):

Genetic Variation ◽

Directed Acyclic Graph ◽

Structural Variation ◽

Reference Genome ◽

Multiple Scales ◽

State Of The Art ◽

Variant Calling ◽

Variation Analysis ◽

New Algorithms ◽

Genome Graph

Abstract

Background: Standard approaches to characterising genetic variation revolve around mapping reads to a reference genome and describing variants in terms of differences from the reference; this is based on the assumption that these differences will be small and provides a simple coordinate system. However, this fails, and the coordinates break down, when there are diverged haplotypes at a locus (e.g. one haplotype contains a multi-kilobase deletion, a second contains a few SNPs, and a third is highly diverged with hundreds of SNPs). To handle these, we need to model genetic variation that occurs at different length-scales (SNPs to large structural variants) and that occurs on alternate backgrounds. We refer to these together as multiscale variation.

Results: We model the genome as a directed acyclic graph consisting of successive hierarchical subgraphs (“sites”) that naturally incorporate multiscale variation, and introduce an algorithm for genotyping, implemented in the software gramtools. This enables variant calling on different sequence backgrounds. In addition to producing regular VCF files, we introduce a JSON file format based on VCF, which records variant site relationships and alternate sequence backgrounds. We show two applications. First, we benchmark gramtools against existing state-of-the-art methods in joint-genotyping 17 M. tuberculosis samples at long deletions and the overlapping small variants that segregate in a cohort of 1,017 genomes. Second, in 706 African and SE Asian P. falciparum genomes, we analyse a dimorphic surface antigen gene which possesses variation on two diverged backgrounds which appeared to not recombine. This generates the first map of variation on both backgrounds, revealing patterns of recombination that were previously unknown.

Conclusions: We need new approaches to be able to jointly analyse SNP and structural variation in cohorts, and even more to handle variants on different genetic backgrounds. We have demonstrated that by modelling with a directed, acyclic and locally hierarchical genome graph, we can apply new algorithms to accurately genotype dense variation at multiple scales. We also propose a generalisation of VCF for accessing multiscale variation in genome graphs, which we hope will be of wide utility.
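
To make the idea of successive, hierarchically nested variant sites concrete, here is a minimal Python sketch that represents a locus as a list of segments and enumerates the haplotype sequences it encodes. It does not reproduce gramtools or its genotyping algorithm. The `Site` class, the `expand` function, and the toy sequences are hypothetical and chosen only for illustration.

```python
from typing import List, Union


class Site:
    """A variant site: a choice between alternative alleles.

    Each allele is a list of segments, and a segment is either a literal
    sequence string or another Site, which is what allows nesting.
    """

    def __init__(self, alleles: List[List["Segment"]]):
        self.alleles = alleles


Segment = Union[str, Site]


def expand(segments: List[Segment]) -> List[str]:
    """Enumerate every haplotype sequence encoded by a list of segments."""
    results = [""]
    for seg in segments:
        if isinstance(seg, str):
            options = [seg]
        else:
            # Recurse into each allele so nested sites are expanded too.
            options = [s for allele in seg.alleles for s in expand(allele)]
        results = [prefix + opt for prefix in results for opt in options]
    return results


# Toy locus: an outer site whose first allele carries a nested SNP (T or C)
# and whose second allele is a deletion of the whole region.
locus = [
    "AC",
    Site([
        ["G", Site([["T"], ["C"]]), "A"],  # insertion-like allele with a nested SNP
        [""],                              # deletion allele
    ]),
    "TT",
]
haplotypes = expand(locus)
# haplotypes is ["ACGTATT", "ACGCATT", "ACTT"]
```

Exhaustive enumeration like this grows exponentially with the number of sites, so it is useful only for showing the structure on a small example.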

THE EVOLUTION OF SELECTIVELY SIMILAR ELECTROPHORETICALLY DETECTABLE ALLELES IN FINITE NATURAL POPULATIONS

Genetics ◽

10.1093/genetics/80.2.375 ◽

1975 ◽

Vol 80 (2) ◽

pp. 375-394

Author(s):

C F Wehrhahn

Keyword(s):

Population Genetics ◽

Genetic Variation ◽

Generating Functions ◽

Natural Populations ◽

Net Charge ◽

Electrophoretic Variants ◽

Allelic State ◽

Competing Hypotheses ◽

State Difference ◽

General Method

Abstract: Most of the models of population genetics are not realistic when applied to data on electrophoretic variants of proteins because the same net charge may result from any of several amino acid combinations. In the absence of realistic models they have, however, been widely used to test competing hypotheses about the origin and maintenance of genetic variation in populations. In this paper I present a general method for determining probability generating functions for electrophoretic state differences. Then I use the method to find allelic state difference distributions for selectively similar electrophoretically detectable alleles in finite natural populations. Predicted patterns of genetic variation, both within and among species, are in reasonable accord with those found in the Drosophila willistoni group by Ayala et al. (1972) and by Ayala and Tracey (1974).
