Aquila enables reference-assisted diploid personal genome assembly and comprehensive variant detection based on linked reads

2021 ◽  
Vol 12 (1) ◽  
Author(s):  
Xin Zhou ◽  
Lu Zhang ◽  
Ziming Weng ◽  
David L. Dill ◽  
Arend Sidow

Abstract: We introduce Aquila, a new approach to variant discovery in personal genomes, which is critical for uncovering the genetic contributions to health and disease. Aquila uses a reference sequence and linked-read data to generate a high-quality diploid genome assembly, from which it then comprehensively detects and phases personal genetic variation. The contigs of the assemblies from our libraries cover >95% of the human reference genome, with over 98% of that in a diploid state. Thus, the assemblies support detection and accurate genotyping of the most prevalent types of human genetic variation, including single nucleotide polymorphisms (SNPs), small insertions and deletions (small indels), and structural variants (SVs), in all but the most difficult regions. All heterozygous variants are phased in blocks that can approach arm-level length. The final output of Aquila is a diploid and phased personal genome sequence, and a phased Variant Call Format (VCF) file that also contains homozygous and a few unphased heterozygous variants. Aquila represents a cost-effective approach that can be applied to cohorts for variation discovery or association studies, or to single individuals with rare phenotypes that could be caused by SVs or compound heterozygosity.

2019 ◽  
Author(s):  
Xin Zhou ◽  
Lu Zhang ◽  
Ziming Weng ◽  
David L. Dill ◽  
Arend Sidow

Abstract: Variant discovery in personal, whole genome sequence data is critical for uncovering the genetic contributions to health and disease. We introduce a new approach, Aquila, that uses linked-read data for generating a high-quality diploid genome assembly, from which it then comprehensively detects and phases personal genetic variation. Assemblies cover >95% of the human reference genome, with over 98% in a diploid state. Thus, the assemblies support detection and accurate genotyping of the most prevalent types of human genetic variation, including single nucleotide polymorphisms (SNPs), small insertions and deletions (small indels), and structural variants (SVs), in all but the most difficult regions. All heterozygous variants are phased in blocks that can approach arm-level length. The final output of Aquila is a diploid and phased personal genome sequence, and a phased VCF file that also contains homozygous and a few unphased heterozygous variants. Aquila represents a cost-effective evolution of whole-genome reconstruction that can be applied to cohorts for variation discovery or association studies, or to single individuals with rare phenotypes that could be caused by SVs or compound heterozygosity.


Diagnostics ◽  
2021 ◽  
Vol 11 (2) ◽  
pp. 301
Author(s):  
Jana Mrazkova ◽  
Petr Sistek ◽  
Jan Lochman ◽  
Lydie Izakovicova Holla ◽  
Zdenek Danek ◽  
...  

Mannose-binding lectin (MBL) deficiency caused by the variability in the MBL2 gene is responsible for the susceptibility to and severity of various infectious and autoimmune diseases. A combination of six single nucleotide polymorphisms (SNPs) has a major impact on MBL levels in circulation. The aim of this study is to design and validate a sensitive and economical method for determining MBL2 haplogenotypes. The SNaPshot assay is designed and optimized to genotype six SNPs (rs1800451, rs1800450, rs5030737, rs7095891, rs7096206, rs11003125) and is validated by comparing results with Sanger sequencing. Additionally, an algorithm for online calculation of haplogenotype combinations from the determined genotypes is developed. Three hundred and twenty-eight DNA samples from healthy individuals from the Czech population are genotyped. Minor allele frequencies (MAFs) in the Czech population are in accordance with those present in the European population. The SNaPshot assay for MBL2 genotyping is a high-throughput, cost-effective technique that can be used in further genetic-association studies or in clinical practice. Moreover, a freely available online application for the calculation of haplogenotypes from SNPs is developed within the scope of this project.


2011 ◽  
Vol 9 (2) ◽  
pp. 300-304 ◽  
Author(s):  
S. Negrão ◽  
C. Almadanim ◽  
I. Pires ◽  
K. L. McNally ◽  
M. M. Oliveira

Rice is a salt-sensitive species with enormous genetic variation for salt tolerance hidden in its germplasm pool. The EcoTILLING technique allows us to assign haplotypes, thus reducing the number of accessions to be sequenced, becoming a cost-effective, time-saving and high-throughput method, ideal to be used in laboratories with limited financial resources. Aiming to find alleles associated with salinity tolerance, we are currently using the EcoTILLING technique to detect single nucleotide polymorphisms (SNPs) and small indels across 375 germplasm accessions representing the diversity available in domesticated rice. We are targeting several genes known to be involved in salt stress signal transduction (OsCPK17) or tolerance mechanisms (SalT). So far, we found a total of 15 and 23 representative SNPs or indels in OsCPK17 and SalT, respectively. These natural allelic variants are mostly located in 3′-untranslated region, thus opening a new path for studying their potential contribution to the regulation of gene expression and possible role in salt tolerance.


Blood ◽  
2007 ◽  
Vol 110 (11) ◽  
pp. 1640-1640
Author(s):  
Ulrike Nowak-Gottl ◽  
Hartmut Weiler ◽  
Tanja Seehafer ◽  
Sabine Thedieck ◽  
Monika Stoll

Abstract Background: Fibrinogen, the precursor of fibrin, is an essential component of the hemostatic system. A previous large case-control study showed that genetic variation in the fibrinogen gamma gene (FGG) increased the risk for VT in adults. Here we investigated the association of haplotypes comprising the fibrinogen alpha (FGA) and gamma (FGG) genes, carriership of the Factor V Leiden mutation and risk for VT in a large family-based study sample for pediatric VT. Methods: We genotyped 188 pediatric VT families for seven single nucleotide polymorphisms (SNPs) (rs6050, rs2070016, rs2070014, rs2070011, rs1049636, rs2066861 and rs2066860) as well as the G1691A Factor V Leiden (FV) polymorphism. Association was assessed using the Transmission Disequilibrium Test (TDT) and corrected for multiple testing using permutation testing as implemented in HAPLOVIEW. Interaction between FV and FGA and FGG, respectively, was assessed by TDT in families stratified for presence or absence of the FV mutation in the affected child. Results: rs6050, rs2070016, rs2070014 and rs2070011, located in the FGA gene, are in tight linkage disequilibrium (LD), define 5 common haplotypes (HT) and are linked with the neighboring FGG gene (q = 0.91). rs1049636, rs2066861 and rs2066860, located in FGG, are in tight LD and define 4 HTs. HTs in both FGA and FGG are significantly overtransmitted from parents to affected offspring (FGA: HT1 (AACT), HT frequency 0.32, T:U 62:32, p=0.0025; FGG: HT2 (ATC), HT frequency 0.32, T:U 60:32, p=0.0035). When stratifying for FV status, it became apparent that the association between FGA and FGG and VT was more pronounced in FV-negative families (FGA, HT1, T:U 55:24, p=0.0006; FGG, HT2, T:U 55:24, p=0.0005), while absent in FV-positive families. Conclusion: Our results indicate that genetic variation in FGA and FGG is a risk factor for VT in children, and further that an epistatic interaction between FGA/FGG and FV Leiden modulates the effect of FGA and FGG on the risk of pediatric VT. Our study highlights the complex nature of VT and the necessity to evaluate gene-gene interactions in association studies of complex, polygenic diseases.
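For readers unfamiliar with the TDT, the following is a minimal sketch of the standard McNemar-type statistic, (T − U)² / (T + U), compared against a chi-square distribution with one degree of freedom. It is not the HAPLOVIEW implementation used in the study: the function name `tdt` is hypothetical, and the asymptotic p-value it returns need not coincide with the permutation-corrected values quoted in the abstract.

```python
import math


def tdt(transmitted: int, untransmitted: int) -> tuple[float, float]:
    """Return the TDT chi-square statistic and its asymptotic p-value (1 df).

    `transmitted` (T) and `untransmitted` (U) count how often heterozygous
    parents passed the haplotype of interest to an affected child, or did not.
    """
    total = transmitted + untransmitted
    if total == 0:
        raise ValueError("need at least one informative transmission")
    chi2 = (transmitted - untransmitted) ** 2 / total
    # For a chi-square variable with 1 degree of freedom,
    # P(X > x) = erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value


# T:U counts quoted in the abstract for FGG HT2 (illustrative use only).
chi2, p = tdt(60, 32)
print(f"chi2 = {chi2:.2f}, p = {p:.4f}")
```

Under the null hypothesis of no linkage or association, a heterozygous parent transmits either haplotype with equal probability, which is why only the imbalance between T and U enters the statistic.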


2021 ◽  
Author(s):  
Jyun-Hong Lin ◽  
Liang-Chi Chen ◽  
Shu-Qi Yu ◽  
Yao-Ting Huang

Abstract: Long-read phasing has been used for reconstructing diploid genomes, improving variant calling, and resolving microbial strains in metagenomics. However, the phasing blocks of existing methods are broken by large Structural Variations (SVs), and the efficiency is unsatisfactory for population-scale phasing. This paper presents an ultra-fast algorithm, LongPhase, which can simultaneously phase single nucleotide polymorphisms (SNPs) and SVs of a human genome in ∼10-20 minutes, 10x faster than the state-of-the-art WhatsHap and Margin. In particular, LongPhase produces much larger phased blocks at almost chromosome level with only long reads (N50=26Mbp). We demonstrate that LongPhase combined with Nanopore is a cost-effective approach for providing chromosome-scale phasing without the need for additional trios, chromosome-conformation, and single-cell strand-seq data.
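The N50 figure quoted for phased blocks can be read with the usual definition: the block length at which blocks of that length or longer account for at least half of the total phased length. The sketch below computes it under that assumption. It is not taken from LongPhase, the function name `n50` is hypothetical, and the block lengths are made-up toy values.

```python
def n50(lengths):
    """Return the N50 of a collection of block (or contig) lengths."""
    ordered = sorted(lengths, reverse=True)
    total = sum(ordered)
    running = 0
    for length in ordered:
        running += length
        # Stop at the first block where the cumulative length
        # reaches half of the total (integer comparison, no floats).
        if 2 * running >= total:
            return length
    raise ValueError("lengths must be non-empty")


# Toy phase-block lengths in arbitrary units (hypothetical data).
blocks = [2, 2, 2, 3, 3, 4, 8, 8]
print(n50(blocks))
```

Because the statistic is weighted by length, a handful of very long blocks can dominate it, so it is usually reported alongside the number of blocks or the total phased length.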


2019 ◽  
Author(s):  
Lu Zhang ◽  
Xin Zhou ◽  
Ziming Weng ◽  
Arend Sidow

Abstract Background: Producing cost-effective haplotype-resolved personal genomes remains challenging. 10x Linked-Read sequencing, with its high base quality and long-range information, has been demonstrated to facilitate de novo assembly of human genomes and variant detection. In this study, we investigate in depth how the parameter space of 10x library preparation and sequencing affects assembly quality, on the basis of both simulated and real libraries. Findings: We prepared and sequenced eight 10x libraries with a diverse set of parameters from standard cell lines NA12878 and NA24385 and performed whole genome assembly on the data. We also developed the simulator LRTK-SIM to follow the workflow of 10x data generation and produce realistic simulated Linked-Read data sets. We found that assembly quality could be improved by increasing the total sequencing coverage (C) and keeping physical coverage of DNA fragments (CF) or read coverage per fragment (CR) within broad ranges. The optimal physical coverage was between 332X and 823X and assembly quality worsened if it increased to greater than 1,000X for a given C. Long DNA fragments could significantly extend phase blocks, but decreased contig contiguity. The optimal length-weighted fragment length (WμFL) was around 50–150 kb. When broadly optimal parameters were used for library preparation and sequencing, ca. 80% of the genome was assembled in a diploid state. Conclusion: The Linked-Read libraries we generated and the parameter space we identified provide theoretical considerations and practical guidelines for personal genome assemblies based on 10x Linked-Read sequencing.


PLoS Genetics ◽  
2021 ◽  
Vol 17 (1) ◽  
pp. e1008748
Author(s):  
Benedict Wieters ◽  
Kim A. Steige ◽  
Fei He ◽  
Evan M. Koch ◽  
Sebastián E. Ramos-Onsins ◽  
...  

The rate at which plants grow is a major functional trait in plant ecology. However, little is known about its evolution in natural populations. Here, we investigate evolutionary and environmental factors shaping variation in the growth rate of Arabidopsis thaliana. We used plant diameter as a proxy to monitor plant growth over time in environments that mimicked latitudinal differences in the intensity of natural light radiation, across a set of 278 genotypes sampled within four broad regions, including an outgroup set of genotypes from China. A field experiment conducted under natural conditions confirmed the ecological relevance of the observed variation. All genotypes markedly expanded their rosette diameter when the light supply was decreased, demonstrating that environmental plasticity is a predominant source of variation to adapt plant size to prevailing light conditions. Yet, we detected significant levels of genetic variation both in growth rate and growth plasticity. Genome-wide association studies revealed that only 2 single nucleotide polymorphisms associate with genetic variation for growth above Bonferroni confidence levels. However, marginally associated variants were significantly enriched among genes with an annotated role in growth and stress reactions. Polygenic scores computed from marginally associated variants confirmed the polygenic basis of growth variation. For both light regimes, phenotypic divergence between the most distantly related population (China) and the various regions in Europe is smaller than the variation observed within Europe, indicating that the evolution of growth rate is likely to be constrained by stabilizing selection. We observed that Spanish genotypes, however, reach a significantly larger size than Northern European genotypes. Tests of adaptive divergence and analysis of the individual burden of deleterious mutations reveal that adaptive processes have played a more important role in shaping regional differences in rosette growth than maladaptive evolution.


GigaScience ◽  
2019 ◽  
Vol 8 (11) ◽  
Author(s):  
Lu Zhang ◽  
Xin Zhou ◽  
Ziming Weng ◽  
Arend Sidow

Abstract Background: Producing cost-effective haplotype-resolved personal genomes remains challenging. 10x Linked-Read sequencing, with its high base quality and long-range information, has been demonstrated to facilitate de novo assembly of human genomes and variant detection. In this study, we investigate in depth how the parameter space of 10x library preparation and sequencing affects assembly quality, on the basis of both simulated and real libraries. Results: We prepared and sequenced eight 10x libraries with a diverse set of parameters from standard cell lines NA12878 and NA24385 and performed whole-genome assembly on the data. We also developed the simulator LRTK-SIM to follow the workflow of 10x data generation and produce realistic simulated Linked-Read data sets. We found that assembly quality could be improved by increasing the total sequencing coverage (C) and keeping physical coverage of DNA fragments (CF) or read coverage per fragment (CR) within broad ranges. The optimal physical coverage was between 332× and 823× and assembly quality worsened if it increased to >1,000× for a given C. Long DNA fragments could significantly extend phase blocks but decreased contig contiguity. The optimal length-weighted fragment length (WμFL) was ∼50–150 kb. When broadly optimal parameters were used for library preparation and sequencing, ∼80% of the genome was assembled in a diploid state. Conclusions: The Linked-Read libraries we generated and the parameter space we identified provide theoretical considerations and practical guidelines for personal genome assemblies based on 10x Linked-Read sequencing.


Author(s):  
Benedict Wieters ◽  
Kim A. Steige ◽  
Fei He ◽  
Evan M. Koch ◽  
Sebastián E. Ramos-Onsins ◽  
...  

Abstract: The rate at which plants grow is a major functional trait in plant ecology. However, little is known about its evolution in natural populations. Here, we investigate evolutionary and environmental factors shaping variation in the growth rate of Arabidopsis thaliana. We used plant diameter as a proxy to monitor plant growth over time in environments that mimicked latitudinal differences in the intensity of natural light radiation, across a set of 278 genotypes sampled within four broad regions, including an outgroup set of genotypes from China. A field experiment conducted under natural conditions confirmed the ecological relevance of the observed variation. All genotypes markedly expanded their rosette diameter when the light supply was decreased, demonstrating that environmental plasticity is a predominant source of variation to adapt plant size to prevailing light conditions. Yet, we detected significant levels of genetic variation both in growth rate and growth plasticity. Genome-wide association studies revealed that only 2 single nucleotide polymorphisms associate with genetic variation for growth above Bonferroni confidence levels. However, marginally associated variants were significantly enriched among genes with an annotated role in growth and stress reactions. Polygenic scores computed from marginally associated variants confirmed the polygenic basis of growth variation. For both light regimes, phenotypic divergence between the most distantly related population (China) and the various regions in Europe is smaller than the variation observed within Europe, indicating that some level of stabilizing selection constrains the evolution of growth rate. We observed that Spanish genotypes, however, reach a significantly larger size than Northern European genotypes. Tests of adaptive divergence and analysis of the individual burden of deleterious mutations reveal that adaptive processes have played a more important role in shaping regional differences in rosette growth than maladaptive evolution.


2019 ◽  
Author(s):  
Corbin Quick ◽  
Pramod Anugu ◽  
Solomon Musani ◽  
Scott T. Weiss ◽  
Esteban G. Burchard ◽  
...  

Abstract: A key aim for current genome-wide association studies (GWAS) is to interrogate the full spectrum of genetic variation underlying human traits, including rare variants, across populations. Deep whole-genome sequencing is the gold standard to capture the full spectrum of genetic variation, but remains prohibitively expensive for large samples. Array genotyping interrogates a sparser set of variants, which can be used as a scaffold for genotype imputation to capture variation across a wider set of variants. However, imputation coverage and accuracy depend crucially on the reference panel size and genetic distance from the target population. Here, we consider a strategy in which a subset of study participants is sequenced and the rest array-genotyped and imputed using a reference panel that comprises the sequenced study participants and individuals from an external reference panel. We systematically assess how imputation quality and statistical power for association depend on the number of individuals sequenced and included in the reference panel for two admixed populations (African and Latino Americans) and two European population isolates (Sardinians and Finns). We develop a framework to identify powerful and cost-effective GWAS designs in these populations given current sequencing and array genotyping costs. For populations that are well-represented in current reference panels, we find that array genotyping alone is cost-effective and well-powered to detect both common- and rare-variant associations. For poorly represented populations, we find that sequencing a subset of study participants to improve imputation is often more cost-effective than array genotyping alone, and can substantially increase genomic coverage and power.

