Aquila: diploid personal genome assembly and comprehensive variant detection based on linked reads

2019 ◽  
Author(s):  
Xin Zhou ◽  
Lu Zhang ◽  
Ziming Weng ◽  
David L. Dill ◽  
Arend Sidow

Abstract Variant discovery in personal, whole genome sequence data is critical for uncovering the genetic contributions to health and disease. We introduce a new approach, Aquila, that uses linked-read data for generating a high-quality diploid genome assembly, from which it then comprehensively detects and phases personal genetic variation. Assemblies cover >95% of the human reference genome, with over 98% in a diploid state. Thus, the assemblies support detection and accurate genotyping of the most prevalent types of human genetic variation, including single nucleotide polymorphisms (SNPs), small insertions and deletions (small indels), and structural variants (SVs), in all but the most difficult regions. All heterozygous variants are phased in blocks that can approach arm-level length. The final output of Aquila is a diploid and phased personal genome sequence, and a phased VCF file that also contains homozygous and a few unphased heterozygous variants. Aquila represents a cost-effective evolution of whole-genome reconstruction that can be applied to cohorts for variation discovery or association studies, or to single individuals with rare phenotypes that could be caused by SVs or compound heterozygosity.

2021 ◽  
Vol 12 (1) ◽  
Author(s):  
Xin Zhou ◽  
Lu Zhang ◽  
Ziming Weng ◽  
David L. Dill ◽  
Arend Sidow

Abstract We introduce Aquila, a new approach to variant discovery in personal genomes, which is critical for uncovering the genetic contributions to health and disease. Aquila uses a reference sequence and linked-read data to generate a high-quality diploid genome assembly, from which it then comprehensively detects and phases personal genetic variation. The contigs of the assemblies from our libraries cover >95% of the human reference genome, with over 98% of that in a diploid state. Thus, the assemblies support detection and accurate genotyping of the most prevalent types of human genetic variation, including single nucleotide polymorphisms (SNPs), small insertions and deletions (small indels), and structural variants (SVs), in all but the most difficult regions. All heterozygous variants are phased in blocks that can approach arm-level length. The final output of Aquila is a diploid and phased personal genome sequence, and a phased Variant Call Format (VCF) file that also contains homozygous and a few unphased heterozygous variants. Aquila represents a cost-effective approach that can be applied to cohorts for variation discovery or association studies, or to single individuals with rare phenotypes that could be caused by SVs or compound heterozygosity.


2015 ◽  
Author(s):  
Hubert Pausch ◽  
Reiner Emmerling ◽  
Hermann Schwarzenbacher ◽  
Ruedi Fries

Background: The availability of whole-genome sequence data from key ancestors provides an exhaustive catalogue of polymorphic sites segregating within and across cattle breeds. Sequence variants from key ancestors can be imputed in animals that have been genotyped using medium- and high-density genotyping arrays. Association analysis with imputed sequences, particularly if applied to multiple traits simultaneously, is a very powerful approach to revealing candidate causal variants underlying complex phenotypes. Results: We used whole-genome sequence data from 157 key ancestors of the German Fleckvieh population to impute 20 561 798 sequence variants in 10 363 animals that had (partly imputed) array-derived genotypes at 634 109 SNPs. The imputed sequence data were enriched for rare variants. Association studies with imputed sequence variants were performed using seven correlated udder conformation traits as response variables. The calculation of an approximate multi-trait test statistic enabled us to detect twelve major QTL (P < 2.97 × 10⁻⁹) controlling different aspects of mammary gland morphology. Imputed sequence variants were the most significantly associated at eleven QTL, whereas the top association signal at a QTL on BTA14 resulted from an array-derived variant. Seven QTL were associated with multiple phenotypes. Most QTL were located in non-coding regions of the genome, although close to plausible candidate genes for mammary gland morphology (SP5, GC, NPFFR2, CRIM1, RXFP2, TBX5, RBM19, ADAM12). Conclusions: Association analysis with imputed sequence variants allows QTL characterization at maximum resolution. Multi-trait approaches can reveal QTL that are not detected in single-trait association studies. Most QTL for udder conformation traits were located in non-coding elements of the genome, suggesting that regulatory mutations are the major determinants of variation in mammary gland morphology in cattle.


2015 ◽  
Author(s):  
Shane McCarthy ◽  
Sayantan Das ◽  
Warren Kretzschmar ◽  
Olivier Delaneau ◽  
Andrew R. Wood ◽  
...  

We describe a reference panel of 64,976 human haplotypes at 39,235,157 SNPs constructed using whole genome sequence data from 20 studies of predominantly European ancestry. Using this resource leads to accurate genotype imputation at minor allele frequencies as low as 0.1% and to a large increase in the number of SNPs tested in association studies, and it can help to discover and refine causal loci. We describe remote server resources that allow researchers to carry out imputation and phasing consistently and efficiently.


2019 ◽  
Vol 6 (1) ◽  
Author(s):  
Alejandra Vergara-Lope ◽  
M. Reza Jabalameli ◽  
Clare Horscroft ◽  
Sarah Ennis ◽  
Andrew Collins ◽  
...  

Abstract Quantification of linkage disequilibrium (LD) patterns in the human genome is essential for genome-wide association studies, selection signature mapping and studies of recombination. Whole genome sequence (WGS) data provides optimal source data for this quantification as it is free from biases introduced by the design of array genotyping platforms. The Malécot-Morton model of LD allows the creation of a cumulative map for each chromosome, analogous to an LD form of a linkage map. Here we report LD maps generated from WGS data for a large population of European ancestry, as well as populations of Baganda, Ethiopian and Zulu ancestry. We achieve high average genetic marker densities of 2.3–4.6/kb. These maps show good agreement with prior, low resolution maps and are consistent between populations. Files are provided in BED format to allow researchers to readily utilise this resource.


2021 ◽  
Vol 53 (1) ◽  
Author(s):  
Sunduimijid Bolormaa ◽  
Andrew A. Swan ◽  
Paul Stothard ◽  
Majid Khansefid ◽  
Nasir Moghaddar ◽  
...  

Abstract Background: Imputation to whole-genome sequence is now possible in large sheep populations. It is therefore of interest to use this data in genome-wide association studies (GWAS) to investigate putative causal variants and genes that underpin economically important traits. Merino wool is globally sought after for luxury fabrics, but some key wool quality attributes are unfavourably correlated with the characteristic skin wrinkle of Merinos. In turn, skin wrinkle is strongly linked to susceptibility to “fly strike” (cutaneous myiasis), which is a major welfare issue. Here, we use whole-genome sequence data in a multi-trait GWAS to identify pleiotropic putative causal variants and genes associated with changes in key wool traits and skin wrinkle. Results: A stepwise conditional multi-trait GWAS (CM-GWAS) identified putative causal variants and related genes from 178 independent quantitative trait loci (QTL) of 16 wool and skin wrinkle traits, measured on up to 7218 Merino sheep with 31 million imputed whole-genome sequence (WGS) genotypes. Novel candidate gene findings included the MAT1A gene that encodes an enzyme involved in the sulphur metabolism pathway critical to production of wool proteins, and the ESRP1 gene. We also discovered a significant wrinkle variant upstream of the HAS2 gene, which in dogs is associated with the exaggerated skin folds in the Shar-Pei breed. Conclusions: The wool and skin wrinkle traits studied here appear to be highly polygenic with many putative candidate variants showing considerable pleiotropy. Our CM-GWAS identified many highly plausible candidate genes for wool traits as well as breech wrinkle and breech area wool cover.


2021 ◽  
Vol 12 ◽  
Author(s):  
Hao Cheng ◽  
Keyu Xu ◽  
Jinghui Li ◽  
Kuruvilla Joseph Abraham

Low-cost genome-wide single-nucleotide polymorphisms (SNPs) are routinely used in animal breeding programs. Compared to SNP arrays, the use of whole-genome sequence data generated by the next-generation sequencing technologies (NGS) has great potential in livestock populations. However, sequencing a large number of animals to exploit the full potential of whole-genome sequence data is not feasible. Thus, novel strategies are required for the allocation of sequencing resources in genotyped livestock populations such that the entire population can be imputed, maximizing the efficiency of whole genome sequencing budgets. We present two applications of linear programming for the efficient allocation of sequencing resources. The first application is to identify the minimum number of animals for sequencing subject to the criterion that each haplotype in the population is contained in at least one of the animals selected for sequencing. The second application is the selection of animals whose haplotypes include the largest possible proportion of common haplotypes present in the population, assuming a limited sequencing budget. Both applications are available in an open source program LPChoose. In both applications, LPChoose has similar or better performance than some other methods suggesting that linear programming methods offer great potential for the efficient allocation of sequencing resources. The utility of these methods can be increased through the development of improved heuristics.


2015 ◽  
Vol 6 (1) ◽  
Author(s):  
Peter N. Taylor ◽  
Eleonora Porcu ◽  
Shelby Chew ◽  
Purdey J. Campbell ◽  
...  

Abstract Normal thyroid function is essential for health, but its genetic architecture remains poorly understood. Here, for the heritable thyroid traits thyrotropin (TSH) and free thyroxine (FT4), we analyse whole-genome sequence data from the UK10K project (N=2,287). Using additional whole-genome sequence and deeply imputed data sets, we report meta-analysis results for common variants (MAF≥1%) associated with TSH and FT4 (N=16,335). For TSH, we identify a novel variant in SYN2 (MAF=23.5%, P=6.15 × 10⁻⁹) and a new independent variant in PDE8B (MAF=10.4%, P=5.94 × 10⁻¹⁴). For FT4, we report a low-frequency variant near B4GALT6/SLC25A52 (MAF=3.2%, P=1.27 × 10⁻⁹) tagging a rare TTR variant (MAF=0.4%, P=2.14 × 10⁻¹¹). All common variants explain ≥20% of the variance in TSH and FT4. Analysis of rare variants (MAF<1%) using sequence kernel association testing reveals a novel association with FT4 in NRG1. Our results demonstrate that increased coverage in whole-genome sequence association studies identifies novel variants associated with thyroid function.


Foods ◽  
2021 ◽  
Vol 10 (8) ◽  
pp. 1794
Author(s):  
Elizabeth Sage Hunter ◽  
Robert Literman ◽  
Sara M. Handy

The botanical genus Digitalis is equal parts colorful, toxic, and medicinal, and its bioactive compounds have a long history of therapeutic use. However, with an extremely narrow therapeutic range, even trace amounts of Digitalis can cause adverse effects. Using chemical methods, the United States Food and Drug Administration traced a 1997 case of Digitalis toxicity to a shipment of Plantago (a common ingredient in dietary supplements marketed to improve digestion) contaminated with Digitalis lanata. With increased accessibility to next generation sequencing technology, here we ask whether this case could have been cracked rapidly using shallow genome sequencing strategies (e.g., genome skims). Using a modified implementation of the Site Identification from Short Read Sequences (SISRS) bioinformatics pipeline with whole-genome sequence data, we generated over 2 M genus-level single nucleotide polymorphisms in addition to species-informative single nucleotide polymorphisms. We simulated dietary supplement contamination by spiking low quantities (0–10%) of Digitalis whole-genome sequence data into a background of commonly used ingredients in products marketed for “digestive cleansing” and reliably detected Digitalis at the genus level while also discriminating between Digitalis species. This work serves as a roadmap for the development of novel DNA-based assays to quickly and reliably detect the presence of toxic species such as Digitalis in food products or dietary supplements using genomic methods and highlights the power of harnessing the entire genome to identify botanical species.


2020 ◽  
Author(s):  
Hao Cheng ◽  
Keyu Xu ◽  
Kuruvilla Joseph Abraham

Abstract Background: Low-cost genome-wide single-nucleotide polymorphisms (SNPs) are routinely used in animal breeding programs. Compared to SNP arrays, the use of whole-genome sequence data generated by the next-generation sequencing technologies (NGS) has great potential in livestock populations. However, a large number of animals are required to be sequenced to exploit the full potential of whole-genome sequence data. Thus, novel strategies are desired to allocate sequencing resources in genotyped livestock populations such that the entire population can be sequenced or imputed efficiently. Methods: We present two applications of linear programming models, called LPChoose, for the allocation of sequencing resources. The first application is to identify the minimum number of animals for sequencing while meeting the criterion that each haplotype in the population is contained in at least one of the animals selected for sequencing. The second is to sequence a fixed number of animals whose haplotypes include as large a proportion as possible of the haplotypes present in the population given a limited sequencing budget. In both cases, we assume that all animals have been haplotyped. We present results from approximation algorithms, and motivate the use of approximations through the correspondence of the problems we address with problems in computer science for which there are no known efficient algorithms. Results: In both applications LPChoose performed consistently better than some existing methods making similar assumptions.
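
The two selection problems in this abstract correspond to the classical set cover and maximum coverage problems. The sketch below illustrates them with a simple greedy heuristic rather than linear programming, so it is not LPChoose's method and should not be read as its implementation. The function names, the dictionary layout (animal ID mapped to the set of haplotype IDs it carries), and the toy data are all assumptions made for illustration.

```python
from typing import Dict, List, Set, Tuple


def greedy_min_animals(haplotypes: Dict[str, Set[str]]) -> List[str]:
    """Greedy set cover: pick animals until every haplotype is covered.

    Hypothetical helper; a heuristic stand-in for the first application.
    """
    uncovered = set().union(*haplotypes.values())
    chosen: List[str] = []
    while uncovered:
        # Take the animal carrying the most still-uncovered haplotypes;
        # sorting the IDs makes tie-breaking deterministic.
        best = max(sorted(haplotypes), key=lambda a: len(haplotypes[a] & uncovered))
        chosen.append(best)
        uncovered -= haplotypes[best]
    return chosen


def greedy_budgeted(haplotypes: Dict[str, Set[str]], budget: int) -> Tuple[List[str], float]:
    """Greedy maximum coverage: pick at most `budget` animals.

    Returns the chosen animals and the proportion of distinct haplotypes covered.
    Hypothetical helper; a heuristic stand-in for the second application.
    """
    total = set().union(*haplotypes.values())
    covered: Set[str] = set()
    chosen: List[str] = []
    for _ in range(min(budget, len(haplotypes))):
        remaining = [a for a in sorted(haplotypes) if a not in chosen]
        best = max(remaining, key=lambda a: len(haplotypes[a] - covered))
        if not haplotypes[best] - covered:
            break  # no remaining animal adds a new haplotype
        chosen.append(best)
        covered |= haplotypes[best]
    proportion = len(covered) / len(total) if total else 0.0
    return chosen, proportion


if __name__ == "__main__":
    # Toy data: four animals carrying five distinct haplotypes.
    animals = {
        "A": {"h1", "h2", "h3"},
        "B": {"h3", "h4"},
        "C": {"h4", "h5"},
        "D": {"h1", "h5"},
    }
    print(greedy_min_animals(animals))
    print(greedy_budgeted(animals, budget=1))
```

A greedy pass is used here only because it is short and dependency-free; it gives an approximate answer, whereas an integer or linear programming formulation of the same constraints can be handed to a dedicated solver.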

