Efficient phasing and imputation of low-coverage sequencing data using large reference panels

Author(s):  
S. Rubinacci ◽  
D.M. Ribeiro ◽  
R. Hofmeister ◽  
O. Delaneau

Abstract Low-coverage whole genome sequencing followed by imputation has been proposed as a cost-effective genotyping approach for disease and population genetics studies. However, its competitiveness against SNP arrays is undermined as current imputation methods are computationally expensive and unable to leverage large reference panels. Here, we describe a method, GLIMPSE, for phasing and imputation of low-coverage sequencing datasets from modern reference panels. We demonstrate its remarkable performance across different coverages and human populations. It achieves imputation of a full genome for less than $1, outperforming existing methods by orders of magnitude, with an increased accuracy of more than 20% at rare variants. We also show that 1x coverage enables effective association studies and is better suited than dense SNP arrays to assess the impact of rare variations. Overall, this study demonstrates the promising potential of low-coverage imputation and suggests a paradigm shift in the design of future genomic studies.

2015 ◽  
Vol 112 (4) ◽  
pp. 1019-1024 ◽  
Author(s):  
Yi-Juan Hu ◽  
Yun Li ◽  
Paul L. Auer ◽  
Dan-Yu Lin

In the large cohorts that have been used for genome-wide association studies (GWAS), it is prohibitively expensive to sequence all cohort members. A cost-effective strategy is to sequence subjects with extreme values of quantitative traits or those with specific diseases. By imputing the sequencing data from the GWAS data for the cohort members who are not selected for sequencing, one can dramatically increase the number of subjects with information on rare variants. However, ignoring the uncertainties of imputed rare variants in downstream association analysis will inflate the type I error when sequenced subjects are not a random subset of the GWAS subjects. In this article, we provide a valid and efficient approach to combining observed and imputed data on rare variants. We consider commonly used gene-level association tests, all of which are constructed from the score statistic for assessing the effects of individual variants on the trait of interest. We show that the score statistic based on the observed genotypes for sequenced subjects and the imputed genotypes for nonsequenced subjects is unbiased. We derive a robust variance estimator that reflects the true variability of the score statistic regardless of the sampling scheme and imputation quality, such that the corresponding association tests always have correct type I error. We demonstrate through extensive simulation studies that the proposed tests are substantially more powerful than the use of accurately imputed variants only and the use of sequencing data alone. We provide an application to the Women’s Health Initiative. The relevant software is freely available.


Nature ◽  
2017 ◽  
Vol 550 (7675) ◽  
pp. 239-243 ◽  
Author(s):  
Xin Li ◽  
Yungil Kim ◽  
Emily K. Tsang ◽  
Joe R. Davis ◽  
...  

Abstract Rare genetic variants are abundant in humans and are expected to contribute to individual disease risk [1,2,3,4]. While genetic association studies have successfully identified common genetic variants associated with susceptibility, these studies are not practical for identifying rare variants [1,5]. Efforts to distinguish pathogenic variants from benign rare variants have leveraged the genetic code to identify deleterious protein-coding alleles [1,6,7], but no analogous code exists for non-coding variants. Therefore, ascertaining which rare variants have phenotypic effects remains a major challenge. Rare non-coding variants have been associated with extreme gene expression in studies using single tissues [8,9,10,11], but their effects across tissues are unknown. Here we identify gene expression outliers, or individuals showing extreme expression levels for a particular gene, across 44 human tissues by using combined analyses of whole genomes and multi-tissue RNA-sequencing data from the Genotype-Tissue Expression (GTEx) project v6p release [12]. We find that 58% of underexpression and 28% of overexpression outliers have nearby conserved rare variants compared to 8% of non-outliers. Additionally, we developed RIVER (RNA-informed variant effect on regulation), a Bayesian statistical model that incorporates expression data to predict a regulatory effect for rare variants with higher accuracy than models using genomic annotations alone. Overall, we demonstrate that rare variants contribute to large gene expression changes across tissues and provide an integrative method for interpretation of rare variants in individual genomes.


2021 ◽  
pp. 1-10
Author(s):  
Zoe Guan ◽  
Ronglai Shen ◽  
Colin B. Begg

Background: Many cancer types show considerable heritability, and extensive research has been done to identify germline susceptibility variants. Linkage studies have discovered many rare high-risk variants, and genome-wide association studies (GWAS) have discovered many common low-risk variants. However, it is believed that a considerable proportion of the heritability of cancer remains unexplained by known susceptibility variants. The “rare variant hypothesis” proposes that much of the missing heritability lies in rare variants that cannot reliably be detected by linkage analysis or GWAS. Until recently, high sequencing costs have precluded extensive surveys of rare variants, but technological advances have now made it possible to analyze rare variants on a much greater scale. Objectives: In this study, we investigated associations between rare variants and 14 cancer types. Methods: We ran association tests using whole-exome sequencing data from The Cancer Genome Atlas (TCGA) and validated the findings using data from the Pan-Cancer Analysis of Whole Genomes Consortium (PCAWG). Results: We identified four significant associations in TCGA, only one of which was replicated in PCAWG (BRCA1 and ovarian cancer). Conclusions: Our results provide little evidence in favor of the rare variant hypothesis. Much larger sample sizes may be needed to detect undiscovered rare cancer variants.


2021 ◽  
Vol 22 (1) ◽  
Author(s):  
Jonas Meisner ◽  
Anders Albrechtsen ◽  
Kristian Hanghøj

Abstract Background Identification of selection signatures between populations is often an important part of a population genetic study. With high-throughput DNA sequencing, larger sample sizes of populations with similar ancestries have become increasingly common. This has led to the need for methods capable of identifying signals of selection in populations with a continuous cline of genetic differentiation. Individuals from continuous populations are inherently challenging to group into meaningful units, which is why existing methods rely on principal component analysis for inference of the selection signals. These existing methods require called genotypes as input, which is problematic for studies based on low-coverage sequencing data. Materials and methods We have extended two principal component analysis-based selection statistics to genotype likelihood data and applied them to low-coverage sequencing data from the 1000 Genomes Project for populations with European and East Asian ancestry to detect signals of selection in samples with continuous population structure. Results Here, we present two selection statistics, which we have implemented in the framework. These methods account for genotype uncertainty, opening the opportunity to conduct selection scans in continuous populations from low and/or variable coverage sequencing data. To illustrate their use, we applied the methods to low-coverage sequencing data from human populations of East Asian and European ancestries and show that the implemented selection statistics can control the false positive rate and that they identify the same signatures of selection from low-coverage sequencing data as state-of-the-art software using high quality called genotypes. Conclusion We show that selection scans of low-coverage sequencing data of populations with similar ancestry perform on par with those obtained from high quality genotype data. Moreover, we demonstrate that they outperform selection statistics obtained from called genotypes from low-coverage sequencing data without the need for ad-hoc filtering.


2010 ◽  
Vol 118 (8) ◽  
pp. 487-506 ◽  
Author(s):  
Gavin R. Norton ◽  
Richard Brooksbank ◽  
Angela J. Woodiwiss

There is substantial evidence to suggest that BP (blood pressure) is an inherited trait. The introduction of gene technologies in the late 1980s generated a sharp phase of over-inflated prospects for polygenic traits such as hypertension. Not unexpectedly, the identification of the responsible loci in human populations has nevertheless proved to be a considerable challenge. Common variants of the RAS (renin–angiotensin system) genes, including of ACE (angiotensin-converting enzyme) and AGT (angiotensinogen) were some of the first shown to be associated with BP. Presently, ACE and AGT are the only gene variants with functional relevance, where linkage studies showing relationships with hypertension have been reproduced in some studies and where large population-based and prospective studies have demonstrated these genes to be predictors of hypertension or BP. Nevertheless, a lack of reproducibility in other linkage and association studies has generated scepticism that only a concerted effort to attempt to explain will rectify. Without these explanations, it is unlikely that this knowledge will translate into the clinical arena. In the present review, we show that many of the previous concerns in the field have been addressed, but we also argue that a considerable amount of careful thought is still required to achieve enlightenment with respect to the role of RAS genes in hypertension. We discuss whether the previously identified problems of poor study design have been completely addressed with regards to the impact of ACE and AGT genes on BP. In the context of RAS genes, we also question whether the significance of ‘incomplete penetrance’ through associated environmental, phenotypic or physiological effects has been duly accounted for; whether appropriate consideration has been given to epistatic interactions between genes; and whether future RAS gene studies should consider variation across the gene by evaluating ‘haplotypes’.


2016 ◽  
Author(s):  
Xin Li ◽  
Yungil Kim ◽  
Emily K. Tsang ◽  
Joe R. Davis ◽  
Farhan N. Damani ◽  
...  

Abstract Rare genetic variants are abundant in humans yet their functional effects are often unknown and challenging to predict. The Genotype-Tissue Expression (GTEx) project provides a unique opportunity to identify the functional impact of rare variants through combined analyses of whole genomes and multi-tissue RNA-sequencing data. Here, we identify gene expression outliers, or individuals with extreme expression levels, across 44 human tissues, and characterize the contribution of rare variation to these large changes in expression. We find 58% of underexpression and 28% of overexpression outliers have underlying rare variants compared with 9% of non-outliers. Large expression effects are enriched for proximal loss-of-function, splicing, and structural variants, particularly variants near the TSS and at evolutionarily conserved sites. Known disease genes have expression outliers, underscoring that rare variants can contribute to genetic disease risk. To prioritize functional rare regulatory variants, we develop RIVER, a Bayesian approach that integrates RNA and whole genome sequencing data from the same individual. RIVER predicts functional variants significantly better than models using genomic annotations alone, and is an extensible tool for personal genome interpretation. Overall, we demonstrate that rare variants contribute to large gene expression changes across tissues with potential health consequences, and provide an integrative method for interpreting rare variants in individual genomes.


2018 ◽  
Author(s):  
Roger Ros-Freixedes ◽  
Mara Battagin ◽  
Martin Johnsson ◽  
Gregor Gorjanc ◽  
Alan J Mileham ◽  
...  

Abstract Background Inherent sources of error and bias that affect the quality of the sequence data include index hopping and bias towards the reference allele. The impact of these artefacts is likely greater for low-coverage data than for high-coverage data because low-coverage data has scant information and standard tools for processing sequence data were designed for high-coverage data. With the proliferation of cost-effective low-coverage sequencing there is a need to understand the impact of these errors and bias on resulting genotype calls. Results We used a dataset of 26 pigs sequenced both at 2x with multiplexing and at 30x without multiplexing to show that index hopping and bias towards the reference allele due to alignment had little impact on genotype calls. However, pruning of alternative haplotypes supported by a number of reads below a predefined threshold, a default and desired step for removing potential sequencing errors in high-coverage data, introduced an unexpected bias towards the reference allele when applied to low-coverage data. This bias reduced best-guess genotype concordance of low-coverage sequence data by 19.0 absolute percentage points. Conclusions We propose a simple pipeline to correct this bias and we recommend that users of low-coverage sequencing be wary of unexpected biases produced by tools designed for high-coverage sequencing.
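
The pruning effect described in this abstract can be illustrated with a toy simulation. The sketch below is not the authors' pipeline: the function names, the naive genotype caller, and the `min_alt_reads` threshold are all hypothetical, and reads at a truly heterozygous site are assumed to carry each allele with probability 0.5.

```python
import random


def call_genotype(ref_reads, alt_reads, min_alt_reads=1):
    """Naive best-guess call from read counts (hypothetical caller).

    Alternative alleles supported by fewer than min_alt_reads reads
    are pruned before the call is made.
    """
    if alt_reads < min_alt_reads:
        alt_reads = 0  # prune weakly supported alternative allele
    if ref_reads == 0 and alt_reads == 0:
        return None  # no usable reads
    if alt_reads == 0:
        return "0/0"
    if ref_reads == 0:
        return "1/1"
    return "0/1"


def simulate_het_calls(n_sites, depth, min_alt_reads, seed=0):
    """Count calls at truly heterozygous sites sequenced at a fixed depth."""
    rng = random.Random(seed)
    counts = {"0/0": 0, "0/1": 0, "1/1": 0}
    for _ in range(n_sites):
        # Each read carries the alternative allele with probability 0.5.
        alt = sum(rng.random() < 0.5 for _ in range(depth))
        call = call_genotype(depth - alt, alt, min_alt_reads)
        if call is not None:
            counts[call] += 1
    return counts


if __name__ == "__main__":
    for depth in (2, 30):
        for threshold in (1, 2):
            print(depth, threshold, simulate_het_calls(10000, depth, threshold))
```

Under these assumptions, requiring two supporting reads at depth 2 turns every site with a single alternative read into a homozygous-reference call, while the same threshold has almost no effect at depth 30.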


PeerJ ◽  
2016 ◽  
Vol 4 ◽  
pp. e1612 ◽  
Author(s):  
Zachery T. Lewis ◽  
Jasmine C.C. Davis ◽  
Jennifer T. Smilowitz ◽  
J. Bruce German ◽  
Carlito B. Lebrilla ◽  
...  

Infant fecal samples are commonly studied to investigate the impacts of breastfeeding on the development of the microbiota and subsequent health effects. Comparisons of infants living in different geographic regions and environmental contexts are needed to aid our understanding of evolutionarily-selected milk adaptations. However, the preservation of fecal samples from individuals in remote locales until they can be processed can be a challenge. Freeze-drying (lyophilization) offers a cost-effective way to preserve some biological samples for transport and analysis at a later date. Currently, it is unknown what, if any, biases are introduced into various analyses by the freeze-drying process. Here, we investigated how freeze-drying affected analysis of two relevant and intertwined aspects of infant fecal samples, marker gene amplicon sequencing of the bacterial community and the fecal oligosaccharide profile (undigested human milk oligosaccharides). No differences were discovered between the fecal oligosaccharide profiles of wet and freeze-dried samples. The marker gene sequencing data showed an increase in proportional representation of Bacteroides and a decrease in detection of bifidobacteria and members of class Bacilli after freeze-drying. This sample treatment bias may possibly be related to the cell morphology of these different taxa (Gram status). However, these effects did not overwhelm the natural variation among individuals, as the community data still strongly grouped by subject and not by freeze-drying status. We also found that compensating for sample concentration during freeze-drying, while not necessary, was also not detrimental. Freeze-drying may therefore be an acceptable method of sample preservation and mass reduction for some studies of microbial ecology and milk glycan analysis.


2018 ◽  
Author(s):  
Torsten Günther ◽  
Carl Nettelblad

Abstract High quality reference genomes are an important resource in genomic research projects. A consequence is that DNA fragments carrying the reference allele will be more likely to map successfully, or receive higher quality scores. This reference bias can have effects on downstream population genomic analysis when heterozygous sites are falsely considered homozygous for the reference allele. In palaeogenomic studies of human populations, mapping against the human reference genome is used to identify endogenous human sequences. Ancient DNA studies usually operate with low sequencing coverages and fragmentation of DNA molecules causes a large proportion of the sequenced fragments to be shorter than 50 bp – reducing the amount of accepted mismatches, and increasing the probability of multiple matching sites in the genome. These ancient DNA specific properties are potentially exacerbating the impact of reference bias on downstream analyses, especially since most studies of ancient human populations use pseudohaploid data, i.e. they randomly sample only one sequencing read per site. We show that reference bias is pervasive in published ancient DNA sequence data of pre-historic humans with some differences between individual genomic regions. We illustrate that the strength of reference bias is negatively correlated with fragment length. Reference bias can cause differences in the results of downstream analyses such as population affinities, heterozygosity estimates and estimates of archaic ancestry. These spurious results highlight how important it is to be aware of these technical artifacts and that we need strategies to mitigate the effect. Therefore, we suggest some post-mapping filtering strategies to resolve reference bias which help to reduce its impact substantially.
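
How reference bias carries over into pseudohaploid data can be sketched with a toy model. The code below is an illustration under stated assumptions, not the authors' analysis: the function names and the `alt_map_prob` parameter are hypothetical, reads at a heterozygous site are assumed to carry each allele with probability 0.5, and reference-allele reads are assumed to always map while alternative-allele reads map with probability `alt_map_prob`.

```python
import random


def expected_ref_fraction(alt_map_prob, ref_map_prob=1.0):
    """Expected share of reference alleles among mapped reads at a het site."""
    return ref_map_prob / (ref_map_prob + alt_map_prob)


def pseudohaploid_ref_fraction(n_sites, depth, alt_map_prob, seed=0):
    """Simulate pseudohaploid calls: one random mapped read per site."""
    rng = random.Random(seed)
    ref_calls = 0
    called = 0
    for _ in range(n_sites):
        mapped = []
        for _ in range(depth):
            allele = "ref" if rng.random() < 0.5 else "alt"
            # Assumption: reference reads always map, alternative reads
            # map only with probability alt_map_prob.
            if allele == "ref" or rng.random() < alt_map_prob:
                mapped.append(allele)
        if mapped:
            called += 1
            ref_calls += rng.choice(mapped) == "ref"
    return ref_calls / called


if __name__ == "__main__":
    for p in (1.0, 0.9, 0.8):
        print(p, expected_ref_fraction(p), pseudohaploid_ref_fraction(20000, 2, p))
```

With no mapping penalty the sampled allele is the reference half of the time; any value of `alt_map_prob` below 1 pushes that share above 0.5, which is the kind of skew that can distort heterozygosity and ancestry estimates.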

