Evaluating genotype imputation pipeline for ultra-low coverage ancient genomes

2020
Vol 10 (1)
Author(s):  
Ruoyun Hui ◽  
Eugenia D’Atanasio ◽  
Lara M. Cassidy ◽  
Christiana L. Scheib ◽  
Toomas Kivisild

Abstract Although ancient DNA data have become increasingly important in studies of past populations, it is often not feasible or practical to obtain high coverage genomes from poorly preserved samples. While accurate genotype imputation from > 1× coverage data has recently become routine, a large proportion of ancient samples remain unusable for downstream analyses due to their low coverage. Here, we evaluate a two-step pipeline for the imputation of common variants in ancient genomes at 0.05–1× coverage. We use the genotype likelihood input mode in Beagle and filter for confident genotypes as the input to impute missing genotypes. This procedure, when tested on ancient genomes, outperforms single-step imputation from genotype likelihoods, suggesting that current genotype callers do not fully account for errors in ancient sequences and that additional quality controls can be beneficial. We compared the effects of various genotype likelihood calling methods; post-calling, pre-imputation and post-imputation filters; different reference panels; and different imputation tools. In a Neolithic Hungarian genome, we obtain ~90% imputation accuracy for heterozygous common variants at coverage 0.05× and > 97% accuracy at coverage 0.5×. We show that imputation can mitigate, though not eliminate, reference bias in ultra-low coverage ancient genomes.
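The second step of the pipeline described above keeps only genotypes called with high confidence in the first Beagle pass and masks the rest for re-imputation. A minimal sketch of that filtering step, with a hypothetical posterior-probability threshold of 0.99 and simplified record tuples standing in for parsed VCF entries:

```python
def filter_confident_genotypes(records, gp_threshold=0.99):
    """Keep genotypes whose posterior probability (GP) from the first
    Beagle pass clears a threshold; the rest are set to missing so the
    second pass can impute them. `records` holds simplified
    (site_id, genotype, max_GP) tuples standing in for VCF entries."""
    kept, masked = [], []
    for site_id, genotype, gp in records:
        if gp >= gp_threshold:
            kept.append((site_id, genotype))
        else:
            masked.append(site_id)  # re-imputed from the panel in step two
    return kept, masked

records = [("rs1", "0/1", 0.999), ("rs2", "0/0", 0.62), ("rs3", "1/1", 0.995)]
kept, masked = filter_confident_genotypes(records)
# kept: rs1 and rs3 survive; rs2 is masked for re-imputation
```

The threshold value is an assumption for illustration; the paper evaluates several post-calling filters.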

2021
Author(s):  
Kristiina Ausmees ◽  
Federico Sanchez-Quinto ◽  
Mattias Jakobsson ◽  
Carl Nettelblad

With the ability to sequence ancient DNA to high coverage often limited by sample quality or cost, imputation of missing genotypes offers a way to increase both the power of inference and the cost-effectiveness of analyses of ancient data. However, the high degree of uncertainty often associated with ancient DNA poses several methodological challenges, and the performance of imputation methods in this context has not been fully explored. To gain further insights, we performed a systematic evaluation of imputation of ancient data using Beagle 4.0 and reference data from phase 3 of the 1000 Genomes project, investigating the effects of coverage, phased reference and study sample size. Making use of five ancient samples with high-coverage data available, we evaluated imputed data with respect to accuracy, reference bias and genetic affinities as captured by PCA. We obtained genotype concordance levels of over 99% for data with 1x coverage, and similar levels of accuracy and reference bias at coverages as low as 0.75x. Our findings suggest that using imputed data can be a realistic option for various population genetic analyses even for data in coverage ranges below 1x. We also show that a large and varied phased reference set, as well as the inclusion of low- to moderate-coverage ancient samples, can increase imputation performance, particularly for rare alleles. In-depth analysis of imputed data with respect to genetic variants and allele frequencies gave further insight into the nature of errors arising during imputation, and can provide practical guidelines for post-processing and validation prior to downstream analysis.
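Genotype concordance of the kind reported here can be computed by comparing imputed calls against high-coverage truth genotypes. A toy sketch with hypothetical genotype strings (unphased comparison, so "0/1" and "1/0" match):

```python
def genotype_concordance(truth, imputed):
    """Fraction of sites where the imputed genotype matches the
    high-coverage truth genotype, ignoring phase."""
    matches = sum(
        sorted(t.split("/")) == sorted(i.split("/"))
        for t, i in zip(truth, imputed)
    )
    return matches / len(truth)

truth   = ["0/0", "0/1", "1/1", "0/1"]
imputed = ["0/0", "1/0", "1/1", "0/0"]
print(genotype_concordance(truth, imputed))  # 0.75
```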


2022
Author(s):  
Lars Wienbrandt ◽  
David Ellinghaus

Background: Reference-based phasing and genotype imputation algorithms have been developed with sublinear theoretical runtime behaviour, but runtimes are still high in practice when large genome-wide reference datasets are used. Methods: We developed EagleImp, a software tool with algorithmic and technical improvements and new features for accurate and accelerated phasing and imputation in a single tool. Results: We compared the accuracy and runtime of EagleImp with Eagle2, PBWT and prominent imputation servers using whole-genome sequencing data from the 1000 Genomes Project, the Haplotype Reference Consortium and simulated data with more than 1 million reference genomes. EagleImp is 2 to 10 times faster (depending on the single- or multiprocessor configuration selected) than Eagle2/PBWT, with the same or better phasing and imputation quality in all tested scenarios. For common variants investigated in typical GWAS, EagleImp provides the same or higher imputation accuracy than the Sanger Imputation Service, Michigan Imputation Server and the newly developed TOPMed Imputation Server, despite their larger (not publicly available) reference panels. It has many new features, including automated chromosome splitting and memory management at runtime to avoid job aborts, fast reading and writing of large files, and various user-configurable algorithm and output options. Conclusions: Due to the technical optimisations, EagleImp can perform fast and accurate reference-based phasing and imputation for future very large reference panels with more than 1 million genomes. EagleImp is freely available for download from https://github.com/ikmb/eagleimp.


2019
Vol 11 (1)
Author(s):  
Julian R. Homburger ◽  
Cynthia L. Neben ◽  
Gilad Mishne ◽  
Alicia Y. Zhou ◽  
Sekar Kathiresan ◽  
...  

Abstract Background Inherited susceptibility to common, complex diseases may be caused by rare, pathogenic variants (“monogenic”) or by the cumulative effect of numerous common variants (“polygenic”). Comprehensive genome interpretation should enable assessment for both monogenic and polygenic components of inherited risk. The traditional approach requires two distinct genetic testing technologies—high coverage sequencing of known genes to detect monogenic variants and a genome-wide genotyping array followed by imputation to calculate genome-wide polygenic scores (GPSs). We assessed the feasibility and accuracy of using low coverage whole genome sequencing (lcWGS) as an alternative to genotyping arrays to calculate GPSs. Methods First, we performed downsampling and imputation of WGS data from ten individuals to assess concordance with known genotypes. Second, we assessed the correlation between GPSs for 3 common diseases—coronary artery disease (CAD), breast cancer (BC), and atrial fibrillation (AF)—calculated using lcWGS and genotyping array in 184 samples. Third, we assessed concordance of lcWGS-based genotype calls and GPS calculation in 120 individuals with known genotypes, selected to reflect diverse ancestral backgrounds. Fourth, we assessed the relationship between GPSs calculated using lcWGS and disease phenotypes in a cohort of 11,502 individuals of European ancestry. Results We found imputation accuracy r2 values of greater than 0.90 for all ten samples—including those of African and Ashkenazi Jewish ancestry—with lcWGS data at 0.5×. GPSs calculated using lcWGS and genotyping array followed by imputation in 184 individuals were highly correlated for each of the 3 common diseases (r2 = 0.93–0.97) with similar score distributions. Using lcWGS data from 120 individuals of diverse ancestral backgrounds, we found similar results with respect to imputation accuracy and GPS correlations. 
Finally, we calculated GPSs for CAD, BC, and AF using lcWGS in 11,502 individuals of European ancestry, confirming odds ratios per standard deviation increment ranging from 1.28 to 1.59, consistent with previous studies. Conclusions lcWGS is an alternative technology to genotyping arrays for common genetic variant assessment and GPS calculation. lcWGS provides comparable imputation accuracy while also overcoming the ascertainment bias inherent to variant selection in genotyping array design.
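A genome-wide polygenic score is, at its core, a weighted sum of per-variant allele dosages, which is why imputed lcWGS dosages can substitute for array genotypes. An illustrative sketch with hypothetical effect sizes and dosages (not the scores or weights used in the study):

```python
def polygenic_score(dosages, weights):
    """GPS as a weighted sum of allele dosages.
    dosages: per-variant expected alt-allele counts in [0, 2];
    weights: per-variant effect sizes (e.g. log odds ratios)."""
    return sum(d * w for d, w in zip(dosages, weights))

weights = [0.12, -0.08, 0.30]    # hypothetical effect sizes
array_dosages = [1.0, 2.0, 0.0]  # hard calls from a genotyping array
lcwgs_dosages = [1.1, 1.9, 0.1]  # imputed expected dosages from 0.5x lcWGS
print(polygenic_score(array_dosages, weights))
print(polygenic_score(lcwgs_dosages, weights))  # close to the array score
```

The closeness of the two scores across many individuals is what the reported r² of 0.93–0.97 quantifies.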


2021
Author(s):  
Changheng Zhao ◽  
Jun Teng ◽  
Xinhao Zhang ◽  
Dan Wang ◽  
Xinyi Zhang ◽  
...  

Abstract Background Low coverage whole genome sequencing is a low-cost genotyping technology. Combined with genotype imputation approaches, it is likely to become a critical component of cost-efficient genomic selection programs in agricultural livestock. Here, we used low-coverage sequence data from 617 Dezhou donkeys to investigate the performance of genotype imputation for low coverage whole genome sequence data and of genomic selection based on the imputed genotype data. The specific aims were: (i) to measure the accuracy of genotype imputation under different sequencing depths, sample sizes, MAFs, and imputation pipelines; and (ii) to assess the accuracy of genomic selection under different marker densities derived from the imputed sequence data, different strategies for constructing the genomic relationship matrices, and single- vs multi-trait models. Results We found that high imputation accuracy (> 0.95) can be achieved for sequence data with sequencing depth as low as 1x and 400 sequenced individuals. For genomic selection, the best performance was obtained by using a marker density of 410K and a G matrix constructed using marker dosage information. Multi-trait GBLUP performed better than single-trait GBLUP. Conclusions Our study demonstrates that low coverage whole genome sequencing would be a cost-effective method for genomic selection in Dezhou donkeys.
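A G matrix "constructed using marker dosage information" is commonly the VanRaden (method 1) genomic relationship matrix: dosages are centred by twice the allele frequency and the cross-product is scaled by 2·Σp(1−p). A small pure-Python sketch (the 3×3 example data are hypothetical; the abstract does not specify which G construction was best beyond using dosages):

```python
def vanraden_G(dosages, freqs):
    """VanRaden (method 1) genomic relationship matrix.
    dosages: rows = individuals, values 0..2 (or imputed expectations);
    freqs: per-marker alt-allele frequencies."""
    denom = 2.0 * sum(p * (1.0 - p) for p in freqs)
    # centre each dosage by 2p so G reflects deviations from expectation
    Z = [[d - 2.0 * p for d, p in zip(row, freqs)] for row in dosages]
    n, m = len(Z), len(freqs)
    return [
        [sum(Z[i][k] * Z[j][k] for k in range(m)) / denom for j in range(n)]
        for i in range(n)
    ]

M = [[0, 1, 2], [1, 1, 1], [2, 1, 0]]  # three individuals, three markers
G = vanraden_G(M, [0.5, 0.5, 0.5])
# off-diagonal entries measure genomic relationships between individuals
```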


2018
Author(s):  
Kay Prüfer

Abstract Motivation The study of ancient genomes can elucidate the evolutionary past. However, analyses are complicated by base modifications in ancient DNA molecules that result in errors in DNA sequences. These errors are particularly common near the ends of sequences and pose a challenge for genotype calling. Results I describe an expectation-maximization algorithm that estimates genotype frequencies and errors along sequences to allow for accurate genotype calling from ancient sequences. The implementation of this method, called snpAD, performs well on high-coverage ancient data, as shown by simulations and by subsampling the data of a high-coverage Neandertal genome. Although estimates for low-coverage genomes are less accurate, I am able to derive approximate estimates of heterozygosity from several low-coverage Neandertals. These estimates show that low heterozygosity, compared to modern humans, was common among Neandertals. Availability The C++ code of snpAD is freely available at http://bioinf.eva.mpg.de/snpAD/ Contact [email protected] Supplementary information Supplementary data are available at Bioinformatics online.
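The expectation-maximization pattern behind such callers can be illustrated on a much simpler problem: estimating an allele frequency from per-individual genotype likelihoods under Hardy-Weinberg priors. This toy sketch is not snpAD's algorithm (which additionally models position-dependent errors); it only shows the E-step/M-step loop with hypothetical likelihoods:

```python
def em_allele_freq(gls, iters=50):
    """EM estimate of the alt-allele frequency from genotype
    likelihoods (L(ref/ref), L(ref/alt), L(alt/alt)) per individual,
    assuming Hardy-Weinberg genotype priors."""
    f = 0.5
    for _ in range(iters):
        total = 0.0
        for l0, l1, l2 in gls:
            # E-step: posterior over genotypes given current f
            pri = ((1 - f) ** 2, 2 * f * (1 - f), f ** 2)
            post = [l * p for l, p in zip((l0, l1, l2), pri)]
            s = sum(post)
            total += (post[1] + 2 * post[2]) / s  # expected alt count
        # M-step: update f from expected allele counts
        f = total / (2 * len(gls))
    return f

# three individuals: likely hom-ref, het, hom-alt
gls = [(0.98, 0.01, 0.01), (0.05, 0.90, 0.05), (0.01, 0.01, 0.98)]
print(em_allele_freq(gls))  # converges near 0.5
```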


2020
Author(s):  
Benjamin B. Chu ◽  
Eric M. Sobel ◽  
Rory Wasiolek ◽  
Janet S. Sinsheimer ◽  
Hua Zhou ◽  
...  

Abstract Current methods for genotype imputation and phasing exploit the sheer volume of data in haplotype reference panels and rely on hidden Markov models. Existing programs all have essentially the same imputation accuracy, are computationally intensive, and generally require pre-phasing the typed markers. We propose a novel data-mining method for genotype imputation and phasing that substitutes highly efficient linear algebra routines for hidden Markov model calculations. This strategy, embodied in our Julia program MendelImpute.jl, avoids explicit assumptions about recombination and population structure while delivering similar prediction accuracy, better memory usage, and an order of magnitude or better run-times compared to the fastest competing method. MendelImpute operates on both dosage data and unphased genotype data and simultaneously imputes missing genotypes and phase at both the typed and untyped SNPs. Finally, MendelImpute naturally extends to global and local ancestry estimation and lends itself to new strategies for data compression and hence faster data transport and sharing.
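The underlying idea can be illustrated with a brute-force version: choose the pair of reference haplotypes whose sum best matches the observed dosages in least squares, then read the missing genotypes off that pair. This toy sketch is not MendelImpute's optimized search, just the principle, with hypothetical haplotypes:

```python
import itertools

def best_haplotype_pair(genotype, haplotypes):
    """Find the pair of reference haplotypes whose sum best matches the
    observed dosage vector in least squares; missing entries (None) are
    skipped in the fit and then filled from the chosen pair."""
    def sq_err(pair):
        h1, h2 = pair
        return sum(
            (g - (a + b)) ** 2
            for g, a, b in zip(genotype, h1, h2)
            if g is not None
        )
    h1, h2 = min(itertools.combinations_with_replacement(haplotypes, 2), key=sq_err)
    return [a + b for a, b in zip(h1, h2)]

ref = [(0, 0, 1, 0), (1, 0, 1, 1), (0, 1, 0, 0)]  # hypothetical panel
print(best_haplotype_pair([1, None, 2, 1], ref))  # [1, 0, 2, 1]
```

MendelImpute replaces this quadratic-in-panel-size search with fast linear algebra over haplotype windows.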


2022
Vol 12
Author(s):  
Tianyu Deng ◽  
Pengfei Zhang ◽  
Dorian Garrick ◽  
Huijiang Gao ◽  
Lixian Wang ◽  
...  

Genotype imputation is the term used to describe the process of inferring unobserved genotypes in a sample of individuals. It is a key step prior to a genome-wide association study (GWAS) or genomic prediction, and imputation accuracy directly influences the results of subsequent analyses. In this simulation-based study, we investigate the accuracy of genotype imputation in relation to factors characterizing SNP chip or low-coverage whole-genome sequencing (LCWGS) data. The factors included the imputation reference population size, the proportion of target markers (SNP density), the genetic relationship (distance) between the target population and the reference population, and the imputation method. Simulations of genotypes were based on coalescence theory accounting for the demographic history of pigs. A population of simulated founders diverged to produce four separate but related populations of descendants. The genomic data of 20,000 individuals were simulated for a 10-Mb chromosome fragment. Our results showed that the proportion of target markers (SNP density) was the most critical factor affecting imputation accuracy in all imputation situations. Compared with Minimac4, Beagle5.1 produced higher-accuracy imputed data in most cases, most notably when imputing from the LCWGS data. Compared with SNP chip data, LCWGS provided more accurate genotype imputation. Our findings provide a relatively comprehensive insight into the accuracy of genotype imputation in a realistic population of domestic animals.
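Imputation accuracy in studies like this is often measured as the correlation between true and imputed allele dosages. A small self-contained sketch with hypothetical dosage vectors:

```python
def dosage_r(true_dosages, imputed_dosages):
    """Imputation accuracy as the Pearson correlation between true and
    imputed alt-allele dosages, a standard accuracy metric."""
    n = len(true_dosages)
    mt = sum(true_dosages) / n
    mi = sum(imputed_dosages) / n
    cov = sum((t - mt) * (i - mi) for t, i in zip(true_dosages, imputed_dosages))
    vt = sum((t - mt) ** 2 for t in true_dosages)
    vi = sum((i - mi) ** 2 for i in imputed_dosages)
    return cov / (vt * vi) ** 0.5

print(dosage_r([0, 1, 2, 1], [0.1, 0.9, 1.8, 1.2]))  # high, but below 1.0
```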


2018
Author(s):  
Andrew Whalen ◽  
John M Hickey ◽  
Gregor Gorjanc

In this paper we evaluate the performance of using family-specific low-density genotype arrays to increase the accuracy of pedigree-based imputation. Genotype imputation is a widely used tool that decreases the costs of genotyping a population: the majority of individuals are genotyped with a low-density array, and statistical regularities between the low-density and high-density individuals are used to fill in the missing genotypes. Previous work on population-based imputation has found that it is possible to increase the accuracy of imputation by maximizing the number of informative markers on an array. In the context of pedigree-based imputation, where the informativeness of a marker depends only on the genotypes of an individual's parents, it may be beneficial to select the markers on each low-density array on a family-by-family basis. In this paper we examined four family-specific low-density marker selection strategies and evaluated their performance on a real pig breeding dataset. We found that family-specific or sire-specific arrays could increase imputation accuracy by 0.11 at 1 marker per chromosome, by 0.027 at 25 markers per chromosome, and by 0.007 at 100 markers per chromosome. These results suggest that there may be room to use family-specific genotyping for very-low-density arrays, particularly if a given sire or sire-dam pairing has a large number of offspring.
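The key observation, that a marker is informative for pedigree-based imputation only when a parent is heterozygous (a homozygous parent always transmits the same allele), suggests a simple per-family ranking. This sketch is an illustrative stand-in for the four strategies compared in the paper, not a reimplementation of any of them:

```python
def rank_markers_for_family(sire, dam):
    """Rank candidate markers for a family-specific low-density array by
    how many parents are heterozygous at each marker. Genotypes are
    coded as 0/1/2 alt-allele counts; returns marker indices, most
    informative first (ties keep original order)."""
    def het_parents(i):
        return (sire[i] == 1) + (dam[i] == 1)  # 0, 1, or 2
    return sorted(range(len(sire)), key=het_parents, reverse=True)

sire = [0, 1, 2, 1]  # hypothetical parental genotypes
dam  = [1, 1, 0, 2]
print(rank_markers_for_family(sire, dam))  # [1, 0, 3, 2]
```

Marker 1 ranks first because both parents are heterozygous there; marker 2 ranks last because both parents are homozygous and it carries no pedigree information.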


2017
Author(s):  
Roger Ros-Freixedes ◽  
Serap Gonen ◽  
Gregor Gorjanc ◽  
John M Hickey

Abstract Background This paper describes a heuristic method for allocating low-coverage sequencing resources by targeting haplotypes rather than individuals. Low-coverage sequencing assembles high-coverage sequence information for every individual by accumulating data from the genome segments that they share with many other individuals into consensus haplotypes. Deriving the consensus haplotypes accurately is critical for achieving high phasing and imputation accuracy. In order to enable accurate phasing and imputation of sequence information for the whole population, we allocate the available sequencing resources among individuals with existing phased genomic data by targeting the sequencing coverage of their haplotypes. Results Our method, called AlphaSeqOpt, prioritizes haplotypes using a score function that is based on the frequency of the haplotypes in the sequencing set relative to the target coverage. AlphaSeqOpt has two steps: (1) selection of an initial set of individuals by iteratively choosing the individuals that have the maximum score conditional on the current set, and (2) refinement of the set through several rounds of exchanges of individuals. AlphaSeqOpt is very effective for distributing a fixed amount of sequencing resources evenly across haplotypes, which reduces the proportion of haplotypes that are sequenced below the target coverage. AlphaSeqOpt can provide a greater proportion of haplotypes sequenced at the target coverage while sequencing fewer individuals, as compared with other methods that use a score function based on the haplotypes' population frequency. A refinement of the initially selected set can provide a larger, more diverse set with more unique individuals, which is beneficial in the context of low-coverage sequencing. We extend the method with an approach that filters rare haplotypes based on their flanking haplotypes, so that only those that are likely to derive from a recombination event are targeted. Conclusions We present a method for allocating sequencing resources so that a greater proportion of haplotypes are sequenced at a coverage that is sufficiently high for population-based imputation with low-coverage sequencing. The haplotype score function, the refinement step, and the new approach of filtering rare haplotypes make AlphaSeqOpt more effective for that purpose than previously reported methods for reducing sequencing redundancy.
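Step (1), iterative selection by maximum conditional score, can be sketched as a greedy loop in which each haplotype contributes to an individual's score only until it reaches the target coverage. The score function and data layout below are simplified stand-ins for those in AlphaSeqOpt, with hypothetical individuals and haplotype ids:

```python
def greedy_select(individuals, target_coverage, budget):
    """Greedily pick `budget` individuals to sequence. `individuals`
    maps name -> (list of carried haplotype ids, per-individual
    sequencing coverage). An individual's score is the coverage its
    haplotypes would add, capped at the target, so already-covered
    haplotypes contribute nothing."""
    coverage = {}  # accumulated coverage per haplotype
    chosen = []
    for _ in range(budget):
        def gain(item):
            _, (haps, cov) = item
            return sum(
                min(cov, target_coverage - coverage.get(h, 0.0))
                for h in haps
                if coverage.get(h, 0.0) < target_coverage
            )
        name, (haps, cov) = max(
            (it for it in individuals.items() if it[0] not in chosen),
            key=gain,
        )
        chosen.append(name)
        for h in haps:
            coverage[h] = coverage.get(h, 0.0) + cov
    return chosen

inds = {
    "A": (["h1", "h2"], 1.0),
    "B": (["h1", "h1"], 1.0),  # homozygous carrier of h1 only
    "C": (["h3", "h4"], 1.0),
}
print(greedy_select(inds, target_coverage=1.0, budget=2))  # ['A', 'C']
```

Once A covers h1 and h2, individual B adds nothing new, so the greedy step spreads the budget to C's haplotypes instead, which is the evenness the method aims for.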


2022
Author(s):  
Hao Gong ◽  
Bin Han

Abstract Many software packages and pipelines have been developed to handle sequence data from model species. However, genotyping from complex heterozygous plant genomes needs further improvement over previous methods. Here we present a new pipeline (available at https://github.com/Ncgrhg/HetMapv1) for variant calling and missing genotype imputation from low coverage sequence data for heterozygous plant genomes. To check its performance on real sequence data, HetMap was applied to both an F1 hybrid rice population of 1495 samples and a wild rice population of 446 samples. Four high coverage hybrid rice accessions and two high coverage wild rice accessions, which were also included in the low coverage sequence data, were used to validate genotype inference accuracy. The validation results showed that HetMap achieved a significant improvement in heterozygous genotype inference accuracy (13.65% for hybrid rice, 26.05% for wild rice) and in total accuracy compared with other similar software packages. Applying the new genotypes in genome-wide association studies also improved association power for two wild rice phenotypes. HetMap achieves high genotype inference accuracy at low sequence coverage and with small population sizes, in both natural populations and constructed recombinant populations. HetMap provides a powerful tool for heterozygous plant genome sequence data analysis, which may help discover new phenotype-associated regions in plant species with complex heterozygous genomes.

