Comparison of Genotype Imputation for SNP Array and Low-Coverage Whole-Genome Sequencing Data

Genotype imputation is the process of inferring unobserved genotypes in a sample of individuals. It is a key step prior to a genome-wide association study (GWAS) or genomic prediction, and imputation accuracy directly influences the results of subsequent analyses. In this simulation-based study, we investigate the accuracy of genotype imputation in relation to several factors characterizing SNP chip or low-coverage whole-genome sequencing (LCWGS) data. The factors included the size of the imputation reference population, the proportion of target markers/SNP density, the genetic relationship (distance) between the target population and the reference population, and the imputation method. Simulations of genotypes were based on coalescence theory accounting for the demographic history of pigs. A population of simulated founders diverged to produce four separate but related populations of descendants. The genomic data of 20,000 individuals were simulated for a 10-Mb chromosome fragment. Our results showed that the proportion of target markers or SNP density was the most critical factor affecting imputation accuracy under all imputation situations. Compared with Minimac4, Beagle5.1 produced more accurate imputed data in most cases, most notably when imputing from the LCWGS data. Compared with SNP chip data, LCWGS provided more accurate genotype imputation. Our findings provide a relatively comprehensive insight into the accuracy of genotype imputation in a realistic population of domestic animals.
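The abstract above does not say which accuracy metric was used. As a minimal sketch, two commonly used measures are genotype concordance (the share of best-guess calls that match the truth) and dosage r² (the squared Pearson correlation between true genotypes and imputed dosages). The function name `imputation_accuracy` and the toy arrays below are illustrative assumptions, not part of the study.

```python
import numpy as np

def imputation_accuracy(true_geno, imputed_dosage):
    """Return (concordance, r2) for true genotypes (0/1/2) vs. imputed dosages.

    Hypothetical helper for illustration; the study's own metric is not stated.
    """
    true_geno = np.asarray(true_geno, dtype=float)
    imputed_dosage = np.asarray(imputed_dosage, dtype=float)
    # Concordance: round each dosage to the nearest genotype and compare.
    best_guess = np.rint(imputed_dosage)
    concordance = float(np.mean(best_guess == true_geno))
    # Dosage r2: squared Pearson correlation between truth and dosage.
    r = np.corrcoef(true_geno.ravel(), imputed_dosage.ravel())[0, 1]
    return concordance, float(r ** 2)

# Toy data: six genotypes at one marker (made up for the example).
true_geno = np.array([0, 1, 2, 1, 0, 2])
imputed = np.array([0.1, 0.9, 1.8, 1.2, 0.4, 1.4])
concordance, r2 = imputation_accuracy(true_geno, imputed)
```

Dosage r² is often preferred for low-frequency variants, where concordance can look high simply because most genotypes are homozygous for the major allele.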


Optimizing Genomic Selection in Dezhou Donkey Using Low Coverage Whole Genome Sequencing

10.21203/rs.3.rs-607740/v1 ◽

2021 ◽

Author(s):

Changheng Zhao ◽

Jun Teng ◽

Xinhao Zhang ◽

Dan Wang ◽

Xinyi Zhang ◽

...

Keyword(s):

Whole Genome Sequencing ◽

Genomic Selection ◽

Genome Sequencing ◽

Sequence Data ◽

Low Cost ◽

Imputation Accuracy ◽

Genotype Imputation ◽

Whole Genome Sequence ◽

Whole Genome ◽

Low Coverage

Abstract

Background: Low coverage whole genome sequencing is a low-cost genotyping technology. Combined with genotype imputation approaches, it is likely to become a critical component of cost-efficient genomic selection programs in agricultural livestock. Here, we used the low-coverage sequence data of 617 Dezhou donkeys to investigate the performance of genotype imputation for low coverage whole genome sequence data and genomic selection based on the imputed genotype data. The specific aims were: (i) to measure the accuracy of genotype imputation under different sequencing depths, sample sizes, MAFs, and imputation pipelines; and (ii) to assess the accuracy of genomic selection under different marker densities derived from the imputed sequence data, different strategies for constructing the genomic relationship matrices, and single- vs multi-trait models.

Results: We found that a high imputation accuracy (> 0.95) can be achieved for sequence data with sequencing depth as low as 1x and the number of sequenced individuals equal to 400. For genomic selection, the best performance was obtained by using a marker density of 410K and a G matrix constructed using marker dosage information. Multi-trait GBLUP performed better than single-trait GBLUP.

Conclusions: Our study demonstrates that low coverage whole genome sequencing would be a cost-effective method for genomic selection in Dezhou donkeys.
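The abstract mentions a G matrix built from marker dosage information but does not give the formula. As a sketch under that caveat, one widely used construction (VanRaden's first method) can be applied directly to continuous dosages in [0, 2] instead of hard genotype calls. The function name `g_matrix_from_dosages` and the random input are assumptions made for illustration only.

```python
import numpy as np

def g_matrix_from_dosages(dosages):
    """Genomic relationship matrix from an (individuals x markers) dosage array.

    Assumes a VanRaden-style formula, G = Z Z' / (2 * sum(p * (1 - p))),
    where Z is the dosage matrix centered by twice the allele frequency.
    """
    dosages = np.asarray(dosages, dtype=float)
    p = dosages.mean(axis=0) / 2.0          # allele frequency per marker
    keep = (p > 0.0) & (p < 1.0)            # drop monomorphic markers
    Z = dosages[:, keep] - 2.0 * p[keep]    # center each marker column
    denom = 2.0 * np.sum(p[keep] * (1.0 - p[keep]))
    return Z @ Z.T / denom

# Toy input: 5 individuals, 200 markers, dosages drawn at random.
rng = np.random.default_rng(0)
dosages = rng.uniform(0.0, 2.0, size=(5, 200))
G = g_matrix_from_dosages(dosages)
```

Using dosages rather than rounded genotypes carries the imputation uncertainty into the relationship estimates, which is one plausible reason a dosage-based G could perform well with low-coverage data.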


Low coverage whole genome sequencing enables accurate assessment of common variants and calculation of genome-wide polygenic scores

Genome Medicine ◽

10.1186/s13073-019-0682-2 ◽

2019 ◽

Vol 11 (1) ◽

Cited By ~ 7

Author(s):

Julian R. Homburger ◽

Cynthia L. Neben ◽

Gilad Mishne ◽

Alicia Y. Zhou ◽

Sekar Kathiresan ◽

...

Keyword(s):

Whole Genome Sequencing ◽

Genome Sequencing ◽

Imputation Accuracy ◽

European Ancestry ◽

Whole Genome ◽

Common Variants ◽

Genotyping Array ◽

Genome Wide ◽

Polygenic Scores ◽

Low Coverage

Abstract

Background: Inherited susceptibility to common, complex diseases may be caused by rare, pathogenic variants (“monogenic”) or by the cumulative effect of numerous common variants (“polygenic”). Comprehensive genome interpretation should enable assessment for both monogenic and polygenic components of inherited risk. The traditional approach requires two distinct genetic testing technologies: high coverage sequencing of known genes to detect monogenic variants and a genome-wide genotyping array followed by imputation to calculate genome-wide polygenic scores (GPSs). We assessed the feasibility and accuracy of using low coverage whole genome sequencing (lcWGS) as an alternative to genotyping arrays to calculate GPSs.

Methods: First, we performed downsampling and imputation of WGS data from ten individuals to assess concordance with known genotypes. Second, we assessed the correlation between GPSs for 3 common diseases (coronary artery disease (CAD), breast cancer (BC), and atrial fibrillation (AF)) calculated using lcWGS and genotyping array in 184 samples. Third, we assessed concordance of lcWGS-based genotype calls and GPS calculation in 120 individuals with known genotypes, selected to reflect diverse ancestral backgrounds. Fourth, we assessed the relationship between GPSs calculated using lcWGS and disease phenotypes in a cohort of 11,502 individuals of European ancestry.

Results: We found imputation accuracy r² values of greater than 0.90 for all ten samples, including those of African and Ashkenazi Jewish ancestry, with lcWGS data at 0.5×. GPSs calculated using lcWGS and genotyping array followed by imputation in 184 individuals were highly correlated for each of the 3 common diseases (r² = 0.93–0.97) with similar score distributions. Using lcWGS data from 120 individuals of diverse ancestral backgrounds, we found similar results with respect to imputation accuracy and GPS correlations. Finally, we calculated GPSs for CAD, BC, and AF using lcWGS in 11,502 individuals of European ancestry, confirming odds ratios per standard deviation increment ranging 1.28 to 1.59, consistent with previous studies.

Conclusions: lcWGS is an alternative technology to genotyping arrays for common genetic variant assessment and GPS calculation. lcWGS provides comparable imputation accuracy while also overcoming the ascertainment bias inherent to variant selection in genotyping array design.
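A genome-wide polygenic score is, in its usual form, a weighted sum of allele dosages across variants. The sketch below shows that calculation and the standardization needed to express effects "per standard deviation increment". The function names, weights, and dosages are all made up for illustration and are not taken from the study.

```python
import numpy as np

def polygenic_scores(dosages, weights):
    """Weighted sum of dosages: one score per individual (row)."""
    dosages = np.asarray(dosages, dtype=float)
    weights = np.asarray(weights, dtype=float)
    return dosages @ weights

def standardize(scores):
    """Rescale scores to mean 0 and standard deviation 1."""
    scores = np.asarray(scores, dtype=float)
    return (scores - scores.mean()) / scores.std()

# Toy example: 3 individuals x 3 variants, with hypothetical effect weights.
dosages = np.array([[0.0, 1.0, 2.0],
                    [1.0, 1.0, 0.0],
                    [2.0, 0.0, 1.0]])
weights = np.array([0.10, -0.05, 0.20])
raw = polygenic_scores(dosages, weights)
z = standardize(raw)
```

Because the score only needs dosages, it can be computed the same way from array-imputed or lcWGS-imputed data, which is what makes a direct correlation between the two sets of scores meaningful.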


Genotyping by low-coverage whole-genome sequencing in intercross pedigrees from outbred founders: a cost efficient approach

10.1101/421768 ◽

2018 ◽

Author(s):

Yanjun Zan ◽

Thibaut Payen ◽

Mette Lillie ◽

Christa F. Honaker ◽

Paul B. Siegel ◽

...

Keyword(s):

High Resolution ◽

Whole Genome Sequencing ◽

Genome Sequencing ◽

Sequence Data ◽

Genotype Imputation ◽

Whole Genome ◽

Efficient Manner ◽

Founder Line ◽

Cost Efficient ◽

Low Coverage

Abstract

Background: Experimental intercrosses between outbred founder populations are powerful resources for mapping loci contributing to complex traits (Quantitative Trait Loci or QTL). Here, we present an approach and accompanying software for high-resolution genotype imputation in such populations using whole-genome high coverage sequence data on founder individuals (∼30×) and low coverage sequence data on intercross individuals (∼0.4×). The method is illustrated in a large F2 pedigree between lines of chickens that have been divergently selected for 40 generations for the same trait (body weight at 8 weeks of age).

Results: We describe how hundreds of individuals were whole-genome sequenced in a cost- and time-efficient manner using a Tn5-based library preparation protocol optimized for this application. In total, 7.6M markers segregated in this pedigree and 10.0 to 13.7% were informative for imputing the founder line genotypes within the F0-F2 families. The genotypes imputed from low coverage sequence data were consistent with the founder line genotypes estimated using SNP and microsatellite markers both at individual imputed sites (92%) and across the genome of individual chickens (93%). The resolution of the recombination breakpoints was high, with 50% being resolved within <10 kb.

Conclusions: A method for genotype imputation from low-coverage whole-genome sequencing in outbred intercrosses is described and evaluated. Applying it to an outbred chicken F2 cross illustrates that it provides high-quality, high-resolution genotypes in a time- and cost-efficient manner.


Whole Genome Sequencing Reveals Multiple Linked Genetic Variants on Canine Chromosome 12 Associated with Risk for Symmetrical Lupoid Onychodystrophy (SLO) in the Bearded Collie

Genes ◽

10.3390/genes12081265 ◽

2021 ◽

Vol 12 (8) ◽

pp. 1265

Author(s):

Liza C. Gershony ◽

Janelle M. Belanger ◽

Marjo K. Hytönen ◽

Hannes Lohi ◽

Anita M. Oberbauer

Keyword(s):

Whole Genome Sequencing ◽

Genome Sequencing ◽

Class Ii ◽

Genotype Imputation ◽

Whole Genome Sequencing Data ◽

Strong Linkage Disequilibrium ◽

Whole Genome ◽

Potential Candidate ◽

A Genome ◽

Canine Chromosome

In dogs, symmetrical lupoid onychodystrophy (SLO) results in nail loss and an abnormal regrowth of the claws. In Bearded Collies, an autoimmune nature has been suggested because certain dog leukocyte antigen (DLA) class II haplotypes are associated with the condition. A genome-wide association study of the Bearded Collie revealed two regions of association that conferred risk for disease: one on canine chromosome (CFA) 12 that encompasses the DLA genes, and one on CFA17. Case-control association was employed on whole genome sequencing data to uncover putative causative variants in SLO within the CFA12 and CFA17 associated regions. Genotype imputation was then employed to refine variants of interest. Although no SLO-associated protein-coding variants were identified on CFA17, multiple variants, many with predicted damaging effects, were identified within potential candidate genes on CFA12. Furthermore, many potentially damaging alleles were fully correlated with the presence of DLA class II risk haplotypes for SLO, suggesting that the variants may reflect DLA class II haplotype association with disease or vice versa. Strong linkage disequilibrium in the region precluded the ability to isolate and assess the individual or combined effect of variants on disease development. Nonetheless, all were predictive of risk for SLO and, with judicious assessment, their application in selective breeding may prove useful to reduce the incidence of SLO in the breed.


Low coverage whole genome sequencing enables accurate assessment of common variants and calculation of genome-wide polygenic scores

10.1101/716977 ◽

2019 ◽

Author(s):

Julian R. Homburger ◽

Cynthia L. Neben ◽

Gilad Mishne ◽

Alicia Y. Zhou ◽

Sekar Kathiresan ◽

...

Keyword(s):

Genetic Testing ◽

Whole Genome Sequencing ◽

Genome Sequencing ◽

Imputation Accuracy ◽

Whole Genome ◽

Genotyping Array ◽

Pathogenic Variants ◽

Genome Wide ◽

Polygenic Scores ◽

Low Coverage

Abstract

Background: The inherited susceptibility of common, complex diseases may be caused by rare, ‘monogenic’ pathogenic variants or by the cumulative effect of numerous common, ‘polygenic’ variants. As such, comprehensive genome interpretation could involve two distinct genetic testing technologies: high coverage next generation sequencing for known genes to detect pathogenic variants and a genome-wide genotyping array followed by imputation to calculate genome-wide polygenic scores (GPSs). Here we assessed the feasibility and accuracy of using low coverage whole genome sequencing (lcWGS) as an alternative to genotyping arrays to calculate GPSs.

Methods: First, we performed downsampling and imputation of WGS data from ten individuals to assess concordance with known genotypes. Second, we assessed the correlation between GPSs for three common diseases (coronary artery disease (CAD), breast cancer (BC), and atrial fibrillation (AF)) calculated using lcWGS and genotyping array in 184 samples. Third, we assessed concordance of lcWGS-based genotype calls and GPS calculation in 120 individuals with known genotypes, selected to reflect diverse ancestral backgrounds. Fourth, we assessed the relationship between GPSs calculated using lcWGS and disease phenotypes in 11,502 European individuals seeking genetic testing.

Results: We found imputation accuracy r² values of greater than 0.90 for all ten samples, including those of African and Ashkenazi Jewish ancestry, with lcWGS data at 0.5X. GPSs calculated using both lcWGS and genotyping array followed by imputation in 184 individuals were highly correlated for each of the three common diseases (r² = 0.93–0.97) with similar score distributions. Using lcWGS data from 120 individuals of diverse ancestral backgrounds, including South Asian, East Asian, and Hispanic individuals, we found similar results with respect to imputation accuracy and GPS correlations. Finally, we calculated GPSs for CAD, BC, and AF using lcWGS in 11,502 European individuals, confirming odds ratios per standard deviation increment in GPSs ranging 1.28 to 1.59, consistent with previous studies.

Conclusions: Here we show that lcWGS is an alternative approach to genotyping arrays for common genetic variant assessment and GPS calculation. lcWGS provides comparable imputation accuracy while also overcoming the ascertainment bias inherent to variant selection in genotyping array design.


Genome-wide association analyses of multiple traits in Duroc pigs using low-coverage whole-genome sequencing strategy

10.1101/754671 ◽

2019 ◽

Author(s):

Ruifei Yang ◽

Xiaoli Guo ◽

Di Zhu ◽

Cheng Bian ◽

Yiqiang Zhao ◽

...

Keyword(s):

Whole Genome Sequencing ◽

Association Analysis ◽

Genome Sequencing ◽

Association Studies ◽

Genotype Imputation ◽

Genome Wide Association ◽

Whole Genome ◽

Genome Wide Association Analysis ◽

Genome Wide ◽

Low Coverage

Abstract

High-density markers discovered in large samples are essential for mapping complex traits at gene-level resolution in agricultural livestock and crops. However, the unavailability of large reference panels and array designs for a target population of agricultural species limits the improvement of array-based genotype imputation. Recent studies showed that very low coverage sequencing (LCS) of a large number of individuals is a cost-effective approach to discover variation in much greater detail in association studies. Here, we performed cohort-wide whole-genome sequencing at an average depth of 0.73× and identified more than 11.3 M SNPs. We also evaluated the data set and performed genome-wide association analysis (GWAS) in 2885 Duroc boars. We compared two different pipelines, selected a suitable method (BaseVar/STITCH) for LCS analyses, and determined that sequencing 1000 individuals at 0.2× depth is enough to identify SNPs with high accuracy in this population. Of the seven association signals derived from the genome-wide association analysis of the LCS variants, which were associated with four economic traits, we found that two QTLs with narrow intervals were possibly responsible for the teat number and back fat thickness traits, and we identified 7 missense variants in a single sequencing step. This strategy (BaseVar/STITCH) is generally applicable to any population and any species that has no suitable reference panel. These findings show that the LCS strategy is a suitable approach for the construction of new genetic resources to facilitate genome-wide association studies, fine mapping of QTLs, and genomic selection, and imply that it can be widely used for agricultural animal breeding in the future.


353 ASAS-EAAP Talk: Low-coverage whole-genome sequencing in local livestock breeds

Journal of Animal Science ◽

10.1093/jas/skaa278.149 ◽

2020 ◽

Vol 98 (Supplement_4) ◽

pp. 81-82

Author(s):

Joaquim Casellas ◽

Melani Martín de Hijas-Villalba ◽

Marta Vázquez-Gómez ◽

Samir Id Lahoucine

Keyword(s):

Whole Genome Sequencing ◽

Genome Sequencing ◽

Error Rate ◽

Allele Frequencies ◽

Paternity Testing ◽

Sequencing Error ◽

Whole Genome ◽

Genomic Evaluation ◽

Sequencing Error Rate ◽

Low Coverage

Abstract

Current European regulations for autochthonous livestock breeds put a special emphasis on pedigree completeness, which requires laboratory paternity testing by genetic markers in most cases. This entails significant economic expenditure for breed societies and precludes other investments in breeding programs, such as genomic evaluation. Within this context, we developed paternity testing through low-coverage whole-genome data in order to reuse these data for genomic evaluation at no cost. Simulations relied on diploid genomes composed of 30 chromosomes (100 cM each) with 3,000,000 SNP per chromosome. Each population evolved during 1,000 non-overlapping generations with effective size 100, mutation rate 10⁻⁴, and recombination by Kosambi’s function. Only those populations with 1,000,000 ± 10% polymorphic SNP per chromosome in generation 1,000 were retained for further analyses, and expanded to the required number of parents and offspring. Individuals were sequenced at 0.01, 0.05, 0.1, 0.5 and 1X depth, with 100, 500, 1,000 or 10,000 base-pair reads and by assuming a random sequencing error rate per SNP between 10⁻² and 10⁻⁵. Assuming known allele frequencies in the population and sequencing error rate, 0.05X depth sufficed to corroborate the true father (85.0%) and to discard other candidates (96.3%). Those percentages increased up to 99.6% and 99.9% with 0.1X depth, respectively (read length = 10,000 bp; smaller read lengths slightly improved the results because they increase the number of sequenced SNP). Results were highly sensitive to biases in allele frequencies and robust to inaccuracies regarding sequencing error rate. Low-coverage whole-genome sequencing data could be subsequently integrated into genomic BLUP equations by appropriately constructing the genomic relationship matrix. This approach increased the correlation between simulated and predicted breeding values by 1.21% (h² = 0.25; 100 parents and 900 offspring; 0.1X depth by 10,000 bp reads). Although small, this increase opens the door to genomic evaluation in local livestock breeds.
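To get a feel for how little data these depths provide, a rough back-of-the-envelope sketch follows. It assumes reads land uniformly at random, so per-site coverage is approximately Poisson and the chance that a site is hit by at least one read is 1 - exp(-depth). This approximation is not taken from the abstract, it ignores read length, and the function name is hypothetical.

```python
import math

def fraction_covered(depth):
    """Expected share of sites covered by at least one read.

    Assumes Poisson-distributed coverage with mean equal to the depth.
    """
    return 1.0 - math.exp(-depth)

# Roughly 30 chromosomes x 1,000,000 polymorphic SNP, as in the simulation.
N_SNP = 30 * 1_000_000

for depth in (0.01, 0.05, 0.1, 0.5, 1.0):
    share = fraction_covered(depth)
    print(f"{depth:>5}X: {share:.4f} of sites, about {share * N_SNP:,.0f} SNP")
```

Even when only a few percent of sites are covered, that still leaves a very large absolute number of SNP in a genome of this size, which is consistent with the abstract's finding that very low depths can support paternity testing.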


Demographic history and patterns of molecular evolution from whole genome sequencing in the radiation of Galapagos giant tortoises

Molecular Ecology ◽

10.1111/mec.16176 ◽

2021 ◽

Author(s):

Evelyn L. Jensen ◽

Stephen J. Gaughran ◽

Ryan C. Garrick ◽

Michael A. Russello ◽

Adalgisa Caccone

Keyword(s):

Molecular Evolution ◽

Whole Genome Sequencing ◽

Genome Sequencing ◽

Demographic History ◽

Whole Genome


Batch effects in population genomic studies with low‐coverage whole genome sequencing data: causes, detection, and mitigation

Molecular Ecology Resources ◽

10.1111/1755-0998.13559 ◽

2021 ◽

Author(s):

Runyang Nicolas Lou ◽

Nina Overgaard Therkildsen

Keyword(s):

Whole Genome Sequencing ◽

Genome Sequencing ◽

Whole Genome Sequencing Data ◽

Whole Genome ◽

Batch Effects ◽

Sequencing Data ◽

Population Genomic ◽

Genomic Studies ◽

Low Coverage


Fine Mapping Using Whole-Genome Sequencing Confirms Anti-Müllerian Hormone as a Major Gene for Sex Determination in Farmed Nile Tilapia (Oreochromis niloticus L.)

G3 Genes|Genome|Genetics ◽

10.1534/g3.119.400297 ◽

2019 ◽

Vol 9 (10) ◽

pp. 3213-3223 ◽

Cited By ~ 8

Author(s):

Giovanna Cáceres ◽

María E. López ◽

María I. Cádiz ◽

Grazyella M. Yoshida ◽

Ana Jedlicki ◽

...

Keyword(s):

Sex Determination ◽

Whole Genome Sequencing ◽

Genome Sequencing ◽

Oreochromis Niloticus ◽

Nile Tilapia ◽

Major Gene ◽

Whole Genome ◽

Important Species ◽

Genome Wide ◽

A Genome

Nile tilapia (Oreochromis niloticus) is one of the most cultivated and economically important species in world aquaculture. Intensive production promotes the use of monosex animals, due to an important dimorphism that favors male growth. Currently, the main mechanism to obtain all-male populations is the use of hormones in feeding during larval and fry phases. Identifying genomic regions associated with sex determination in Nile tilapia is a research topic of great interest. The objective of this study was to identify genomic variants associated with sex determination in three commercial populations of Nile tilapia. Whole-genome sequencing of 326 individuals was performed, and a total of 2.4 million high-quality bi-allelic single nucleotide polymorphisms (SNPs) were identified after quality control. A genome-wide association study (GWAS) was conducted to identify markers associated with the binary sex trait (males = 1; females = 0). A mixed logistic regression GWAS model was fitted, and a genome-wide significant signal comprising 36 SNPs, spanning a genomic region of 536 kb on chromosome 23, was identified. Ten of these 36 genetic variants intersect the anti-Müllerian hormone (Amh) gene. Other significant SNPs were located in the neighboring Amh gene region. This gene has been strongly associated with sex determination in several vertebrate species, playing an essential role in the differentiation of male and female reproductive tissue in early stages of development. This finding provides useful information to better understand the genetic mechanisms underlying sex determination in Nile tilapia.
