Genotype calling and haplotype inference from low coverage sequence data in heterozygous plant genome using HetMap

Mapping Intimacies ◽

10.21203/rs.3.rs-1220819/v1 ◽

2022 ◽

Author(s):

Hao Gong ◽

Bin Han

Keyword(s):

Wild Rice ◽

Hybrid Rice ◽

Sequence Data ◽

Genotype Imputation ◽

Plant Genome ◽

High Coverage ◽

Software Packages ◽

Heterozygous Plant ◽

Low Coverage ◽

Genotype Inference

Abstract Many software packages and pipelines had been developed to handle the sequence data of the model species. However, Genotyping from complex heterozygous plant genome needs further improvement on the previous methods. Here we present a new pipeline available at https://github.com/Ncgrhg/HetMapv1) for variant calling and missing genotype imputation from low coverage sequence data for heterozygous plant genomes. To check the performance of the HetMap on the real sequence data, HetMap was applied to both the F1 hybrid rice population which consists of 1495 samples and wild rice population with 446 samples. Four high coverage sequence hybrid rice accessions and two high coverage sequence wild rice accessions, which were also included in low coverage sequence data, are used to validate the genotype inference accuracy. The validation results showed that HetMap archived significant improvement in heterozygous genotype inference accuracy (13.65% for hybrid rice, 26.05% for wild rice) and total accuracy compared with other similar software packages. The application of the new genotype with the genome wide association study also showed improvement of association power in two wild rice phenotypes. It could archive high genotype inference accuracy with low sequence coverage with a small population size with both the natural population and constructed recombination population. HetMap provided a powerful tool for the heterozygous plant genome sequence data analysis, which may help the discover of new phenotype regions for the plant species with complex heterozygous genome.

Download Full-text

High coverage whole genome sequencing of the expanded 1000 Genomes Project cohort including 602 trios

10.1101/2021.02.06.430068 ◽

2021 ◽

Cited By ~ 4

Author(s):

Marta Byrska-Bishop ◽

Uday S. Evani ◽

Xuefang Zhao ◽

Anna O. Basile ◽

Haley J. Abel ◽

...

Keyword(s):

Whole Genome Sequencing ◽

Genome Sequencing ◽

Sequence Data ◽

Whole Genome ◽

1000 Genomes Project ◽

Phase 3 ◽

High Coverage ◽

Entire Cohort ◽

1000 Genomes ◽

Low Coverage

ABSTRACTThe 1000 Genomes Project (1kGP), launched in 2008, is the largest fully open resource of whole genome sequencing (WGS) data consented for public distribution of raw sequence data without access or use restrictions. The final (phase 3) 2015 release of 1kGP included 2,504 unrelated samples from 26 populations, representing five continental regions of the world and was based on a combination of technologies including low coverage WGS (mean depth 7.4X), high coverage whole exome sequencing (mean depth 65.7X), and microarray genotyping. Here, we present a new, high coverage WGS resource encompassing the original 2,504 1kGP samples, as well as an additional 698 related samples that result in 602 complete trios in the 1kGP cohort. We sequenced this expanded 1kGP cohort of 3,202 samples to a targeted depth of 30X using Illumina NovaSeq 6000 instruments. We performed SNV/INDEL calling against the GRCh38 reference using GATK’s HaplotypeCaller, and generated a comprehensive set of SVs by integrating multiple analytic methods through a sophisticated machine learning model, upgrading the 1kGP dataset to current state-of-the-art standards. Using this strategy, we defined over 111 million SNVs, 14 million INDELs, and ∼170 thousand SVs across the entire cohort of 3,202 samples with estimated false discovery rate (FDR) of 0.3%, 1.0%, and 1.8%, respectively. By comparison to the low-coverage phase 3 callset, we observed substantial improvements in variant discovery and estimated FDR that were facilitated by high coverage re-sequencing and expansion of the cohort. Specifically, we called 7% more SNVs, 59% more INDELs, and 170% more SVs per genome than the phase 3 callset. Moreover, we leveraged the presence of families in the cohort to achieve superior haplotype phasing accuracy and we demonstrate improvements that the high coverage panel brings especially for INDEL imputation. We make all the data generated as part of this project publicly available and we envision this updated version of the 1kGP callset to become the new de facto public resource for the worldwide scientific community working on genomics and genetics.

Download Full-text

PHARP: A pig haplotype reference panel for genotype imputation

10.1101/2021.06.03.446888 ◽

2021 ◽

Author(s):

Zhen Wang ◽

Zhenyang Zhang ◽

Zitao Chen ◽

Jiabao Sun ◽

Caiyun Cao ◽

...

Keyword(s):

Complex Traits ◽

Sequence Data ◽

Genotype Imputation ◽

Reference Panel ◽

Whole Genome Sequence ◽

Sequencing Data ◽

Large White ◽

Downstream Analysis ◽

Low Coverage ◽

Analytical Tools

Pigs not only function as a major meat source worldwide but also are commonly used as an animal model for studying human complex traits. A large haplotype reference panel has been used to facilitate efficient phasing and imputation of relatively sparse genome-wide microarray chips and low-coverage sequencing data. Using the imputed genotypes in the downstream analysis, such as GWASs, TWASs, eQTL mapping and genomic prediction (GS), is beneficial for obtaining novel findings. However, currently, there is still a lack of publicly available and high-quality pig reference panels with large sample sizes and high diversity, which greatly limits the application of genotype imputation in pigs. In response, we built the pig Haplotype Reference Panel (PHARP) database. PHARP provides a reference panel of 2,012 pig haplotypes at 34 million SNPs constructed using whole-genome sequence data from more than 49 studies of 71 pig breeds. It also provides Web-based analytical tools that allow researchers to carry out phasing and imputation consistently and efficiently. PHARP is freely accessible at http://alphaindex.zju.edu.cn/PHARP/index.php. We demonstrate its applicability for pig commercial 50K SNP arrays, by accurately imputing 2.6 billion genotypes at a concordance rate value of 0.971 in 81 Large White pigs (~ 17x sequencing coverage). We also applied our reference panel to impute the low-density SNP chip into the high-density data for three GWASs and found novel significantly associated SNPs that might be casual variants.

Download Full-text

Evaluating genotype imputation pipeline for ultra-low coverage ancient genomes

Scientific Reports ◽

10.1038/s41598-020-75387-w ◽

2020 ◽

Vol 10 (1) ◽

Author(s):

Ruoyun Hui ◽

Eugenia D’Atanasio ◽

Lara M. Cassidy ◽

Christiana L. Scheib ◽

Toomas Kivisild

Keyword(s):

Ancient Dna ◽

Imputation Accuracy ◽

Single Step ◽

Genotype Imputation ◽

Common Variants ◽

High Coverage ◽

Input Mode ◽

Missing Genotypes ◽

Low Coverage ◽

Reference Bias

Abstract Although ancient DNA data have become increasingly more important in studies about past populations, it is often not feasible or practical to obtain high coverage genomes from poorly preserved samples. While methods of accurate genotype imputation from > 1 × coverage data have recently become a routine, a large proportion of ancient samples remain unusable for downstream analyses due to their low coverage. Here, we evaluate a two-step pipeline for the imputation of common variants in ancient genomes at 0.05–1 × coverage. We use the genotype likelihood input mode in Beagle and filter for confident genotypes as the input to impute missing genotypes. This procedure, when tested on ancient genomes, outperforms a single-step imputation from genotype likelihoods, suggesting that current genotype callers do not fully account for errors in ancient sequences and additional quality controls can be beneficial. We compared the effect of various genotype likelihood calling methods, post-calling, pre-imputation and post-imputation filters, different reference panels, as well as different imputation tools. In a Neolithic Hungarian genome, we obtain ~ 90% imputation accuracy for heterozygous common variants at coverage 0.05 × and > 97% accuracy at coverage 0.5 ×. We show that imputation can mitigate, though not eliminate reference bias in ultra-low coverage ancient genomes.

Download Full-text

Optimizing Genomic Selection in Dezhou Donkey Using Low Coverage Whole Genome Sequencing

10.21203/rs.3.rs-607740/v1 ◽

2021 ◽

Author(s):

Changheng Zhao ◽

Jun Teng ◽

Xinhao Zhang ◽

Dan Wang ◽

Xinyi Zhang ◽

...

Keyword(s):

Whole Genome Sequencing ◽

Genomic Selection ◽

Genome Sequencing ◽

Sequence Data ◽

Low Cost ◽

Imputation Accuracy ◽

Genotype Imputation ◽

Whole Genome Sequence ◽

Whole Genome ◽

Low Coverage

Abstract Background Low coverage whole genome sequencing is a low-cost genotyping technology. Combining with genotype imputation approaches, it is likely to become a critical component of cost-efficient genomic selection programs in agricultural livestock. Here, we used the low-coverage sequence data of 617 Dezhou donkeys to investigate the performance of genotype imputation for low coverage whole genome sequence data and genomic selection based on the imputed genotype data. The specific aims were: (i) to measure the accuracy of genotype imputation under different sequencing depths, sample sizes, MAFs, and imputation pipelines; and (ii) to assess the accuracy of genomic selection under different marker densities derived from the imputed sequence data, different strategies for constructing the genomic relationship matrixes, and single- vs multi-trait models. Results We found that a high imputation accuracy (> 0.95) can be achieved for sequence data with sequencing depth as low as 1x and the number of sequenced individuals equal to 400. For genomic selection, the best performance was obtained by using a marker density of 410K and a G matrix constructed using marker dosage information. Multi-trait GBLUP performed better than single-trait GBLUP. Conclusions Our study demonstrates that low coverage whole genome sequencing would be a cost-effective method for genomic selection in Dezhou Donkey.

Download Full-text

Impact of index hopping and bias towards the reference allele on accuracy of genotype calls from low-coverage sequencing

10.1101/358085 ◽

2018 ◽

Cited By ~ 1

Author(s):

Roger Ros-Freixedes ◽

Battagin Mara ◽

Martin Johnsson ◽

Gregor Gorjanc ◽

Alan J Mileham ◽

...

Keyword(s):

Sequence Data ◽

Cost Effective ◽

Reference Allele ◽

High Coverage ◽

Sequencing Errors ◽

Percentage Points ◽

Low Coverage ◽

The Impact ◽

Genotype Concordance

AbstractBackgroundInherent sources of error and bias that affect the quality of the sequence data include index hopping and bias towards the reference allele. The impact of these artefacts is likely greater for low-coverage data than for high-coverage data because low-coverage data has scant information and standard tools for processing sequence data were designed for high-coverage data. With the proliferation of cost-effective low-coverage sequencing there is a need to understand the impact of these errors and bias on resulting genotype calls.ResultsWe used a dataset of 26 pigs sequenced both at 2x with multiplexing and at 30x without multiplexing to show that index hopping and bias towards the reference allele due to alignment had little impact on genotype calls. However, pruning of alternative haplotypes supported by a number of reads below a predefined threshold, a default and desired step for removing potential sequencing errors in high-coverage data, introduced an unexpected bias towards the reference allele when applied to low-coverage data. This bias reduced best-guess genotype concordance of low-coverage sequence data by 19.0 absolute percentage points.ConclusionsWe propose a simple pipeline to correct this bias and we recommend that users of low-coverage sequencing be wary of unexpected biases produced by tools designed for high-coverage sequencing.

Download Full-text

Genotyping by low-coverage whole-genome sequencing in intercross pedigrees from outbred founders: a cost efficient approach

10.1101/421768 ◽

2018 ◽

Author(s):

Yanjun Zan ◽

Thibaut Payen ◽

Mette Lillie ◽

Christa F. Honaker ◽

Paul B. Siegel ◽

...

Keyword(s):

High Resolution ◽

Whole Genome Sequencing ◽

Genome Sequencing ◽

Sequence Data ◽

Genotype Imputation ◽

Whole Genome ◽

Efficient Manner ◽

Founder Line ◽

Cost Efficient ◽

Low Coverage

ABSTRACTBackgroundExperimental intercrosses between outbred founder populations are powerful resources for mapping loci contributing to complex traits (Quantitative Trait Loci or QTL). Here, we present an approach and accompanying software for high-resolution genotype imputation in such populations using whole-genome high coverage sequence data on founder individuals (∼30×) and low coverage sequence data on intercross individuals (∼0.4×). The method is illustrated in a large F2 pedigree between lines of chickens that have been divergently selected for 40 generations for the same trait (body weight at 8 weeks of age).ResultsDescribed is how hundreds of individuals were whole-genome sequenced in a cost- and time-efficient manner using a Tn5-based library preparation protocol optimized for this application. In total, 7.6M markers segregated in this pedigree and 10.0 to 13.7% were informative for imputing the founder line genotypes within the F0-F2 families. The genotypes imputed from low coverage sequence data were consistent with the founder line genotypes estimated using SNP and microsatellite markers both at individual imputed sites (92%) and across the genome of individual chickens (93%). The resolution of the recombination breakpoints was high with 50% being resolved within <10kb.ConclusionsA method for genotype imputation from low-coverage whole-genome sequencing in outbred intercrosses is described and evaluated. By applying it to an outbred chicken F2 cross it is illustrated that it provides high quality, high-resolution genotypes in a time and cost efficient manner.

Download Full-text

Genomic Rearrangements Considered as Quantitative Traits

10.1101/087387 ◽

2016 ◽

Author(s):

Martha Imprialou ◽

André Kahles ◽

Joshua G. Steffen ◽

Edward J. Osborne ◽

Xiangchao Gan ◽

...

Keyword(s):

Gene Expression ◽

Quantitative Traits ◽

Sequence Data ◽

Expression Study ◽

High Coverage ◽

Fungal Disease Resistance ◽

Large Populations ◽

Study Association ◽

Low Coverage ◽

Population Sequence

AbstractTo understand the population genetics of structural variants (SVs), and their effects on phenotypes, we developed an approach to mapping SVs, particularly transpositions, segregating in a sequenced population, and which avoids calling SVs directly. The evidence for a potential SV at a locus is indicated by variation in the counts of short-reads that map anomalously to the locus. These SV traits are treated as quantitative traits and mapped genetically, analogously to a gene expression study. Association between an SV trait at one locus and genotypes at a distant locus indicate the origin and target of a transposition. Using ultra-low-coverage (0.3x) population sequence data from 488 recombinant inbred Arabidopsis genomes, we identified 6,502 segregating SVs. Remarkably, 25% of these were transpositions. Whilst many SVs cannot be delineated precisely, PCR validated 83% of 44 predicted transposition breakpoints. We show that specific SVs may be causative for quantitative trait loci for germination, fungal disease resistance and other phenotypes. Further we show that the phenotypic heritability attributable to sequence anomalies differs from, and in the case of time to germination and bolting, exceeds that due to standard genetic variation. Gene expression within SVs is also more likely to be silenced or dysregulated. This approach is generally applicable to large populations sequenced at low-coverage, and complements the prevalent strategy of SV discovery in fewer individuals sequenced at high coverage.

Download Full-text

Genotyping Strategies Using ddRAD Sequencing in Farmed Arctic Charr (Salvelinus alpinus)

Animals ◽

10.3390/ani11030899 ◽

2021 ◽

Vol 11 (3) ◽

pp. 899

Author(s):

Fotis Pappas ◽

Christos Palaiokostas

Keyword(s):

Genetic Diversity ◽

Salvelinus Alpinus ◽

Arctic Charr ◽

Cost Effective ◽

Limiting Factors ◽

Phenotypic Traits ◽

Production Volume ◽

High Coverage ◽

Genomic Relationships ◽

Low Coverage

Incorporation of genomic technologies into fish breeding programs is a modern reality, promising substantial advances regarding the accuracy of selection, monitoring the genetic diversity and pedigree record verification. Single nucleotide polymorphism (SNP) arrays are the most commonly used genomic tool, but the investments required make them unsustainable for emerging species, such as Arctic charr (Salvelinus alpinus), where production volume is low. The requirement to genotype a large number of animals for breeding practices necessitates cost effective genotyping approaches. In the current study, we used double digest restriction site-associated DNA (ddRAD) sequencing of either high or low coverage to genotype Arctic charr from the Swedish national breeding program and performed analytical procedures to assess their utility in a range of tasks. SNPs were identified and used for deciphering the genetic structure of the studied population, estimating genomic relationships and implementing an association study for growth-related traits. Missing information and underestimation of heterozygosity in the low coverage set were limiting factors in genetic diversity and genomic relationship analyses, where high coverage performed notably better. On the other hand, the high coverage dataset proved to be valuable when it comes to identifying loci that are associated with phenotypic traits of interest. In general, both genotyping strategies offer sustainable alternatives to hybridization-based genotyping platforms and show potential for applications in aquaculture selective breeding.

Download Full-text

Sequencing of 53,831 diverse genomes from the NHLBI TOPMed Program

Nature ◽

10.1038/s41586-021-03205-y ◽

2021 ◽

Vol 590 (7845) ◽

pp. 290-299 ◽

Cited By ~ 22

Author(s):

Daniel Taliun ◽

◽

Daniel N. Harris ◽

Michael D. Kessler ◽

Jedidiah Carlson ◽

...

Keyword(s):

Rare Variants ◽

Sequence Data ◽

Association Studies ◽

Genotype Imputation ◽

Genome Wide Association Studies ◽

Phenotypic Data ◽

Treatment And Prevention ◽

Genome Wide ◽

Diverse Backgrounds ◽

Unmapped Reads

AbstractThe Trans-Omics for Precision Medicine (TOPMed) programme seeks to elucidate the genetic architecture and biology of heart, lung, blood and sleep disorders, with the ultimate goal of improving diagnosis, treatment and prevention of these diseases. The initial phases of the programme focused on whole-genome sequencing of individuals with rich phenotypic data and diverse backgrounds. Here we describe the TOPMed goals and design as well as the available resources and early insights obtained from the sequence data. The resources include a variant browser, a genotype imputation server, and genomic and phenotypic data that are available through dbGaP (Database of Genotypes and Phenotypes)1. In the first 53,831 TOPMed samples, we detected more than 400 million single-nucleotide and insertion or deletion variants after alignment with the reference genome. Additional previously undescribed variants were detected through assembly of unmapped reads and customized analysis in highly variable loci. Among the more than 400 million detected variants, 97% have frequencies of less than 1% and 46% are singletons that are present in only one individual (53% among unrelated individuals). These rare variants provide insights into mutational processes and recent human evolutionary history. The extensive catalogue of genetic variation in TOPMed studies provides unique opportunities for exploring the contributions of rare and noncoding sequence variants to phenotypic variation. Furthermore, combining TOPMed haplotypes with modern imputation methods improves the power and reach of genome-wide association studies to include variants down to a frequency of approximately 0.01%.

Download Full-text

Genome diversity in Ukraine

GigaScience ◽

10.1093/gigascience/giaa159 ◽

2021 ◽

Vol 10 (1) ◽

Author(s):

Taras K Oleksyk ◽

Walter W Wolfsberger ◽

Alexandra M Weber ◽

Khrystyna Shchubelka ◽

Olga T Oleksyk ◽

...

Keyword(s):

Sequence Data ◽

Copy Number Variations ◽

Genomic Variation ◽

High Coverage ◽

Genome Data ◽

New Information ◽

Genome Wide ◽

Public Data ◽

Genome Wide Data ◽

Multiple Samples

Abstract Background The main goal of this collaborative effort is to provide genome-wide data for the previously underrepresented population in Eastern Europe, and to provide cross-validation of the data from genome sequences and genotypes of the same individuals acquired by different technologies. We collected 97 genome-grade DNA samples from consented individuals representing major regions of Ukraine that were consented for public data release. BGISEQ-500 sequence data and genotypes by an Illumina GWAS chip were cross-validated on multiple samples and additionally referenced to 1 sample that has been resequenced by Illumina NovaSeq6000 S4 at high coverage. Results The genome data have been searched for genomic variation represented in this population, and a number of variants have been reported: large structural variants, indels, copy number variations, single-nucletide polymorphisms, and microsatellites. To our knowledge, this study provides the largest to-date survey of genetic variation in Ukraine, creating a public reference resource aiming to provide data for medical research in a large understudied population. Conclusions Our results indicate that the genetic diversity of the Ukrainian population is uniquely shaped by evolutionary and demographic forces and cannot be ignored in future genetic and biomedical studies. These data will contribute a wealth of new information bringing forth a wealth of novel, endemic and medically related alleles.

Download Full-text