Development of the Wheat Practical Haplotype Graph Database as a Resource for Genotyping Data Storage and Genotype Imputation

Author(s):  
Katherine W Jordan ◽  
Peter J Bradbury ◽  
Zachary R Miller ◽  
Moses Nyine ◽  
Fei He ◽  
...  

Abstract To improve the efficiency of high-density genotype data storage and imputation in bread wheat (Triticum aestivum L.), we applied the Practical Haplotype Graph (PHG) tool. The wheat PHG database was built using whole-exome capture sequencing data from a diverse set of 65 wheat accessions. Population haplotypes were inferred for the reference genome intervals defined by the boundaries of the high-quality gene models. Missing genotypes in the inference panels, composed of wheat cultivars or recombinant inbred lines genotyped by exome capture, genotyping-by-sequencing (GBS), or whole-genome skim-seq sequencing approaches, were imputed using the wheat PHG database. Though imputation accuracy varied depending on the method of sequencing and coverage depth, we found 92% imputation accuracy with 0.01x sequence coverage, which was slightly lower than the accuracy obtained using the 0.5x sequence coverage (96.6%). Compared to Beagle, on average, PHG imputation was ∼3.5% (p-value < 2 × 10⁻¹⁴) more accurate, and showed 27% higher accuracy at imputing a rare haplotype introgressed from a wild relative into wheat. We found reduced accuracy of imputation with independent 2x GBS data (88.6%), which increased to 89.2% with the inclusion of parental haplotypes in the database. The accuracy reduction with GBS is likely associated with the small overlap between GBS markers and the exome capture dataset, which was used for constructing PHG. The highest imputation accuracy was obtained with exome capture for the wheat D genome, which also showed the highest levels of linkage disequilibrium and proportion of identity-by-descent regions among accessions in the PHG database. We demonstrate that genetic mapping based on genotypes imputed using PHG identifies SNPs with a broader range of effect sizes that together explain a higher proportion of genetic variance for heading date and meiotic crossover rate compared to previous studies.

2021 ◽  
Author(s):  
Katherine Jordan ◽  
Peter Bradbury ◽  
Zachary R Miller ◽  
Moses Nyine ◽  
Fei He ◽  
...  

To improve the efficiency of high-density genotype data storage and imputation in bread wheat (Triticum aestivum L.), we applied the Practical Haplotype Graph (PHG) tool. The wheat PHG database was built using whole-exome capture sequencing data from a diverse set of 65 wheat accessions. Population haplotypes were inferred for the reference genome intervals defined by the boundaries of the high-quality gene models. Missing genotypes in the inference panels, composed of wheat cultivars or recombinant inbred lines genotyped by exome capture, genotyping-by-sequencing (GBS), or whole-genome skim-seq sequencing approaches, were imputed using the wheat PHG database. Though imputation accuracy varied depending on the method of sequencing and coverage depth, we found 93% imputation accuracy with 0.01x sequence coverage, which was only slightly lower than the accuracy obtained using the 0.5x sequence coverage (96.9%). Compared to Beagle, on average, PHG imputation was ~4% (p-value = 0.00027) more accurate, and showed 27% higher accuracy at imputing a rare haplotype introgressed from a wild relative into wheat. The reduced accuracy of imputation with GBS data (90.4%) is likely associated with the small overlap between GBS markers and the exome capture dataset, which was used for constructing PHG. The highest imputation accuracy was obtained with exome capture for the wheat D genome, which also showed the highest levels of linkage disequilibrium and proportion of identity-by-descent regions among accessions in our reference panel. We demonstrate that genetic mapping based on genotypes imputed using PHG identifies SNPs with a broader range of effect sizes that together explain a higher proportion of genetic variance for heading date and meiotic crossover rate compared to previous studies.


2022 ◽  
Author(s):  
Lars Wienbrandt ◽  
David Ellinghaus

Background: Reference-based phasing and genotype imputation algorithms have been developed with sublinear theoretical runtime behaviour, but runtimes are still high in practice when large genome-wide reference datasets are used. Methods: We developed EagleImp, a software with algorithmic and technical improvements and new features for accurate and accelerated phasing and imputation in a single tool. Results: We compared accuracy and runtime of EagleImp with Eagle2, PBWT and prominent imputation servers using whole-genome sequencing data from the 1000 Genomes Project, the Haplotype Reference Consortium and simulated data with more than 1 million reference genomes. EagleImp is 2 to 10 times faster (depending on the single or multiprocessor configuration selected) than Eagle2/PBWT, with the same or better phasing and imputation quality in all tested scenarios. For common variants investigated in typical GWAS studies, EagleImp provides same or higher imputation accuracy than the Sanger Imputation Service, Michigan Imputation Server and the newly developed TOPMed Imputation Server, despite larger (not publicly available) reference panels. It has many new features, including automated chromosome splitting and memory management at runtime to avoid job aborts, fast reading and writing of large files, and various user-configurable algorithm and output options. Conclusions: Due to the technical optimisations, EagleImp can perform fast and accurate reference-based phasing and imputation for future very large reference panels with more than 1 million genomes. EagleImp is freely available for download from https://github.com/ikmb/eagleimp.


2019 ◽  
Author(s):  
Seong-Keun Yoo ◽  
Chang-Uk Kim ◽  
Hie Lim Kim ◽  
Sungjae Kim ◽  
Jong-Yeon Shin ◽  
...  

Abstract Genotype imputation using the reference panel is a cost-effective strategy to fill millions of missing genotypes for the purpose of various genetic analyses. Here, we present the Northeast Asian Reference Database (NARD), including whole-genome sequencing data of 1,781 individuals from Korea, Mongolia, Japan, China, and Hong Kong. NARD provides the genetic diversities of Korean (n=850) and Mongolian (n=386) ancestries that were not present in the 1000 Genomes Project Phase 3 (1KGP3). We combined and re-phased the genotypes from NARD and 1KGP3 to construct a union set of haplotypes. This approach established a robust imputation reference panel for the Northeast Asian populations, which yields the greatest imputation accuracy of rare and low-frequency variants compared with the existing panels. Also, we illustrate that NARD can potentially improve disease variant discovery by reducing pathogenic candidates. Overall, this study provides a decent reference panel for the genetic studies in Northeast Asia.


2021 ◽  
Vol 12 ◽  
Author(s):  
Hongchun Xiong ◽  
Yuting Li ◽  
Huijun Guo ◽  
Yongdun Xie ◽  
Linshu Zhao ◽  
...  

Agronomic traits such as heading date (HD), plant height (PH), thousand grain weight (TGW), and spike length (SL) are important factors affecting wheat yield. In this study, we constructed a high-density genetic linkage map using the Wheat55K SNP Array to map quantitative trait loci (QTLs) for these traits in 207 recombinant inbred lines (RILs). A total of 37 QTLs were identified, including 9 QTLs for HD, 7 QTLs for PH, 12 QTLs for TGW, and 9 QTLs for SL, which explained 3.0–48.8% of the phenotypic variation. Kompetitive Allele Specific PCR (KASP) markers were developed based on sequencing data and used for validation of the stably detected QTLs on chromosomes 3A, 4B and 6A using 400 RILs. A QTL cluster on chromosome 4B for PH and TGW was delimited to a 0.8 Mb physical interval explaining 12.2–22.8% of the phenotypic variation. Gene annotations and analyses of SNP effects suggested that a gene encoding protein Photosynthesis Affected Mutant 68, which is essential for photosystem II assembly, is a candidate gene affecting PH and TGW. In addition, the QTL for HD on chromosome 3A was narrowed down to a 2.5 Mb interval, and a gene encoding an R3H domain-containing protein was speculated to be the causal gene influencing HD. The linked KASP markers developed in this study will be useful for marker-assisted selection in wheat breeding, and the candidate genes provide new insight into genetic study for those traits in wheat.


Genes ◽  
2021 ◽  
Vol 12 (4) ◽  
pp. 604
Author(s):  
Paolo Vitale ◽  
Fabio Fania ◽  
Salvatore Esposito ◽  
Ivano Pecorella ◽  
Nicola Pecchioni ◽  
...  

Traits such as plant height (PH), juvenile growth habit (GH), heading date (HD), and tiller number are important for both increasing yield potential and improving crop adaptation to climate change. In the present study, these traits were investigated by using the same bi-parental population at early (F2 and F2-derived F3 families) and late (F6 and F7, recombinant inbred lines, RILs) generations to detect quantitative trait loci (QTLs) and search for candidate genes. A total of 176 and 178 lines were genotyped by the wheat Illumina 25K Infinium SNP array. The two genetic maps spanned 2486.97 cM and 3732.84 cM in length, for the F2 and RILs, respectively. QTLs explaining the highest phenotypic variation were found on chromosomes 2B, 2D, 5A, and 7D for HD and GH, whereas those for PH were found on chromosomes 4B and 4D. Several QTL detected in the early generations (i.e., PH and tiller number) were not detected in the late generations as they were due to dominance effects. Some of the identified QTLs co-mapped to well-known adaptive genes (i.e., Ppd-1, Vrn-1, and Rht-1). Other putative candidate genes were identified for each trait, of which PINE1 and PIF4 may be considered new for GH and TTN in wheat. The use of a large F2 mapping population combined with NGS-based genotyping techniques could improve map resolution and allow closer QTL tagging.


Author(s):  
Simon F Lashmar ◽  
Donagh P Berry ◽  
Rian Pierneef ◽  
Farai C Muchadeyi ◽  
Carina Visser

Abstract A major obstacle in applying genomic selection (GS) to uniquely adapted local breeds in less-developed countries has been the cost of genotyping at high densities of single nucleotide polymorphisms (SNP). Cost reduction can be achieved by imputing genotypes from lower to higher densities. Locally adapted breeds tend to be admixed and exhibit a high degree of genomic heterogeneity thus necessitating the optimization of SNP selection for downstream imputation. The aim of this study was to quantify the achievable imputation accuracy for a sample of 1,135 South African (SA) Drakensberger using several custom-derived lower-density panels varying in both SNP density and how the SNP were selected. From a pool of 120,608 genotyped SNP, subsets of SNP were chosen 1) at random, 2) with even genomic dispersion, 3) by maximizing the mean minor allele frequency (MAF), 4) using a combined score of MAF and linkage disequilibrium (LD), 5) using a partitioning-around-medoids (PAM) algorithm, and finally 6) using a hierarchical LD-based clustering algorithm. Imputation accuracy to higher density improved as SNP density increased; animal-wise imputation accuracy defined as the within-animal correlation between the imputed and actual alleles ranged from 0.625 to 0.990 when 2,500 randomly selected SNP were chosen versus a range of 0.918 to 0.999 when 50,000 randomly selected SNP were used. At a panel density of 10,000 SNP, the mean (standard deviation) animal-wise allele concordance rate was 0.976 (0.018) versus 0.982 (0.014) when the worst (i.e., random) as opposed to the best (i.e., combination of MAF and LD) SNP selection strategy was employed. A difference of 0.071 units was observed between the mean correlation-based accuracy of imputed SNP categorized as low (0.01<MAF≤0.1) versus high MAF (0.4<MAF≤0.5). Greater mean imputation accuracy was achieved for SNP located on autosomal extremes when these regions were populated with more SNP. 
The presented results suggested that genotype imputation can be a practical cost-saving strategy for indigenous breeds such as the South African Drakensberger. Based on the results, a genotyping panel consisting of approximately 10,000 SNP selected based on a combination of MAF and LD would suffice in achieving a less than 3% imputation error rate for a breed characterized by genomic admixture on the condition that these SNP are selected based on breed-specific selection criteria.
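The two animal-wise accuracy measures named in this abstract (allele concordance rate and the correlation between imputed and actual alleles) can be sketched as follows. This is a minimal illustration, not the authors' pipeline: it assumes genotypes are coded as 0/1/2 copies of one allele, counts mismatching alleles as the absolute dosage difference, and uses a made-up toy animal; the function names `allele_concordance` and `pearson_r` are hypothetical.

```python
from math import sqrt

def allele_concordance(true_gt, imputed_gt):
    """Share of alleles imputed correctly for one animal.

    Assumption: genotypes are allele dosages (0, 1 or 2), so each SNP
    contributes two alleles and |true - imputed| mismatched alleles.
    """
    total_alleles = 2 * len(true_gt)
    mismatched = sum(abs(t - i) for t, i in zip(true_gt, imputed_gt))
    return 1.0 - mismatched / total_alleles

def pearson_r(x, y):
    """Pearson correlation between two equal-length dosage vectors."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)

# Hypothetical toy animal with 8 SNPs (not data from the study).
true_gt = [0, 1, 2, 2, 0, 1, 1, 0]
imputed_gt = [0, 1, 2, 1, 0, 1, 2, 0]

concordance = allele_concordance(true_gt, imputed_gt)
correlation = pearson_r(true_gt, imputed_gt)
print(f"concordance={concordance:.3f}, correlation={correlation:.3f}")
```

Concordance tends to look high even for rare alleles, because predicting the common allele everywhere is usually right; correlation penalises that behaviour more, which may be one reason studies like this report both.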


2021 ◽  
Vol 36 (Supplement_1) ◽  
Author(s):  
L Girardi ◽  
M Serdaroğulları ◽  
C Patassini ◽  
S Caroselli ◽  
M Costa ◽  
...  

Abstract Study question: What is the effect of varying diagnostic thresholds on the accuracy of Next Generation Sequencing (NGS)-based preimplantation genetic testing for aneuploidies (PGT-A)?
Summary answer: When single trophectoderm biopsies are tested, the employment of an 80% upper threshold increases mosaic calls and false negative aneuploidy results compared to more stringent thresholds.
What is known already: Trophectoderm (TE) biopsy coupled with NGS-based PGT-A technologies is able to accurately predict the Inner Cell Mass (ICM) constitution when uniform whole chromosome aneuploidies are considered. However, minor technical and biological inconsistencies in NGS procedures and biopsy specimens can result in subtle variability in analytical results. In this context, the stringency of thresholds employed for diagnostic calls can lead to incorrect classification of uniformly aneuploid embryos into the mosaic category, ultimately affecting PGT-A accuracy. In this study, we evaluated the diagnostic predictivity of different aneuploidy classification criteria by employing blinded analysis of chromosome copy number values (CNV) in multifocal blastocyst biopsies.
Study design, size, duration: The accuracy of different aneuploidy diagnostic cut-offs was assessed comparing chromosomal CNV in intra-blastocyst multifocal biopsies. Enrolled embryos were donated for research between June and September 2020. The Institutional Review Board at the Near East University approved the study (project: YDU/2019/70–849). Embryos diagnosed with uniform chromosomal alterations (single or multiple) in their clinical TE biopsy (n = 27) were disaggregated into 5 portions: the ICM and 4 TE biopsies. Overall, 135 specimens were collected and analysed.
Participants/materials, setting, methods: Twenty-seven donated blastocysts were warmed and disaggregated into TE biopsies and ICM (n = 135 biopsies). PGT-A analysis was performed using Ion ReproSeq PGS kit and Ion S5 sequencer (ThermoFisher). Sequencing data were blindly analysed with Ion-Reporter software. Intra-blastocyst comparison of raw NGS data was performed employing different thresholds commonly used for aneuploidy classification. CNV for each chromosome were reported as aneuploid according to 70% or 80% thresholds. Categorical variables were compared using Fisher’s exact test.
Main results and the role of chance: In this study, a total of 50 aneuploid patterns in 27 disaggregated embryos were explored. Single TE biopsy results were considered as true positive when they displayed the same alteration detected in the ICM at levels above the 70% or 80% thresholds. Alternatively, alterations detected in the euploid or mosaic range were considered as false negative aneuploidy results. When the 70% threshold was applied, aneuploidy findings were confirmed in 94.5% of TE biopsies analyzed (n = 189/200; 95%CI=90.37–97.22), while 5.5% showed a mosaic profile (50–70%) but uniformly abnormal ICM. Positive (PPV) and negative predictive value (NPV) per chromosome were 100.0% (n = 189/189; 95%CI=98.07–100.00) and 99.5% (n = 2192/2203; 95%CI=99.11–99.75) respectively. When the upper cut-off was experimentally placed at 80% of abnormal cells, a significant decrease (p-value=0.0097) in the percentage of confirmed aneuploid calls was observed (86.5%; n = 173/200; 95%CI=80.97–90.91), resulting in mosaicism overcalling, especially in the high range (50–80%). Less stringent thresholds led to extremely high PPV (100.0%; n = 173/173; 95%CI=97.89–100.00), while NPV decreased to 98.8% (n = 2192/2219; 95%CI=98.30–99.23). Furthermore, no additional true mosaic patterns were identified with the use of wide range thresholds for aneuploidy classification.
Limitations, reasons for caution: This approach involved the analysis of aneuploidy CNV thresholds at the embryo level and lacked genotyping-based confirmation analysis. Moreover, aneuploid embryos with known meiotic partial deletion/duplication were not included.
Wider implications of the findings: The use of wide thresholds for detecting intermediate chromosomal CNV up to 80% does not improve PGT-A ability to discriminate true mosaic from uniformly aneuploid embryos, lowering overall diagnostic accuracy. Hence, a proportion of the embryos diagnosed as mosaic using wide calling thresholds may actually be uniformly aneuploid and inadvertently transferred.
Trial registration number: N/A
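The percentages reported in this abstract follow directly from the stated counts, and the arithmetic can be sketched as below. This is an illustrative recomputation, not the authors' analysis code: the helper names are hypothetical, and the 2x2 table passed to the Fisher's exact test (confirmed versus not confirmed calls under each threshold) is an assumption about how the comparison was set up.

```python
from math import comb

def predictive_value(correct, total):
    """PPV = TP / (TP + FP) or NPV = TN / (TN + FN), as a percentage."""
    return 100.0 * correct / total

def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]].

    Sums the hypergeometric probabilities of all tables with the same
    margins that are no more likely than the observed table.
    """
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        return comb(row1, x) * comb(row2, col1 - x) / denom

    p_observed = prob(a)
    lowest, highest = max(0, col1 - row2), min(row1, col1)
    return sum(
        prob(x)
        for x in range(lowest, highest + 1)
        if prob(x) <= p_observed * (1 + 1e-9)  # tolerance for float ties
    )

# Counts taken from the abstract.
confirmed_70 = predictive_value(189, 200)   # confirmed calls, 70% threshold
confirmed_80 = predictive_value(173, 200)   # confirmed calls, 80% threshold
npv_70 = predictive_value(2192, 2203)
npv_80 = predictive_value(2192, 2219)

# Assumed table: rows are thresholds, columns are confirmed / not confirmed.
p_value = fisher_exact_two_sided(189, 200 - 189, 173, 200 - 173)

print(f"confirmed: {confirmed_70:.1f}% vs {confirmed_80:.1f}%")
print(f"NPV: {npv_70:.1f}% vs {npv_80:.1f}%")
print(f"Fisher two-sided p = {p_value:.4f}")
```

The NPV denominator grows from 2203 to 2219 at the 80% threshold because the 16 additional calls that drop out of the confirmed-aneuploid group are counted as false negatives.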


Author(s):  
Wei Zhang ◽  
Longlong Wang ◽  
Ke Liu ◽  
Xiaofeng Wei ◽  
Kai Yang ◽  
...  

Abstract Motivation T and B cell receptors (TCRs and BCRs) play a pivotal role in the adaptive immune system by recognizing an enormous variety of external and internal antigens. Understanding these receptors is critical for exploring the process of immunoreaction and exploiting potential applications in immunotherapy and antibody drug design. Although a large number of samples have had their TCR and BCR repertoires sequenced using high-throughput sequencing in recent years, very few databases have been constructed to store these kinds of data. To resolve this issue, we developed a database. Results We developed a database, the Pan Immune Repertoire Database (PIRD), located in China National GeneBank (CNGBdb), to collect and store annotated TCR and BCR sequencing data, including from Homo sapiens and other species. In addition to data storage, PIRD also provides functions of data visualization and interactive online analysis. Additionally, a manually curated database of TCRs and BCRs targeting known antigens (TBAdb) was also deposited in PIRD. Availability and implementation PIRD can be freely accessed at https://db.cngb.org/pird.


2019 ◽  
Vol 6 (1) ◽  
Author(s):  
Fei Chen ◽  
Yunfeng Song ◽  
Xiaojiang Li ◽  
Junhao Chen ◽  
Lan Mo ◽  
...  

Abstract Horticultural plants play various and critical roles for humans by providing fruits, vegetables, materials for beverages, and herbal medicines and by acting as ornamentals. They have also shaped human art, culture, and environments and thereby have influenced the lifestyles of humans. With the advent of sequencing technologies, there has been a dramatic increase in the number of sequenced genomes of horticultural plant species in the past decade. The genomes of horticultural plants are highly diverse and complex, often with a high degree of heterozygosity and a high ploidy due to their long and complex history of evolution and domestication. Here we summarize the advances in the genome sequencing of horticultural plants, the reconstruction of pan-genomes, and the development of horticultural genome databases. We also discuss past, present, and future studies related to genome sequencing, data storage, data quality, data sharing, and data visualization to provide practical guidance for genomic studies of horticultural plants. Finally, we propose a horticultural plant genome project as well as the roadmap and technical details toward three goals of the project.

