SPEARS: Standard Performance Evaluation of Ancestral haplotype Reconstruction through Simulation

Author(s):  
Heather Manching ◽  
Randall J Wisser

Abstract Motivation Ancestral haplotype maps provide useful information about genomic variation and insights into biological processes. Reconstructing the descendent haplotype structure of homologous chromosomes, particularly for large numbers of individuals, can help with characterizing the recombination landscape, elucidating genotype-to-phenotype relationships, improving genomic predictions and more. Inferring haplotype maps from sparse genotype data is an efficient approach to whole-genome haplotyping, but this is a non-trivial problem. A standardized approach is needed to validate whether haplotype reconstruction software, conceived population designs and existing data for a given population provide accurate haplotype information for further inference. Results We introduce SPEARS, a pipeline for the simulation-based appraisal of genome-wide haplotype maps constructed from sparse genotype data. Using a specified pedigree, the pipeline generates virtual genotypes (known data) with genotyping errors and missing data structure. It then proceeds to mimic analysis in practice, capturing sources of error due to genotyping, imputation and haplotype inference. Standard metrics allow researchers to assess different population designs and which features of haplotype structure or regions of the genome are sufficiently accurate for analysis. Haplotype maps for 1000 outcross progeny from a multi-parent population of maize are used to demonstrate SPEARS. Availability and implementation SPEARS, the protocol and suite of scripts, are publicly available under an MIT license at GitHub (https://github.com/maizeatlas/spears). Supplementary information Supplementary data are available at Bioinformatics online.
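The simulate-then-score idea described in this abstract can be illustrated with a toy sketch. This is not the SPEARS pipeline: the function names, the error and missingness rates, the naive mode-based imputation, and the concordance metric below are all assumptions chosen for illustration, and the example covers a single marker scored as 0/1/2 across individuals rather than full haplotype reconstruction.

```python
import random

def simulate_observed(truth, error_rate, missing_rate, rng):
    """Copy the true genotypes, adding random miscalls and missing calls (None)."""
    observed = []
    for g in truth:
        if rng.random() < missing_rate:
            observed.append(None)  # call dropped
        elif rng.random() < error_rate:
            # genotyping error: replace with one of the two other genotype codes
            observed.append(rng.choice([x for x in (0, 1, 2) if x != g]))
        else:
            observed.append(g)
    return observed

def impute_mode(observed):
    """Naive stand-in for imputation: fill missing calls with the most common call."""
    called = [g for g in observed if g is not None]
    mode = max(set(called), key=called.count) if called else 0
    return [mode if g is None else g for g in observed]

def concordance(truth, inferred):
    """Fraction of individuals whose inferred genotype equals the known truth."""
    matches = sum(1 for t, i in zip(truth, inferred) if t == i)
    return matches / len(truth)

rng = random.Random(1)  # fixed seed so the run is repeatable
truth = [rng.choice((0, 1, 2)) for _ in range(1000)]
observed = simulate_observed(truth, error_rate=0.01, missing_rate=0.1, rng=rng)
imputed = impute_mode(observed)
print(f"concordance after imputation: {concordance(truth, imputed):.3f}")
```

Because the truth is known by construction, every downstream step can be scored against it, which is the property a simulation-based appraisal relies on.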

2020 ◽  
Author(s):  
H. Manching ◽  
R. J. Wisser

Motivation Ancestral haplotype maps provide useful information about genomic variation and biological processes. Reconstructing the descendent haplotype structure of homologous chromosomes, particularly for large numbers of individuals, can help with characterizing the recombination landscape, elucidating genotype-to-phenotype relationships, improving genomic predictions and more. Inferring haplotype maps from sparse genotype data is an efficient approach to whole-genome haplotyping, but this is a non-trivial problem. A standardized approach is needed to validate whether haplotype reconstruction software, conceived population designs and existing data for a given population provide accurate haplotype information for further inference. Results We introduce SPEARS, a pipeline for whole simulation-based appraisal of genome-wide ancestral haplotype inference. The pipeline generates virtual genotypes (truth data) with real-world missing data structure. It then proceeds to mimic analysis in practice, capturing sources of error due to imputation and reconstruction of ancestral haplotypes. Standard metrics allow researchers to assess which features of haplotype structure or regions of the genome are sufficiently accurate for analysis and reporting. Haplotype maps for 1,000 outcross progeny from a multi-parent population of maize are used to demonstrate SPEARS. Availability https://github.com/maizeatlas/spears


2019 ◽  
Vol 35 (19) ◽  
pp. 3852-3854 ◽  
Author(s):  
You Tang ◽  
Xiaolei Liu

Abstract Motivation Many genome-wide association study (GWAS) methods have been developed for mapping genetic markers that are associated with human diseases and agricultural economic traits. Computer simulation is a useful tool for testing the performance of various GWAS methods under given scenarios. Existing tools are either inefficient in terms of computation and memory or inconvenient to use for simulating big, realistic genotype and phenotype data to evaluate available GWAS methods. Results Here, we present a GWAS simulation tool named G2P that can be used to simulate genotype data and phenotype data and to perform power evaluation of GWAS methods. G2P is a user-friendly tool with all functions provided in both graphical user interface and pipeline manners, and it is available for Windows, Mac and Linux environments. Furthermore, G2P achieves maximum efficiency in terms of both memory usage and simulation speed; with G2P, the simulation of genotype data that includes 1 000 000 samples and 2 000 000 markers can be accomplished in 5 h. Availability and implementation The G2P software, user manual, and example datasets are freely available at GitHub: https://github.com/XiaoleiLiuBio/G2P. Supplementary information Supplementary data are available at Bioinformatics online.


2020 ◽  
Author(s):  
Francisco J. Esteban ◽  
Peter J. Tonellato ◽  
Dennis P. Wall

Abstract The genetic heterogeneity of autism has stymied the search for causes and cures. Even whole-genomic studies on large numbers of families have yielded results of relatively little impact. In the present work, we analyze two genomic databases using a novel strategy that takes prior knowledge of genetic relationships into account and that was designed to boost signal important to our understanding of the molecular basis of autism. Our strategy was designed to identify significant genomic variation within a priori defined biological concepts and improves signal detection while lessening the severity of multiple test correction seen in standard analysis of genome-wide association data. Upon application of our approach using 3,244 biological concepts, we detected genomic variation in 68 biological concepts with significant association to autism in comparison to family-based controls. These concepts clustered naturally into a total of 19 classes, principally including cell adhesion, cancer, and immune response. The top-ranking concepts contained high percentages of genes already suspected to play roles in autism or in a related neurological disorder. In addition, many of the sets associated with autism at the DNA level also proved to be predictive of changes in gene expression within a separate population of autistic cases, suggesting that the signature of genomic variation may also be detectable in blood-based transcriptional profiles. This robust cross-validation with gene expression data from individuals with autism coupled with the enrichment within autism-related neurological disorders supported the possibility that the mutations play important roles in the onset of autism and should be given priority for further study.
In sum, our work provides new leads into the genetic underpinnings of autism and highlights the importance of reanalysis of genomic studies of complex disease using prior knowledge of genetic organization. Author Summary The genetic heterogeneity of autism has stymied the search for causes and cures. Even whole-genomic studies on large numbers of families have yielded results of relatively little impact. In the present work, we reanalyze two of the most influential whole-genomic studies using a novel strategy that takes prior knowledge of genetic relationships into account in an effort to boost signal important to our understanding of the molecular structure of autism. Our approach demonstrates that these genome-wide association studies contain more information relevant to autism than previously realized. We detected 68 highly significant collections of mutations that map to genes with measurable and significant changes in gene expression in autistic individuals, and that have been implicated in other neurological disorders believed to be closely related, and genetically linked, to autism. Our work provides leads into the genetic underpinnings of autism and highlights the importance of reanalysis of genomic studies of disease using prior knowledge of genetic organization.


2020 ◽  
Author(s):  
Christopher M. Ward ◽  
Alastair J. Ludington ◽  
James Breen ◽  
Simon W. Baxter

Abstract The analysis and interpretation of datasets generated through sequencing large numbers of individual genomes is becoming commonplace in population and evolutionary genetic studies. Here we introduce geaR, a modular R package for evolutionary analysis of genome-wide genotype data. The package leverages the Genomic Data Structure (GDS) format, which enables memory- and time-efficient querying of genotype datasets compared to standard VCF genotype files. geaR utilizes GRange object classes to partition an analysis based on features from GFF annotation files, select codons based on position or degeneracy, and construct both positional and coordinate genomic windows. Tests of genetic diversity (e.g. dXY, π, FST) and admixture, along with tree building and sequence output, can be carried out on partitions using a single function regardless of sample ploidy or number of observed alleles. The package and associated documentation are available on GitHub at https://github.com/CMWbio/geaR.


Author(s):  
Yanrong Ji ◽  
Zhihan Zhou ◽  
Han Liu ◽  
Ramana V Davuluri

Abstract Motivation Deciphering the language of non-coding DNA is one of the fundamental problems in genome research. Gene regulatory code is highly complex due to the existence of polysemy and distant semantic relationship, which previous informatics methods often fail to capture, especially in data-scarce scenarios. Results To address this challenge, we developed a novel pre-trained bidirectional encoder representation, named DNABERT, to capture global and transferrable understanding of genomic DNA sequences based on up- and downstream nucleotide contexts. We compared DNABERT to the most widely used programs for genome-wide regulatory elements prediction and demonstrate its ease of use, accuracy and efficiency. We show that the single pre-trained transformers model can simultaneously achieve state-of-the-art performance on prediction of promoters, splice sites and transcription factor binding sites, after easy fine-tuning using small task-specific labeled data. Further, DNABERT enables direct visualization of nucleotide-level importance and semantic relationship within input sequences for better interpretability and accurate identification of conserved sequence motifs and functional genetic variant candidates. Finally, we demonstrate that DNABERT pre-trained on the human genome can even be readily applied to other organisms with exceptional performance. We anticipate that the pre-trained DNABERT model can be fine-tuned for many other sequence analysis tasks. Availability and implementation The source code and the pre-trained and fine-tuned models for DNABERT are available at GitHub (https://github.com/jerryji1993/DNABERT). Supplementary information Supplementary data are available at Bioinformatics online.


2021 ◽  
Vol 21 (1) ◽  
Author(s):  
Kelly B. Klingler ◽  
Joshua P. Jahner ◽  
Thomas L. Parchman ◽  
Chris Ray ◽  
Mary M. Peacock

Abstract Background Distributional responses by alpine taxa to repeated, glacial-interglacial cycles throughout the last two million years have significantly influenced the spatial genetic structure of populations. These effects have been exacerbated for the American pika (Ochotona princeps), a small alpine lagomorph constrained by thermal sensitivity and a limited dispersal capacity. As a species of conservation concern, long-term lack of gene flow has important consequences for landscape genetic structure and levels of diversity within populations. Here, we use reduced representation sequencing (ddRADseq) to provide a genome-wide perspective on patterns of genetic variation across pika populations representing distinct subspecies. To investigate how landscape and environmental features shape genetic variation, we collected genetic samples from distinct geographic regions as well as across finer spatial scales in two geographically proximate mountain ranges of eastern Nevada. Results Our genome-wide analyses corroborate range-wide, mitochondrial subspecific designations and reveal pronounced fine-scale population structure between the Ruby Mountains and East Humboldt Range of eastern Nevada. Populations in Nevada were characterized by low genetic diversity (π = 0.0006–0.0009; θW = 0.0005–0.0007) relative to populations in California (π = 0.0014–0.0019; θW = 0.0011–0.0017) and the Rocky Mountains (π = 0.0025–0.0027; θW = 0.0021–0.0024), indicating substantial genetic drift in these isolated populations. Tajima’s D was positive for all sites (D = 0.240–0.811), consistent with recent contraction in population sizes range-wide. Conclusions Substantial influences of geography, elevation and climate variables on genetic differentiation were also detected and may interact with the regional effects of anthropogenic climate change to force the loss of unique genetic lineages through continued population extirpations in the Great Basin and Sierra Nevada.
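The three summary statistics reported in this abstract (nucleotide diversity π, Watterson's θW and Tajima's D) can be computed from an alignment as in the following minimal sketch. It uses the standard textbook definitions, not the authors' ddRADseq workflow; the function name and the five-base toy alignment are invented for illustration, and the code assumes at least four equal-length sequences with no missing data.

```python
from itertools import combinations
from math import sqrt

def diversity_stats(seqs):
    """Return per-site pi, per-site Watterson's theta, Tajima's D and S
    for a list of equal-length aligned haplotype strings (n >= 4 assumed)."""
    n = len(seqs)
    length = len(seqs[0])
    # S: number of segregating (polymorphic) alignment columns.
    S = sum(1 for column in zip(*seqs) if len(set(column)) > 1)
    # k: mean number of differences over all pairs of sequences.
    pairs = list(combinations(seqs, 2))
    k = sum(sum(a != b for a, b in zip(s1, s2)) for s1, s2 in pairs) / len(pairs)
    a1 = sum(1 / i for i in range(1, n))
    a2 = sum(1 / i ** 2 for i in range(1, n))
    pi = k / length
    theta_w = S / a1 / length
    if S == 0:
        tajima_d = None  # D is undefined without segregating sites
    else:
        # Constants of the usual Tajima's D variance estimate.
        b1 = (n + 1) / (3 * (n - 1))
        b2 = 2 * (n * n + n + 3) / (9 * n * (n - 1))
        c1 = b1 - 1 / a1
        c2 = b2 - (n + 2) / (a1 * n) + a2 / a1 ** 2
        e1 = c1 / a1
        e2 = c2 / (a1 ** 2 + a2)
        tajima_d = (k - S / a1) / sqrt(e1 * S + e2 * S * (S - 1))
    return {"pi": pi, "theta_w": theta_w, "tajima_d": tajima_d, "S": S}

# Hypothetical toy alignment of four haplotypes, five sites each.
seqs = ["AAAAA", "AAAAT", "AAATT", "AATTT"]
stats = diversity_stats(seqs)
print(stats)
```

D is positive when π exceeds θW, that is, when intermediate-frequency variants are over-represented relative to rare ones, which is the pattern the abstract interprets as consistent with recent population contraction.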


GigaScience ◽  
2021 ◽  
Vol 10 (1) ◽  
Author(s):  
Taras K Oleksyk ◽  
Walter W Wolfsberger ◽  
Alexandra M Weber ◽  
Khrystyna Shchubelka ◽  
Olga T Oleksyk ◽  
...  

Abstract Background The main goal of this collaborative effort is to provide genome-wide data for the previously underrepresented population in Eastern Europe, and to provide cross-validation of the data from genome sequences and genotypes of the same individuals acquired by different technologies. We collected 97 genome-grade DNA samples from individuals representing major regions of Ukraine who consented to public data release. BGISEQ-500 sequence data and genotypes by an Illumina GWAS chip were cross-validated on multiple samples and additionally referenced to 1 sample that has been resequenced by Illumina NovaSeq6000 S4 at high coverage. Results The genome data have been searched for genomic variation represented in this population, and a number of variants have been reported: large structural variants, indels, copy number variations, single-nucleotide polymorphisms, and microsatellites. To our knowledge, this study provides the largest to-date survey of genetic variation in Ukraine, creating a public reference resource aiming to provide data for medical research in a large understudied population. Conclusions Our results indicate that the genetic diversity of the Ukrainian population is uniquely shaped by evolutionary and demographic forces and cannot be ignored in future genetic and biomedical studies. These data will contribute a wealth of new information, bringing forth many novel, endemic and medically related alleles.


Author(s):  
Julia Markowski ◽  
Rieke Kempfer ◽  
Alexander Kukalev ◽  
Ibai Irastorza-Azcarate ◽  
Gesa Loof ◽  
...  

Abstract Motivation Genome Architecture Mapping (GAM) was recently introduced as a digestion- and ligation-free method to detect chromatin conformation. Orthogonal to existing approaches based on chromatin conformation capture (3C), GAM’s ability to capture both inter- and intra-chromosomal contacts from low amounts of input data makes it particularly well suited for allele-specific analyses in a clinical setting. Allele-specific analyses are powerful tools to investigate the effects of genetic variants on many cellular phenotypes including chromatin conformation, but require the haplotypes of the individuals under study to be known a priori. So far, however, no algorithm exists for haplotype reconstruction and phasing of genetic variants from GAM data, hindering the allele-specific analysis of chromatin contact points in non-model organisms or individuals with unknown haplotypes. Results We present GAMIBHEAR, a tool for accurate haplotype reconstruction from GAM data. GAMIBHEAR aggregates allelic co-observation frequencies from GAM data and employs a GAM-specific probabilistic model of haplotype capture to optimise phasing accuracy. Using a hybrid mouse embryonic stem cell line with known haplotype structure as a benchmark dataset, we assess correctness and completeness of the reconstructed haplotypes, and demonstrate the power of GAMIBHEAR to infer accurate genome-wide haplotypes from GAM data. Availability GAMIBHEAR is available as an R package under the open source GPL-2 license at https://bitbucket.org/schwarzlab/gamibhear Maintainer [email protected] Supplementary information Supplementary information is available at Bioinformatics online.


Author(s):  
Zachary F Gerring ◽  
Angela Mina-Vargas ◽  
Eric R Gamazon ◽  
Eske M Derks

Abstract Motivation Genome-wide association studies have successfully identified multiple independent genetic loci that harbour variants associated with human traits and diseases, but the exact causal genes are largely unknown. Common genetic risk variants are enriched in non-protein-coding regions of the genome and often affect gene expression (expression quantitative trait loci, eQTL) in a tissue-specific manner. To address this challenge, we developed a methodological framework, E-MAGMA, which converts genome-wide association summary statistics into gene-level statistics by assigning risk variants to their putative genes based on tissue-specific eQTL information. Results We compared E-MAGMA to three eQTL-informed gene-based approaches using simulated phenotype data. Phenotypes were simulated based on eQTL reference data using GCTA for all genes with at least one eQTL on chromosome 1. We performed 10 simulations per gene. The eQTL-h2 (i.e., the proportion of variation explained by the eQTLs) was set at 1%, 2%, and 5%. We found that E-MAGMA outperforms other gene-based approaches across a range of simulated parameters (e.g. the number of identified causal genes). When applied to genome-wide association summary statistics for five neuropsychiatric disorders, E-MAGMA identified more putative candidate causal genes compared to other eQTL-based approaches. By integrating tissue-specific eQTL information, these results show E-MAGMA will help to identify novel candidate causal genes from genome-wide association summary statistics and thereby improve the understanding of the biological basis of complex disorders. Availability A tutorial and input files are made available in a GitHub repository: https://github.com/eskederks/eMAGMA-tutorial. Supplementary information Supplementary data are available at Bioinformatics online.
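The variant-to-gene assignment step described in this abstract can be sketched as follows. This is only an illustration of grouping GWAS variants by eQTL-linked gene: the variant and gene identifiers are made up, the function names are hypothetical, and the Bonferroni-adjusted minimum p-value is a deliberately simple stand-in that does not reproduce the gene-level model used by E-MAGMA.

```python
from collections import defaultdict

def assign_variants_to_genes(gwas_pvalues, eqtl_pairs):
    """Group GWAS p-values by the gene each variant is an eQTL for.

    gwas_pvalues: dict mapping variant ID -> GWAS p-value.
    eqtl_pairs: iterable of (variant ID, gene ID) pairs for one tissue.
    Variants without a GWAS p-value are skipped.
    """
    by_gene = defaultdict(dict)
    for variant, gene in eqtl_pairs:
        if variant in gwas_pvalues:
            by_gene[gene][variant] = gwas_pvalues[variant]
    return dict(by_gene)

def min_p_gene_stat(variant_pvalues):
    """Bonferroni-adjusted minimum p-value (simple stand-in, not the MAGMA model)."""
    m = len(variant_pvalues)
    return min(1.0, min(variant_pvalues.values()) * m)

# Hypothetical summary statistics and eQTL annotations.
gwas = {"rs1": 0.001, "rs2": 0.20, "rs3": 0.04}
eqtls = [("rs1", "GENE_A"), ("rs2", "GENE_A"), ("rs3", "GENE_B"), ("rs9", "GENE_B")]

genes = assign_variants_to_genes(gwas, eqtls)
for gene, pvals in genes.items():
    print(gene, sorted(pvals), min_p_gene_stat(pvals))
```

The design point the sketch captures is that variants are attached to genes by regulatory evidence rather than by physical proximity, so a variant far from a gene can still contribute to that gene's statistic.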


BMC Genomics ◽  
2021 ◽  
Vol 22 (1) ◽  
Author(s):  
Cooper J. Park ◽  
Nicole A. Caimi ◽  
Debbie C. Buecher ◽  
Ernest W. Valdez ◽  
Diana E. Northup ◽  
...  

Abstract Background Antibiotic-producing Streptomyces bacteria are ubiquitous in nature, yet most studies of their diversity have focused on free-living strains inhabiting diverse soil environments and those in symbiotic relationships with invertebrates. Results We studied the draft genomes of 73 Streptomyces isolates sampled from the skin (wing and tail membranes) and fur surfaces of bats collected in Arizona and New Mexico. We uncovered large genomic variation and biosynthetic potential, even among closely related strains. The isolates, which were initially identified as three distinct species based on sequence variation in the 16S rRNA locus, could be distinguished as 41 different species based on genome-wide average nucleotide identity. Of the 32 biosynthetic gene cluster (BGC) classes detected, non-ribosomal peptide synthetases, siderophores, and terpenes were present in all genomes. On average, Streptomyces genomes carried 14 distinct classes of BGCs (range = 9–20). Results also revealed large inter- and intra-species variation in gene content (single nucleotide polymorphisms, accessory genes and singletons) and BGCs, further contributing to the overall genetic diversity present in bat-associated Streptomyces. Finally, we show that genome-wide recombination has partly contributed to the large genomic variation among strains of the same species. Conclusions Our study provides an initial genomic assessment of bat-associated Streptomyces that will be critical to prioritizing those strains with the greatest ability to produce novel antibiotics. It also highlights the need to recognize within-species variation as an important factor in genetic manipulation studies, diversity estimates and drug discovery efforts in Streptomyces.

