AlphaFamImpute: high-accuracy imputation in full-sib families from genotype-by-sequencing data

2020 ◽  
Vol 36 (15) ◽  
pp. 4369-4371
Author(s):  
Andrew Whalen ◽  
Gregor Gorjanc ◽  
John M Hickey

Abstract Summary AlphaFamImpute is an imputation package for calling, phasing and imputing genome-wide genotypes in outbred full-sib families from single nucleotide polymorphism (SNP) array and genotype-by-sequencing (GBS) data. GBS data are increasingly being used to genotype individuals, especially when SNP arrays do not exist for a population of interest. Low-coverage GBS produces data with a large number of missing or incorrect naïve genotype calls, which can be improved by identifying shared haplotype segments between full-sib individuals. Here, we present AlphaFamImpute, an algorithm specifically designed to exploit the genetic structure of full-sib families. It performs imputation using a two-step approach. In the first step, it phases and imputes parental genotypes based on the segregation states of their offspring (i.e. which pair of parental haplotypes the offspring inherited). In the second step, it phases and imputes the offspring genotypes by detecting which haplotype segments the offspring inherited from their parents. With a series of simulations, we find that AlphaFamImpute obtains high-accuracy genotypes, even when the parents are not genotyped and individuals are sequenced at <1x coverage. Availability and implementation AlphaFamImpute is available as a Python package from the AlphaGenes website http://www.AlphaGenes.roslin.ed.ac.uk/AlphaFamImpute. Supplementary information Supplementary data are available at Bioinformatics online.
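To illustrate why low-coverage GBS yields missing or incorrect naïve genotype calls, the following is a minimal sketch of per-locus genotype likelihoods from read counts. It is not AlphaFamImpute's code: the function names, the binomial read model, the flat prior and the 1% error rate are all assumptions made for this example.

```python
from math import comb


def genotype_likelihoods(ref_reads, alt_reads, error=0.01):
    """Return P(reads | genotype) for genotypes carrying 0, 1 or 2 alternative alleles.

    Assumes (for illustration only) independent reads and a symmetric
    sequencing error rate.
    """
    n = ref_reads + alt_reads
    # Probability that one read shows the alternative allele, per genotype.
    p_alt = (error, 0.5, 1.0 - error)
    return [comb(n, alt_reads) * p ** alt_reads * (1 - p) ** ref_reads for p in p_alt]


def genotype_posterior(ref_reads, alt_reads, error=0.01):
    """Normalise the likelihoods under a flat prior over the three genotypes."""
    likes = genotype_likelihoods(ref_reads, alt_reads, error)
    total = sum(likes)
    return [like / total for like in likes]


no_reads = genotype_posterior(0, 0)   # locus with no coverage
one_alt = genotype_posterior(0, 1)    # locus covered by a single alternative read
```

With no reads the three genotypes are equally likely, and with a single alternative read the heterozygote still keeps about one third of the probability. A naïve call of "homozygous alternative" would therefore often be wrong, which is the gap that pooling information across parents and full-sibs is meant to close.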

2019 ◽  
Author(s):  
Andrew Whalen ◽  
Gregor Gorjanc ◽  
John M Hickey

Abstract Summary AlphaFamImpute is an imputation package for calling, phasing, and imputing genome-wide genotypes in outbred full-sib families from single nucleotide polymorphism (SNP) array and genotype-by-sequencing (GBS) data. GBS data are increasingly being used to genotype individuals, especially when SNP arrays do not exist for a population of interest. Low-coverage GBS produces data with a large number of missing or incorrect naïve genotype calls, which can be improved by identifying shared haplotype segments between full-sib individuals. Here we present AlphaFamImpute, an algorithm specifically designed to exploit the genetic structure of full-sib families. It performs imputation using a two-step approach. In the first step it phases and imputes parental genotypes based on the segregation states of their offspring (that is, which pair of parental haplotypes the offspring inherited). In the second step it phases and imputes the offspring genotypes by detecting which haplotype segments the offspring inherited from their parents. With a series of simulations we find that AlphaFamImpute obtains high accuracy genotypes, even when the parents are not genotyped and individuals are sequenced at less than 1x coverage. Availability and implementation AlphaFamImpute is available as a Python package from the AlphaGenes website, http://www.AlphaGenes.roslin.ed.ac.uk/[email protected] Supplementary information A complete description of the methods is available in the supplementary information.


2017 ◽  
Author(s):  
Zilu Zhou ◽  
Weixin Wang ◽  
Li-San Wang ◽  
Nancy Ruonan Zhang

Abstract Motivation Copy number variations (CNVs) are gains and losses of DNA segments and have been associated with disease. Many large-scale genetic association studies are performing CNV analysis using whole exome sequencing (WES) and whole genome sequencing (WGS). In many of these studies, previous SNP-array data are available. An integrated cross-platform analysis is expected to improve resolution and accuracy, yet there is no tool for effectively combining data from sequencing and array platforms. The detection of CNVs using sequencing data alone can also be further improved by the utilization of allele-specific reads. Results We propose a statistical framework, integrated Copy Number Variation detection algorithm (iCNV), which can be applied to multiple study designs: WES only, WGS only, SNP array only, or any combination of SNP and sequencing data. iCNV applies platform-specific normalization, utilizes allele-specific reads from sequencing and integrates matched NGS and SNP-array data by a Hidden Markov Model (HMM). We compare integrated two-platform CNV detection using iCNV to naive intersection or union of platforms and show that iCNV increases sensitivity and robustness. We also assess the accuracy of iCNV on WGS data only, and show that the utilization of allele-specific reads improves CNV detection accuracy compared to existing methods. Availability https://github.com/zhouzilu/[email protected], [email protected] Supplementary information Supplementary data are available at Bioinformatics online.
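As a rough illustration of how a Hidden Markov Model can segment copy-number states, here is a minimal Viterbi sketch over normalized log2 intensity ratios. It is not the iCNV model: the three states, the Gaussian emission means and standard deviation, and the transition probability are hypothetical values chosen only for this example, and it handles a single data track rather than integrating platforms.

```python
import math

STATES = ("deletion", "normal", "duplication")
# Assumed emission means (log2 ratio) and shared standard deviation.
MEANS = {"deletion": -1.0, "normal": 0.0, "duplication": 0.58}
SIGMA = 0.3
STAY = 0.98  # assumed probability of remaining in the same state


def log_gauss(x, mean, sigma):
    """Log density of a Gaussian emission."""
    return -0.5 * ((x - mean) / sigma) ** 2 - math.log(sigma * math.sqrt(2 * math.pi))


def log_trans(a, b):
    """Log transition probability; leaving mass is split evenly over other states."""
    return math.log(STAY) if a == b else math.log((1 - STAY) / (len(STATES) - 1))


def viterbi(log_ratios):
    """Return the most probable state path for a sequence of log2 ratios."""
    if not log_ratios:
        return []
    # score[s] is the best log-probability of any path ending in state s.
    score = {s: -math.log(len(STATES)) + log_gauss(log_ratios[0], MEANS[s], SIGMA)
             for s in STATES}
    back = []
    for x in log_ratios[1:]:
        new_score, pointers = {}, {}
        for s in STATES:
            prev = max(STATES, key=lambda p: score[p] + log_trans(p, s))
            pointers[s] = prev
            new_score[s] = score[prev] + log_trans(prev, s) + log_gauss(x, MEANS[s], SIGMA)
        score = new_score
        back.append(pointers)
    # Trace back from the best final state.
    state = max(STATES, key=lambda s: score[s])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    return path[::-1]


path = viterbi([0.0, 0.05, -1.1, -0.9, -1.0, 0.02])
```

The high self-transition probability is what keeps isolated noisy probes from being called as CNVs, while a run of consistently low ratios is still segmented as a deletion.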


2020 ◽  
Author(s):  
Alli L. Gombolay ◽  
Francesca Storici

Abstract Ribose-Map is a user-friendly, standardized bioinformatics toolkit for the comprehensive analysis of ribonucleotide sequencing experiments. It allows researchers to map the locations of ribonucleotides in DNA to single-nucleotide resolution and identify biological signatures of ribonucleotide incorporation. In addition, it can be applied to data generated using any currently available high-throughput ribonucleotide sequencing technique, thus standardizing the analysis of ribonucleotide sequencing experiments and allowing direct comparisons of results. This protocol describes in detail how to use Ribose-Map to analyze raw ribonucleotide sequencing data, including preparing the reads for analysis, locating the genomic coordinates of ribonucleotides, exploring the genome-wide distribution of ribonucleotides, determining the nucleotide sequence context of ribonucleotides, and identifying hotspots of ribonucleotide incorporation. Ribose-Map does not require background knowledge of ribonucleotide sequencing analysis and assumes only basic command-line skills. The protocol requires less than 3 hr of computing time for most datasets and about 30 min of hands-on time.


2021 ◽  
Author(s):  
Scott T O’Donnell ◽  
Sorel T Fitz-Gibbon ◽  
Victoria L Sork

Abstract Ancient introgression can be an important source of genetic variation that shapes the evolution and diversification of many taxa. Here, we estimate the timing, direction and extent of gene flow between two distantly related oak species in the same section (Quercus sect. Quercus). We estimated these demographic events using genotyping-by-sequencing (GBS) data, which generated 25,702 single nucleotide polymorphisms (SNPs) for 24 individuals of California scrub oak (Quercus berberidifolia) and 23 individuals of Engelmann oak (Q. engelmannii). We tested several scenarios involving gene flow between these species using the diffusion approximation-based population genetic inference framework and model-testing approach of the Python package DaDi. We found that the most likely demographic scenario includes a bottleneck in Q. engelmannii that coincides with asymmetric gene flow from Q. berberidifolia into Q. engelmannii. Given that the timing of this gene flow coincides with the advent of a Mediterranean-type climate in the California Floristic Province, we propose that changing precipitation patterns and seasonality may have favored the introgression of climate-associated genes from the endemic into the non-endemic California oak.


2020 ◽  
Vol 36 (Supplement_1) ◽  
pp. i186-i193
Author(s):  
Matthew A Myers ◽  
Simone Zaccaria ◽  
Benjamin J Raphael

Abstract Motivation Recent single-cell DNA sequencing technologies enable whole-genome sequencing of hundreds to thousands of individual cells. However, these technologies have ultra-low sequencing coverage (<0.5× per cell), which has limited their use to the analysis of large copy-number aberrations (CNAs) in individual cells. While CNAs are useful markers in cancer studies, single-nucleotide mutations are equally important, both in cancer studies and in other applications. However, ultra-low coverage sequencing yields single-nucleotide mutation data that are too sparse for current single-cell analysis methods. Results We introduce SBMClone, a method to infer clusters of cells, or clones, that share groups of somatic single-nucleotide mutations. SBMClone uses a stochastic block model to overcome sparsity in ultra-low coverage single-cell sequencing data, and we show that SBMClone accurately infers the true clonal composition on simulated datasets with coverage as low as 0.2×. We applied SBMClone to single-cell whole-genome sequencing data from two breast cancer patients obtained using two different sequencing technologies. On the first patient, sequenced using the 10X Genomics CNV solution with sequencing coverage ≈0.03×, SBMClone recovers the major clonal composition when incorporating a small amount of additional information. On the second patient, where pre- and post-treatment tumor samples were sequenced using DOP-PCR with sequencing coverage ≈0.5×, SBMClone shows that tumor cells are present in the post-treatment sample, contrary to published analysis of this dataset. Availability and implementation SBMClone is available on the GitHub repository https://github.com/raphael-group/SBMClone. Supplementary information Supplementary data are available at Bioinformatics online.
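The idea behind a stochastic block model can be sketched as follows: each (cell block, mutation block) pair has its own probability that a mutation is observed, and a good clustering is one under which the sparse binary matrix is most likely. The code below is only an illustration under that assumption and is not SBMClone's inference procedure; the function name, the Bernoulli formulation and the toy matrix are made up for this example.

```python
import math
from collections import defaultdict


def block_log_likelihood(matrix, cell_blocks, mut_blocks):
    """Bernoulli block-model log-likelihood of a binary cell-by-mutation matrix.

    Each (cell block, mutation block) pair uses its maximum-likelihood
    density, i.e. the fraction of ones observed in that block.
    """
    ones = defaultdict(int)
    total = defaultdict(int)
    for i, row in enumerate(matrix):
        for j, value in enumerate(row):
            key = (cell_blocks[i], mut_blocks[j])
            ones[key] += value
            total[key] += 1
    loglik = 0.0
    for key, n in total.items():
        k = ones[key]
        # Blocks that are all zeros or all ones contribute log(1) = 0.
        if 0 < k < n:
            p = k / n
            loglik += k * math.log(p) + (n - k) * math.log(1 - p)
    return loglik


# Toy data: rows are cells, columns are mutations, 1 = mutation observed.
toy = [
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 1, 1],
    [0, 0, 0, 1],
]
good = block_log_likelihood(toy, [0, 0, 1, 1], [0, 0, 1, 1])
bad = block_log_likelihood(toy, [0, 1, 0, 1], [0, 0, 1, 1])
```

Even though individual cells in the toy matrix are missing mutations of their own clone, the assignment that groups cells by clone scores higher than a mixed assignment, which is the sense in which block structure can compensate for sparsity.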


2019 ◽  
Vol 35 (17) ◽  
pp. 2924-2931
Author(s):  
Mark R Zucker ◽  
Lynne V Abruzzo ◽  
Carmen D Herling ◽  
Lynn L Barron ◽  
Michael J Keating ◽  
...  

Abstract Motivation Clonal heterogeneity is common in many types of cancer, including chronic lymphocytic leukemia (CLL). Previous research suggests that the presence of multiple distinct cancer clones is associated with clinical outcome. Detection of clonal heterogeneity from high-throughput data, such as sequencing or single nucleotide polymorphism (SNP) array data, is important for gaining a better understanding of cancer and may improve prediction of clinical outcome or response to treatment. Here, we present a new method, CloneSeeker, for inferring clonal heterogeneity from sequencing data, SNP array data, or both. Results We generated simulated SNP array and sequencing data and applied CloneSeeker along with two other methods. We demonstrate that CloneSeeker is more accurate than existing algorithms at determining the number of clones, distribution of cancer cells among clones, and mutation and/or copy numbers belonging to each clone. Next, we applied CloneSeeker to SNP array data from samples of 258 previously untreated CLL patients to gain a better understanding of the characteristics of CLL tumors and to elucidate the relationship between clonal heterogeneity and clinical outcome. We found that a significant majority of CLL patients appear to have multiple clones distinguished by copy number alterations alone. We also found that the presence of multiple clones corresponded with significantly worse survival among CLL patients. These findings may prove useful for improving the accuracy of prognosis and design of treatment strategies. Availability and implementation Code available on R-Forge: https://r-forge.r-project.org/projects/CloneSeeker/ Supplementary information Supplementary data are available at Bioinformatics online.


2019 ◽  
Vol 36 (3) ◽  
pp. 713-720 ◽  
Author(s):  
Mary A Wood ◽  
Austin Nguyen ◽  
Adam J Struck ◽  
Kyle Ellrott ◽  
Abhinav Nellore ◽  
...  

Abstract Motivation The vast majority of tools for neoepitope prediction from DNA sequencing of complementary tumor and normal patient samples do not consider germline context or the potential for the co-occurrence of two or more somatic variants on the same mRNA transcript. Without consideration of these phenomena, existing approaches are likely to produce both false-positive and false-negative results, resulting in an inaccurate and incomplete picture of the cancer neoepitope landscape. We developed neoepiscope chiefly to address this issue for single nucleotide variants (SNVs) and insertions/deletions (indels). Results Herein, we illustrate how germline and somatic variant phasing affects neoepitope prediction across multiple datasets. We estimate that up to ∼5% of neoepitopes arising from SNVs and indels may require variant phasing for their accurate assessment. neoepiscope is performant, flexible and supports several major histocompatibility complex binding affinity prediction tools. Availability and implementation neoepiscope is available on GitHub at https://github.com/pdxgx/neoepiscope under the MIT license. Scripts for reproducing results described in the text are available at https://github.com/pdxgx/neoepiscope-paper under the MIT license. Additional data from this study, including summaries of variant phasing incidence and benchmarking wallclock times, are available in Supplementary Files 1, 2 and 3. Supplementary File 1 contains Supplementary Table 1, Supplementary Figures 1 and 2, and descriptions of Supplementary Tables 2–8. Supplementary File 2 contains Supplementary Tables 2–6 and 8. Supplementary File 3 contains Supplementary Table 7. 
Raw sequencing data used for the analyses in this manuscript are available from the Sequence Read Archive under accessions PRJNA278450, PRJNA312948, PRJNA307199, PRJNA343789, PRJNA357321, PRJNA293912, PRJNA369259, PRJNA305077, PRJNA306070, PRJNA82745 and PRJNA324705; from the European Genome-phenome Archive under accessions EGAD00001004352 and EGAD00001002731; and by direct request to the authors. Supplementary information Supplementary data are available at Bioinformatics online.


2018 ◽  
Vol 78 (09) ◽  
pp. 866-870 ◽  
Author(s):  
Marlena Fejzo ◽  
Daria Arzy ◽  
Rayna Tian ◽  
Kimber MacGibbon ◽  
Patrick Mullin

Abstract Introduction Hyperemesis gravidarum (HG), a pregnancy complication characterized by severe nausea and vomiting in pregnancy, occurs in up to 2% of pregnancies. It is associated with both maternal and fetal morbidity. HG is highly heritable and recurs in approximately 80% of women. In a recent genome-wide association study, it was shown that placentation, appetite, and the cachexia gene GDF15 are linked to HG. The purpose of this study was to explore whether GDF15 alleles linked to overexpression of GDF15 protein segregate with the condition in families, and whether the GDF15 risk allele is associated with recurrence of HG. Methods We analyzed GDF15 overexpression alleles for segregation with disease using exome-sequencing data from 5 HG families. We compared the allele frequency of the GDF15 risk allele, rs16982345, in patients who had recurrence of HG with its frequency in those who did not have recurrence. Results Single nucleotide polymorphisms (SNPs) linked to higher levels of GDF15 segregated with disease in HG families. The GDF15 risk allele, rs16982345, was associated with an 8-fold higher risk of recurrence of HG. Conclusion The findings of this study support the hypothesis that GDF15 is involved in the pathogenesis of both familial and recurrent cases of HG. The findings may be applicable when counseling women with a familial history of HG or recurrent HG. The GDF15-GFRAL brainstem-activated pathway was recently identified and therapies to treat conditions of abnormal appetite are under development. Based on our findings, patients carrying GDF15 variants associated with GDF15 overexpression should be included in future studies of GDF15-GFRAL-based therapeutics. If safe, this approach could reduce maternal and fetal morbidity.


2021 ◽  
Author(s):  
Aaron Wing Cheung Kwok ◽  
Chen Qiao ◽  
Rongting Huang ◽  
Mai-Har Sham ◽  
Joshua W. K. Ho ◽  
...  

Abstract Mitochondrial mutations are increasingly recognised as informative endogenous genetic markers that can be used to reconstruct cellular clonal structure using single-cell RNA or DNA sequencing data. However, there is a lack of effective computational methods to identify informative mtDNA variants in noisy and sparse single-cell sequencing data. Here we present an open source computational tool, MQuad, that accurately calls clonally informative mtDNA variants in a population of single cells, and an analysis suite for complete clonality inference, based on single-cell RNA or DNA sequencing data. Through a variety of simulated and experimental single-cell sequencing data, we showed that MQuad can identify mitochondrial variants with both high sensitivity and specificity, outperforming existing methods by a large margin. Furthermore, we demonstrated its wide applicability in different single-cell sequencing protocols, particularly in complementing single-nucleotide and copy-number variations to extract finer clonal resolution. MQuad is a Python package available via https://github.com/single-cell-genetics/MQuad.


2019 ◽  
Author(s):  
Sierra S Nishizaki ◽  
Natalie Ng ◽  
Shengcheng Dong ◽  
Robert S Porter ◽  
Cody Morterud ◽  
...  

Abstract Motivation Genome-wide association studies have revealed that 88% of disease-associated single-nucleotide polymorphisms (SNPs) reside in noncoding regions. However, noncoding SNPs remain understudied, partly because they are challenging to prioritize for experimental validation. To address this deficiency, we developed the SNP effect matrix pipeline (SEMpl). Results SEMpl estimates transcription factor-binding affinity by observing differences in chromatin immunoprecipitation followed by deep sequencing signal intensity for SNPs within functional transcription factor-binding sites (TFBSs) genome-wide. By cataloging the effects of every possible mutation within the TFBS motif, SEMpl can predict the consequences of SNPs to transcription factor binding. This knowledge can be used to identify potential disease-causing regulatory loci. Availability and implementation SEMpl is available from https://github.com/Boyle-Lab/SEM_CPP. Supplementary information Supplementary data are available at Bioinformatics online.

