polyRAD: Genotype calling with uncertainty from sequencing data in polyploids and diploids

2018 ◽  
Author(s):  
Lindsay V. Clark ◽  
Alexander E. Lipka ◽  
Erik J. Sacks

Abstract Low or uneven read depth is a common limitation of genotyping-by-sequencing (GBS) and restriction site-associated DNA sequencing (RAD-seq), resulting in high missing data rates, heterozygotes miscalled as homozygotes, and uncertainty of allele copy number in heterozygous polyploids. Bayesian genotype calling can mitigate these issues, but has previously been implemented only in software that requires a reference genome or uses priors that may be inappropriate for the population. Here we present several novel Bayesian algorithms that estimate genotype posterior probabilities, all of which are implemented in a new R package, polyRAD. Appropriate priors can be specified for mapping populations, populations in Hardy-Weinberg equilibrium, or structured populations, and in each case can be informed by genotypes at linked markers. The polyRAD software imports read depth from several existing pipelines, and outputs continuous or discrete numerical genotypes suitable for analyses such as genome-wide association and genomic prediction.
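As a rough illustration of what "genotype posterior probabilities" means here, the following is a minimal sketch of Bayesian genotype calling at one biallelic locus under a Hardy-Weinberg prior. It is not polyRAD's implementation or API: the function names, the plain binomial read-sampling model, and the `error_rate` parameter are assumptions made for this example only.

```python
from math import comb


def genotype_posteriors(alt_reads, total_reads, ploidy, alt_freq, error_rate=0.001):
    """Posterior probability of each alternative-allele copy number 0..ploidy.

    Assumptions (illustrative, not taken from polyRAD): reads are sampled
    binomially, the prior is Hardy-Weinberg with allele frequency alt_freq,
    and sequencing error flips an allele with probability error_rate.
    """
    unnormalized = []
    for copies in range(ploidy + 1):
        # Hardy-Weinberg prior: binomial draw of `copies` alleles out of `ploidy`.
        prior = comb(ploidy, copies) * alt_freq**copies * (1 - alt_freq) ** (ploidy - copies)
        # Probability that a single read shows the alternative allele.
        dosage = copies / ploidy
        p_alt = dosage * (1 - error_rate) + (1 - dosage) * error_rate
        # Binomial likelihood of the observed read counts.
        likelihood = (
            comb(total_reads, alt_reads)
            * p_alt**alt_reads
            * (1 - p_alt) ** (total_reads - alt_reads)
        )
        unnormalized.append(prior * likelihood)
    total = sum(unnormalized)
    return [value / total for value in unnormalized]


def posterior_mean_genotype(posteriors):
    """Continuous numerical genotype: the expected allele copy number."""
    return sum(copies * prob for copies, prob in enumerate(posteriors))


# Example: a tetraploid sample with 3 alternative reads out of 8.
posteriors = genotype_posteriors(alt_reads=3, total_reads=8, ploidy=4, alt_freq=0.4)
discrete_call = posteriors.index(max(posteriors))
continuous_call = posterior_mean_genotype(posteriors)
```

With zero reads the posterior equals the prior, which is how a sample with missing data can still receive a probabilistic genotype. The discrete call is the most probable copy number, and the continuous call is the posterior mean.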

2020 ◽  
Author(s):  
Lindsay V. Clark ◽  
Wittney Mays ◽  
Alexander E. Lipka ◽  
Erik J. Sacks

Abstract Given the economic and environmental importance of allopolyploids and other species with highly duplicated genomes, there is a need for accurate genotyping methodology that distinguishes paralogs in order to yield Mendelian markers. Methods such as comparing observed and expected heterozygosity are frequently used for identifying collapsed paralogs, but have limitations in genotyping-by-sequencing datasets, in which observed heterozygosity is difficult to estimate due to undersampling of alleles. These limitations are especially pronounced when the species is highly heterozygous or the expected inheritance is polysomic. We introduce a novel statistic, Hind/HE, that uses the probability of sampling reads of two different alleles at a sample*locus, instead of observed heterozygosity. The expected value of Hind/HE is the same across all loci in a dataset, regardless of read depth or allele frequency. In contrast to methods based on observed heterozygosity, it can be estimated and used for filtering loci prior to genotype calling. We also introduce an algorithm that can choose among multiple alignment locations for a given sequence tag in order to optimize the value of Hind/HE for each locus, correcting alignment errors that frequently occur in highly duplicated genomes. Our methodology is implemented in polyRAD v1.2, available at https://github.com/lvclark/polyRAD.
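The statistic described above can be sketched from read depths alone. The following is a minimal illustration, not the polyRAD v1.2 code. It assumes that Hind is the probability that two reads drawn without replacement at one sample*locus carry different alleles, that HE is computed from allele frequencies estimated as the mean of per-sample read-depth proportions, and that the locus-level value is the mean of per-sample ratios. All function names are hypothetical.

```python
def h_ind(depths):
    """Probability that two reads drawn without replacement differ in allele.

    `depths` holds the read depth of each allele at one sample*locus.
    Returns None when fewer than two reads are available.
    """
    total = sum(depths)
    if total < 2:
        return None
    same_allele_pairs = sum(d * (d - 1) for d in depths)
    return 1 - same_allele_pairs / (total * (total - 1))


def expected_heterozygosity(freqs):
    """Expected heterozygosity from allele frequencies: 1 - sum(p_i^2)."""
    return 1 - sum(f * f for f in freqs)


def hind_he_for_locus(depth_matrix):
    """Mean Hind/HE across samples for one locus.

    `depth_matrix` is a list of per-sample lists of allele read depths.
    Allele frequencies are estimated (an assumption of this sketch) as the
    mean of per-sample read-depth proportions. Returns None if the locus
    is monomorphic or no sample has at least two reads.
    """
    n_alleles = len(depth_matrix[0])
    proportions = [[d / sum(row) for d in row] for row in depth_matrix if sum(row) > 0]
    if not proportions:
        return None
    freqs = [sum(p[a] for p in proportions) / len(proportions) for a in range(n_alleles)]
    he = expected_heterozygosity(freqs)
    if he == 0:
        return None
    ratios = [h_ind(row) / he for row in depth_matrix if sum(row) >= 2]
    if not ratios:
        return None
    return sum(ratios) / len(ratios)


# Example: three samples at a biallelic locus (one heterozygote, two homozygotes).
locus_value = hind_he_for_locus([[5, 5], [10, 0], [0, 10]])
```

Because the calculation uses only read counts, a value like this can be computed before any genotypes are called, which is the property the abstract highlights for filtering loci.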


Genes ◽  
2021 ◽  
Vol 12 (7) ◽  
pp. 1074
Author(s):  
Joanna Grzegorczyk ◽  
Artur Gurgul ◽  
Maria Oczkowicz ◽  
Tomasz Szmatoła ◽  
Agnieszka Fornal ◽  
...  

Poland is the largest European producer of goose, and goose breeding has become an essential and still growing branch of the poultry industry. The most frequently bred goose is the White Kołuda® breed, constituting 95% of the country’s population, whereas geese of regional varieties are bred in smaller conservation flocks. However, goose genetic diversity remains poorly explored, mainly because the most commonly used tools are of limited use in non-model organisms. Single nucleotide polymorphisms (SNPs) are among the most accurate markers used in population genetics. A highly efficient strategy for genome-wide SNP detection is genotyping-by-sequencing (GBS), which has already been widely applied in many organisms. This study attempts to use GBS in 12 conservation goose breeds and the White Kołuda® breed maintained in Poland. The GBS method allowed for the detection of 3833 common raw SNPs. After filtering for read depth and allele characteristics, the final marker panel used for differentiation analysis comprised 791 SNPs. These variants were located within 11 different genes, and one of the most diversified variants was associated with the EDAR gene, which is especially interesting because it participates in plumage development, which plays a crucial role in goose breeding.


2018 ◽  
Author(s):  
Maziyar Baran Pouyan ◽  
Dennis Kostka

Abstract Motivation Genome-wide transcriptome sequencing applied to single cells (scRNA-seq) is rapidly becoming an assay of choice across many fields of biological and biomedical research. Scientific objectives often revolve around discovery or characterization of types or sub-types of cells, and therefore obtaining accurate cell–cell similarities from scRNA-seq data is a critical step in many studies. While rapid advances are being made in the development of tools for scRNA-seq data analysis, few approaches exist that explicitly address this task. Furthermore, the abundance and type of noise present in scRNA-seq datasets suggest that application of generic methods, or of methods developed for bulk RNA-seq data, is likely suboptimal. Results Here we present RAFSIL, a random forest based approach to learn cell–cell similarities from scRNA-seq data. RAFSIL implements a two-step procedure, where feature construction geared towards scRNA-seq data is followed by similarity learning. It is designed to be adaptable and expandable, and RAFSIL similarities can be used for typical exploratory data analysis tasks like dimension reduction, visualization, and clustering. We show that our approach compares favorably with current methods across a diverse collection of datasets, and that it can be used to detect and highlight unwanted technical variation in scRNA-seq datasets in situations where other methods fail. Overall, RAFSIL implements a flexible approach yielding a useful tool that improves the analysis of scRNA-seq data. Availability and Implementation The RAFSIL R package is available online at www.kostkalab.net/software.html


Author(s):  
Alexis Hardy ◽  
Mélody Matelot ◽  
Amandine Touzeau ◽  
Christophe Klopp ◽  
Céline Lopez-Roques ◽  
...  

Abstract Motivation Long-read sequencing technologies can be employed to detect and map DNA modifications at nucleotide resolution on a genome-wide scale. However, published software packages neglect the integration of genomic annotation and comprehensive filtering when analyzing patterns of modified bases detected using Pacific Biosciences (PacBio) or Oxford Nanopore Technologies (ONT) data. Here, we present DNAModAnnot, an R package designed for the global analysis of DNA modification patterns using adapted filtering and visualization tools. Results We tested our package using PacBio sequencing data to analyze patterns of 6-methyladenine (6mA) in the ciliate Paramecium tetraurelia, in which high 6mA amounts were previously reported. We found the genome-wide distribution of 6mA in Paramecium tetraurelia to be similar to that of other ciliates. We also performed 5-methylcytosine (5mC) analysis in human lymphoblastoid cells using ONT data and confirmed previously known patterns of 5mC. DNAModAnnot provides a toolbox for the genome-wide analysis of different DNA modifications using PacBio and ONT long-read sequencing data. Availability DNAModAnnot is distributed as an R package available via GitHub (https://github.com/AlexisHardy/DNAModAnnot). Supplementary information Supplementary data are available at Bioinformatics online.


PeerJ ◽  
2017 ◽  
Vol 5 ◽  
pp. e3582 ◽  
Author(s):  
Jes Johannesen ◽  
Armin G. Fabritzek ◽  
Bettina Ebner ◽  
Sven-Ernö Bikar

Phylogeographic analyses of the gall fly Urophora cardui have in earlier studies based on allozymes and mtDNA identified small-scale, parapatrically diverged populations within an expanding Western Palearctic population. However, the low polymorphism of these markers prohibited an accurate delimitation of the evolutionary origin of the parapatric divergence. Urophora cardui from the Western Palearctic have been introduced into Canada as biological control agents of the host plant Cirsium arvense. Here, we characterise 12 microsatellite loci with hexa-, penta- and tetra-nucleotide repeat motifs and report a genotyping-by-sequencing SNP protocol. We test the markers for genetic variation among three parapatric U. cardui populations. Microsatellite variability (N = 59 individuals) was high: expected heterozygosity/locus/population (0.60–0.90), allele number/locus/population (5–21). One locus was alternatively sex-linked in males or females. Cross-species amplification in the sister species U. stylata was successful or partially successful for seven loci. For genotyping-by-sequencing (N = 18 individuals), different DNA extraction methods did not affect data quality. Depending on sequence sorting criteria, 1,177–2,347 unlinked SNPs and 1,750–4,469 parsimony informative sites were found in 3,514–5,767 loci recovered after paralog filtering. Both marker systems quantified the same population partitions with high probabilities. Many and highly differentiated loci in both marker systems indicate genome-wide diversification and genetically distinct populations.


2019 ◽  
Vol 12 (S9) ◽  
Author(s):  
Rui Sun ◽  
Xiaoxuan Xia ◽  
Ka Chun Chong ◽  
Benny Chung-Ying Zee ◽  
William Ka Kei Wu ◽  
...  

Abstract Background With the increasing amount of high-throughput genomic sequencing data, there is a growing demand for a robust and flexible tool to perform interaction analysis. The identification of SNP-SNP, SNP-CpG, and higher order interactions helps explain the genetic etiology of human diseases, yet genome-wide analysis for interactions has been very challenging, due to the computational burden and a lack of statistical power in most datasets. Results The wtest R package performs association testing for main effects, pairwise and high order interactions in genome-wide association study data, and cis-regulation of SNP and CpG sites in genome-wide and epigenome-wide data. The software includes a number of post-test diagnostic and analysis functions and offers an integrated toolset for genetic epistasis testing. Conclusions The wtest is an efficient and powerful statistical tool for integrated genetic epistasis testing. The package is available in CRAN: https://CRAN.R-project.org/package=wtest.


2020 ◽  
Author(s):  
Gabriel Jimenez-Dominguez ◽  
Patrice Ravel ◽  
Stéphan Jalaguier ◽  
Vincent Cavaillès ◽  
Jacques Colinge

Abstract Modular response analysis (MRA) is a widely used modeling technique to uncover coupling strengths in molecular networks under a steady-state condition by means of perturbation experiments. We propose an extension of this methodology to search genomic data for new associations with a network modeled by MRA and to improve the predictive accuracy of MRA models. These extensions are illustrated by exploring the cross talk between estrogen and retinoic acid receptors, two nuclear receptors implicated in several hormone-driven cancers such as breast cancer. We also present a novel, rigorous and elegant mathematical derivation of MRA equations, which is the foundation of this work and of an R package that is freely available at https://github.com/bioinfo-ircm/aiMeRA/. This mathematical analysis should facilitate MRA understanding by newcomers. Author summary Estrogen and retinoic acid receptors play an important role in several hormone-driven cancers and share co-regulators and co-repressors that modulate their transcription factor activity. The literature shows evidence for crosstalk between these two receptors and suggests that spatial competition on the promoters could be a mechanism. We used MRA to explore the possibility that key co-repressors, i.e., NRIP1 (RIP140) and LCoR, could also mediate crosstalk by exploiting new quantitative (qPCR) and RNA sequencing data. The transcription factor role of the receptors and the availability of genome-wide data enabled us to extend the MRA methodology to analyze genome-wide data sets a posteriori, searching for genes associated with a molecular network that was sampled by perturbation experiments. Despite nearly two decades of use, we felt that MRA lacked a systematic mathematical derivation. We present here an elegant and rather simple analysis that should greatly facilitate newcomers’ understanding of MRA details. Moreover, an easy-to-use R package is released that should make MRA accessible to biology labs without mathematical expertise. Quantitative data are embedded in the R package and RNA sequencing data are available from GEO.
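To make "coupling strengths from perturbation experiments" concrete, here is a minimal sketch of the commonly used MRA relation between a global response matrix R (steady-state response of each module to a perturbation of each module) and the local response matrix r, namely r = -[diag(R^-1)]^-1 R^-1, with each r_ii fixed at -1. This is an illustration under that assumption. It is not the derivation presented in the paper and not the aiMeRA package API, and the function names are hypothetical.

```python
def invert(matrix):
    """Invert a square matrix by Gauss-Jordan elimination with partial pivoting."""
    n = len(matrix)
    aug = [
        [float(v) for v in row] + [1.0 if i == j else 0.0 for j in range(n)]
        for i, row in enumerate(matrix)
    ]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("matrix is singular")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        pivot_value = aug[col][col]
        aug[col] = [v / pivot_value for v in aug[col]]
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [v - factor * p for v, p in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def local_response_matrix(global_response):
    """Local response coefficients r from the global response matrix R.

    Column k of R is assumed to hold the responses of all modules to a
    perturbation that directly affects only module k.
    """
    inverse = invert(global_response)
    n = len(inverse)
    return [[-inverse[i][j] / inverse[i][i] for j in range(n)] for i in range(n)]


# Example: two modules; the values are made up for illustration.
global_response = [[0.3 / 0.9, 0.2 / 0.9], [0.06 / 0.9, 0.4 / 0.9]]
couplings = local_response_matrix(global_response)
```

The off-diagonal entries of the result are the inferred direct influences between modules. The diagonal is -1 by construction.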


2019 ◽  
Vol 9 (1) ◽  
Author(s):  
Robert P. Adelson ◽  
Alan E. Renton ◽  
Wentian Li ◽  
Nir Barzilai ◽  
Gil Atzmon ◽  
...  

Abstract The success of next-generation sequencing depends on the accuracy of variant calls. Few objective protocols exist for QC following variant calling from whole genome sequencing (WGS) data. After applying QC filtering based on Genome Analysis Toolkit (GATK) best practices, we used genotype discordance of eight samples that were sequenced twice each to evaluate the proportion of potentially inaccurate variant calls. We designed a QC pipeline involving hard filters to improve replicate genotype concordance, which indicates improved accuracy of genotype calls. Our pipeline analyzes the efficacy of each filtering step. We initially applied this strategy to well-characterized variants from the ClinVar database, and subsequently to the full WGS dataset. The genome-wide biallelic pipeline removed 82.11% of discordant and 14.89% of concordant genotypes, and improved the concordance rate from 98.53% to 99.69%. The variant-level read depth filter most improved the genome-wide biallelic concordance rate. We also adapted this pipeline for triallelic sites, given the increasing proportion of multiallelic sites as sample sizes increase. For triallelic sites containing only SNVs, the concordance rate improved from 97.68% to 99.80%. Our QC pipeline removes many potentially false positive calls that pass in GATK, and may inform future WGS studies prior to variant effect analysis.
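The concordance figures above can be checked with simple arithmetic. The sketch below assumes that the concordance rate is the fraction of replicate genotype pairs that match, and that the two removal percentages apply to the initial discordant and concordant genotypes respectively. The function names are hypothetical and this is not the authors' pipeline.

```python
def concordance_rate(calls_a, calls_b):
    """Fraction of sites where two replicate genotype calls agree.

    Sites where either replicate has no call (None) are skipped.
    """
    pairs = [(a, b) for a, b in zip(calls_a, calls_b) if a is not None and b is not None]
    if not pairs:
        raise ValueError("no sites with calls in both replicates")
    return sum(a == b for a, b in pairs) / len(pairs)


def rate_after_filtering(initial_rate, discordant_removed, concordant_removed):
    """Concordance rate after removing given fractions of each genotype class."""
    concordant = initial_rate * (1 - concordant_removed)
    discordant = (1 - initial_rate) * (1 - discordant_removed)
    return concordant / (concordant + discordant)


# Reported biallelic figures: 98.53% initial concordance, 82.11% of discordant
# and 14.89% of concordant genotypes removed.
filtered_rate = rate_after_filtering(0.9853, 0.8211, 0.1489)
print(f"{100 * filtered_rate:.2f}%")

# Toy replicate comparison with one mismatch and one missing call.
toy_rate = concordance_rate(["0/1", "1/1", "0/0", None], ["0/1", "0/1", "0/0", "0/0"])
```

Under these assumptions the filtered rate comes out at about 99.69%, in agreement with the improvement reported in the abstract.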

