Allele frequency‐free inference of close familial relationships from genotypes or low‐depth sequencing data

Ryan K. Waples; Anders Albrechtsen; Ida Moltke

doi:10.1111/mec.14954

Allele frequency-free inference of close familial relationships from genotypes or low depth sequencing data

10.1101/260497 ◽

2018 ◽

Author(s):

Ryan K Waples ◽

Anders Albrechtsen ◽

Ida Moltke

Keyword(s):

Allele Frequency ◽

Allele Frequencies ◽

Model Organisms ◽

Human Populations ◽

Genotype Data ◽

Sequencing Data ◽

Diverse Range ◽

Genomic Position ◽

Familial Relationships ◽

Similar Accuracy

Abstract: Knowledge of how individuals are related is important in many areas of research, and numerous methods for inferring pairwise relatedness from genetic data have been developed. However, the majority of these methods were not developed for situations where data is limited. Specifically, most methods rely on the availability of population allele frequencies, the relative genomic position of variants, and accurate genotype data. But in studies of non-model organisms or ancient human samples, such data is not always available. Motivated by this, we present a new method for pairwise relatedness inference, which requires neither allele frequency information nor information on genomic position. Furthermore, it can be applied both to genotype data and to low-depth sequencing data where genotypes cannot be accurately called. We evaluate it using data from SNP arrays and low-depth sequencing from a range of human populations and show that it can be used to infer close familial relationships with accuracy similar to that of a widely used method that relies on population allele frequencies. Additionally, we show that our method is robust to SNP ascertainment, which is important for application to a diverse range of populations and species.
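The abstract above does not spell out which statistics the method uses. As an illustration only of how relatedness can be estimated from a pair of genotype vectors without any population allele frequencies, the following minimal sketch computes the KING-robust kinship estimator (Manichaikul et al. 2010). The 0/1/2 genotype coding, the use of `None` for missing calls, and the function name are assumptions made for this example; it is not presented as the authors' implementation.

```python
def king_robust_kinship(genotypes_i, genotypes_j):
    """KING-robust kinship for two individuals.

    Genotypes are coded as 0, 1, 2 (copies of one allele); None marks a
    missing call (an assumed convention for this sketch). Only counts of
    joint genotypes are used, so no allele frequencies are needed.
    """
    het_het = 0   # sites where both individuals are heterozygous
    opp_hom = 0   # sites where they are opposite homozygotes (0 vs 2)
    het_i = 0     # heterozygous sites in individual i
    het_j = 0     # heterozygous sites in individual j
    for gi, gj in zip(genotypes_i, genotypes_j):
        if gi is None or gj is None:
            continue  # skip sites missing in either individual
        if gi == 1:
            het_i += 1
        if gj == 1:
            het_j += 1
        if gi == 1 and gj == 1:
            het_het += 1
        if {gi, gj} == {0, 2}:
            opp_hom += 1
    denom = het_i + het_j
    if denom == 0:
        raise ValueError("no heterozygous sites; estimator is undefined")
    return (het_het - 2 * opp_hom) / denom


if __name__ == "__main__":
    a = [0, 1, 2, 1, 1, 0]
    print(king_robust_kinship(a, a))  # identical genotype vectors
```

In a homogeneous population the estimator is expected to be about 0.5 for duplicate samples or monozygotic twins, about 0.25 for parent-offspring and full-sibling pairs, and about 0 for unrelated pairs.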

A simple method to estimate the in-house limit of detection for genetic mutations with low allele frequencies in whole-exome sequencing analysis by next-generation sequencing

BMC Genomic Data ◽

10.1186/s12863-020-00956-x ◽

2021 ◽

Vol 22 (1) ◽

Author(s):

Takumi Miura ◽

Satoshi Yasuda ◽

Yoji Sato

Keyword(s):

Next Generation Sequencing ◽

Allele Frequency ◽

Somatic Mutations ◽

Limit Of Detection ◽

Allele Frequencies ◽

Genetic Mutations ◽

Sequencing Data ◽

Simple Method ◽

Whole Exome ◽

Generation Sequencing

Abstract Background: Next-generation sequencing (NGS) has profoundly changed the approach to genetic/genomic research. Particularly, the clinical utility of NGS in detecting mutations associated with disease risk has contributed to the development of effective therapeutic strategies. Recently, comprehensive analysis of somatic genetic mutations by NGS has also been used as a new approach for controlling the quality of cell substrates for manufacturing biopharmaceuticals. However, the quality evaluation of cell substrates by NGS largely depends on the limit of detection (LOD) for rare somatic mutations. The purpose of this study was to develop a simple method for evaluating the ability of whole-exome sequencing (WES) by NGS to detect mutations with low allele frequency. To estimate the LOD of WES for low-frequency somatic mutations, we repeatedly and independently performed WES of a reference genomic DNA using the same NGS platform and assay design. LOD was defined as the allele frequency with a relative standard deviation (RSD) value of 30% and was estimated by a moving average curve of the relation between RSD and allele frequency. Results: Allele frequencies of 20 mutations in the reference material that had been pre-validated by droplet digital PCR (ddPCR) were obtained from 5, 15, 30, or 40 G base pair (Gbp) sequencing data per run. There was a significant association between the allele frequencies measured by WES and those pre-validated by ddPCR, whose p-value decreased as the sequencing data size increased. By this method, the LOD of allele frequency in WES with the sequencing data of 15 Gbp or more was estimated to be between 5 and 10%. Conclusions: For properly interpreting the WES data of somatic genetic mutations, it is necessary to have a cutoff threshold of low allele frequencies. The in-house LOD estimated by the simple method shown in this study provides a rationale for setting the cutoff.
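The LOD definition in this abstract (the allele frequency at which the RSD reaches 30%, read off a moving average curve) can be sketched roughly as follows. The input layout (one list of replicate allele frequencies per mutation), the window size, the linear interpolation between smoothed points, and all function names are assumptions made for illustration; the study's exact smoothing procedure is not given in the abstract.

```python
from statistics import mean, stdev


def rsd_percent(values):
    """Relative standard deviation (sample SD / mean) in percent."""
    return 100.0 * stdev(values) / mean(values)


def moving_average(values, window):
    """Simple unweighted moving average over consecutive points."""
    return [mean(values[i:i + window]) for i in range(len(values) - window + 1)]


def estimate_lod(replicate_afs, window=3, rsd_cutoff=30.0):
    """Estimate the allele frequency at which smoothed RSD equals the cutoff.

    replicate_afs: one list per mutation, holding that mutation's allele
    frequency (%) in each independent sequencing run.
    Returns None if the smoothed curve never crosses the cutoff.
    """
    # One (mean AF, RSD) point per mutation, ordered by allele frequency.
    points = sorted((mean(r), rsd_percent(r)) for r in replicate_afs)
    af = moving_average([p[0] for p in points], window)
    rsd = moving_average([p[1] for p in points], window)
    # Scan from high to low allele frequency for the first crossing.
    for k in range(len(af) - 1, 0, -1):
        lo_af, hi_af = af[k - 1], af[k]
        lo_rsd, hi_rsd = rsd[k - 1], rsd[k]
        if hi_rsd <= rsd_cutoff < lo_rsd:
            frac = (lo_rsd - rsd_cutoff) / (lo_rsd - hi_rsd)
            return lo_af + frac * (hi_af - lo_af)
    return None


if __name__ == "__main__":
    # Two toy mutations measured in three runs each (made-up numbers).
    print(estimate_lod([[9, 10, 11], [1, 2, 3]], window=1))
```

The toy input has no biological meaning; real use would pass the replicate measurements of all validated mutations and a window suited to how many mutations are available.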

Accurate allele frequencies from ultra-low coverage pool-seq samples in evolve-and-resequence experiments

10.1101/244004 ◽

2018 ◽

Author(s):

Susanne Tilk ◽

Alan Bergland ◽

Aaron Goodman ◽

Paul Schmidt ◽

Dmitri Petrov ◽

...

Keyword(s):

Allele Frequency ◽

Model Organism ◽

Software Tool ◽

Allele Frequencies ◽

Model Organisms ◽

Sequencing Data ◽

High Coverage ◽

Next Generation Sequencing Technology ◽

Low Coverage ◽

Pooled Samples

Abstract: Evolve-and-resequence (E+R) experiments leverage next-generation sequencing technology to track the allele frequency dynamics of populations as they evolve. While previous work has shown that adaptive alleles can be detected by comparing frequency trajectories from many replicate populations, this power comes at the expense of high-coverage (>100x) sequencing of many pooled samples, which can be cost-prohibitive. Here, we show that accurate estimates of allele frequencies can be achieved with very shallow sequencing depths (<5x) via inference of known founder haplotypes in small genomic windows. This technique can be used to efficiently estimate frequencies for any number of bi-allelic SNPs in populations of any model organism founded with sequenced homozygous strains. Using both experimentally-pooled and simulated samples of Drosophila melanogaster, we show that haplotype inference can improve allele frequency accuracy by orders of magnitude for up to 50 generations of recombination, and is robust to moderate levels of missing data, as well as different selection regimes. Finally, we show that a simple linear model generated from these simulations can predict the accuracy of haplotype-derived allele frequencies in other model organisms and experimental designs. To make these results broadly accessible for use in E+R experiments, we introduce HAF-pipe, an open-source software tool for calculating haplotype-derived allele frequencies from raw sequencing data. Ultimately, by reducing sequencing costs without sacrificing accuracy, our method facilitates E+R designs with higher replication and resolution, and thereby, increased power to detect adaptive alleles.
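The idea of a haplotype-derived allele frequency can be illustrated with a minimal sketch: once founder haplotype frequencies have been inferred for a genomic window, the frequency of an allele at any SNP in that window is the sum of the frequencies of the founders carrying it. The function name, the 0/1 allele coding, and the data layout below are assumptions for this example and do not reflect HAF-pipe's actual interface; the haplotype inference step itself is not shown.

```python
def haplotype_derived_af(founder_alleles, haplotype_freqs, tol=1e-6):
    """Allele frequencies implied by founder haplotype frequencies.

    founder_alleles: one row per SNP, each row holding the allele (0 or 1)
        carried by each homozygous founder strain at that SNP.
    haplotype_freqs: estimated frequency of each founder haplotype in the
        window; expected to sum to 1.
    """
    if abs(sum(haplotype_freqs) - 1.0) > tol:
        raise ValueError("haplotype frequencies must sum to 1")
    return [
        sum(allele * freq for allele, freq in zip(snp, haplotype_freqs))
        for snp in founder_alleles
    ]


if __name__ == "__main__":
    founders = [
        [1, 0, 0],  # SNP 1: only founder A carries the allele
        [0, 1, 1],  # SNP 2: founders B and C carry it
        [1, 1, 1],  # SNP 3: fixed among founders
    ]
    print(haplotype_derived_af(founders, [0.5, 0.3, 0.2]))
```

Because every SNP in the window shares the same haplotype frequency estimates, the number of SNPs that can be scored does not depend on the read depth at each individual site.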

SNP Calling, Genotype Calling, and Sample Allele Frequency Estimation from New-Generation Sequencing Data

PLoS ONE ◽

10.1371/journal.pone.0037558 ◽

2012 ◽

Vol 7 (7) ◽

pp. e37558 ◽

Cited By ~ 212

Author(s):

Rasmus Nielsen ◽

Thorfinn Korneliussen ◽

Anders Albrechtsen ◽

Yingrui Li ◽

Jun Wang

Keyword(s):

Allele Frequency ◽

Frequency Estimation ◽

Sequencing Data ◽

Genotype Calling ◽

Snp Calling ◽

Allele Frequency Estimation ◽

New Generation Sequencing ◽

New Generation ◽

Generation Sequencing

Efficient approaches for large scale GWAS studies with genotype uncertainty

10.1101/786384 ◽

2019 ◽

Author(s):

Emil Jørsboe ◽

Anders Albrechtsen

Keyword(s):

Population Structure ◽

Allele Frequency ◽

Statistical Power ◽

Large Scale ◽

Association Studies ◽

Genetic Data ◽

Data Sets ◽

Sequencing Data ◽

The Individual ◽

Genotype Probabilities

Abstract Introduction: Association studies using genetic data from SNP-chip based imputation or low depth sequencing data provide a cost efficient design for large scale studies. However, these approaches provide genetic data with uncertainty of the observed genotypes. Here we explore association methods that can be applied to data where the genotype is not directly observed. We investigate how using different priors when estimating genotype probabilities affects the association results in different scenarios such as studies with population structure and varying depth sequencing data. We also suggest a method (ANGSD-asso) that is computationally feasible for analysing large scale low depth sequencing data sets, such as can be generated by non-invasive prenatal testing (NIPT) with low-pass sequencing. Methods: ANGSD-asso’s EM model works by modelling the unobserved genotype as a latent variable in a generalised linear model framework. The software is implemented in C/C++ and can be run multi-threaded, enabling the analysis of big data sets. ANGSD-asso is based on genotype probabilities, which can be estimated in various ways, such as using the sample allele frequency as a prior, using the individual allele frequencies as a prior, or using haplotype frequencies from haplotype imputation. Using simulations of sequencing data, we explore how genotype probability based methods compare to using genetic dosages in large association studies with genotype uncertainty. Results & Discussion: Our simulations show that in a structured population using the individual allele frequency prior has better power than the sample allele frequency prior. If there is a correlation between genotype uncertainty and phenotype, then the individual allele frequency prior also helps control the false positive rate. In the absence of population structure the sample allele frequency prior and the individual allele frequency prior perform similarly. In scenarios with sequencing depth and phenotype correlation ANGSD-asso’s EM model has better statistical power and less bias compared to using dosages. Lastly, when adding additional covariates to the linear model ANGSD-asso’s EM model has more statistical power and provides less biased effect sizes than other methods that accommodate genotype uncertainty, while also being much faster. This makes it possible to properly account for genotype uncertainty in large scale association studies.
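To make the role of an allele frequency prior concrete, the following minimal sketch turns genotype likelihoods into genotype probabilities and then into a dosage. It assumes a Hardy-Weinberg prior built from a single allele frequency (which could be a sample-level or an individual-specific frequency); that assumption, the function names, and the input layout are illustrative and are not taken from ANGSD-asso's code. The EM association model itself is not shown.

```python
def genotype_posteriors(likelihoods, alt_freq):
    """Posterior genotype probabilities from likelihoods and a frequency prior.

    likelihoods: P(reads | genotype) for genotypes carrying 0, 1 and 2
        copies of the alternative allele.
    alt_freq: allele frequency used to build the prior (assumed here to
        follow Hardy-Weinberg proportions).
    """
    p = alt_freq
    prior = [(1 - p) ** 2, 2 * p * (1 - p), p ** 2]
    unnormalised = [lik * pr for lik, pr in zip(likelihoods, prior)]
    total = sum(unnormalised)
    if total == 0:
        raise ValueError("likelihoods and prior give zero total probability")
    return [u / total for u in unnormalised]


def dosage(posteriors):
    """Expected number of alternative alleles under the posterior."""
    return posteriors[1] + 2 * posteriors[2]


if __name__ == "__main__":
    # Uninformative reads: the posterior simply equals the prior.
    post = genotype_posteriors([1.0, 1.0, 1.0], alt_freq=0.5)
    print(post, dosage(post))
```

The dosage collapses the three probabilities into one number, whereas a latent-variable model of the kind described in the abstract keeps all three, which is where the two approaches can differ when sequencing depth varies.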

The theory and practice of measuring broad-range recombination rate from marker selected pools

10.1101/762575 ◽

2019 ◽

Author(s):

Kevin H.-C. Wei ◽

Aditya Mantha ◽

Doris Bachtrog

Keyword(s):

Genetic Distance ◽

Allele Frequency ◽

Recombination Rate ◽

High Throughput Sequencing ◽

Genetic Material ◽

Cost Effective ◽

Theory And Practice ◽

Rate Variation ◽

Sequencing Data ◽

Genome Wide

Abstract: Recombination is the exchange of genetic material between homologous chromosomes via physical crossovers. Pioneered by T. H. Morgan and A. Sturtevant over a century ago, methods to estimate recombination rate and genetic distance require scoring large numbers of recombinant individuals between molecular or visible markers. While high throughput sequencing methods have allowed for genome wide crossover detection producing high resolution maps, such methods rely on large numbers of individually sequenced recombinants and are therefore difficult to scale. Here, we present a simple and scalable method to infer near chromosome-wide recombination rate from marker selected pools and the corresponding analytical software MarSuPial. Rather than genotyping individuals from recombinant backcrosses, we bulk sequence marker selected pools to infer the allele frequency decay around the selected locus; since the number of recombinant individuals increases proportionally to the genetic distance from the selected locus, the allele frequency across the chromosome can be used to estimate the genetic distance and recombination rate. We mathematically demonstrate the relationship between allele frequency attenuation, recombinant fraction, genetic distance, and recombination rate in marker selected pools. Based on available chromosome-wide recombination rate models of Drosophila, we simulated read counts and determined that nonlinear local regressions (LOESS) produce robust estimates despite the high noise inherent to sequencing data. To empirically validate this approach, we show that (single) marker selected pools closely recapitulate genetic distances inferred from scoring recombinants between double markers. We theoretically determine how secondary loci with viability impacts can modulate the allele frequency decay and how to account for such effects directly from the data. We generated the recombinant maps of three wild derived strains, which strongly correlate with previous genome-wide measurements. Interestingly, amidst extensive recombination rate variation, multiple regions of the genomes show elevated rates across all strains. Lastly, we apply this method to estimate chromosome-wide crossover interference. Altogether, we find that sequencing marker selected pools is a simple and cost effective method for broad recombination rate estimates. Although it does not identify instances of crossovers, it can generate near chromosome-wide recombination maps in as little as one or two libraries.

Regarding the F-word: the effects of data Filtering on inferred genotype-environment associations

10.1101/2020.09.08.288308 ◽

2020 ◽

Cited By ~ 1

Author(s):

Collin W Ahrens ◽

Rebecca Jordan ◽

Jason Bragg ◽

Peter A Harrison ◽

Tara Hopley ◽

...

Keyword(s):

Missing Data ◽

Allele Frequency ◽

Minor Allele Frequency ◽

Management Strategies ◽

Minor Allele ◽

Sequencing Data ◽

System Availability ◽

Landscape Genomics ◽

Genotype By Sequencing ◽

Better Than

Abstract: Genotype-environment association (GEA) methods have become part of the standard landscape genomics toolkit, yet we know little about how to filter genotype-by-sequencing data to provide robust inferences for environmental adaptation. In many cases, default filtering thresholds for minor allele frequency and missing data are applied regardless of sample size, having unknown impacts on the results. These effects could be amplified in downstream predictions, including management strategies. Here, we investigate the effects of filtering on GEA results and the potential implications for adaptation to environment. We use empirical and simulated datasets derived from two widespread tree species to assess the effects of filtering on GEA outputs. Critically, we find that the level of filtering of missing data and minor allele frequency affects the identification of true positives. Even slight adjustments to these thresholds can change the rate of true positive detection. Using conservative thresholds for missing data and minor allele frequency substantially reduces the size of the dataset, lessening the power to detect adaptive variants (i.e. simulated true positives) under both strong and weak selection. Regardless, strength of selection was a good predictor for GEA detection, but even SNPs under strong selection went undetected. We further show that filtering can significantly impact the predictions of adaptive capacity of species in downstream analyses. We make several recommendations regarding filtering for GEA methods. Ultimately, there is no filtering panacea, but some choices are better than others, depending largely on the study system, availability of genomic resources, and desired objectives of the study.
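The two filters discussed in this abstract, minor allele frequency and missing data, can be sketched as a single pass over a genotype matrix. The threshold values, the 0/1/2 genotype coding with `None` for missing calls, and the function name are hypothetical choices made for illustration; the abstract itself argues that no single threshold suits every study.

```python
def filter_snps(genotypes, min_maf=0.05, max_missing=0.2):
    """Return indices of SNPs passing MAF and missing-data filters.

    genotypes: one row per SNP, one entry per individual, coded 0/1/2
        (copies of the alternative allele) or None for a missing call.
    min_maf, max_missing: illustrative thresholds, not recommendations.
    """
    kept = []
    for idx, row in enumerate(genotypes):
        called = [g for g in row if g is not None]
        missing_rate = 1 - len(called) / len(row)
        if not called or missing_rate > max_missing:
            continue  # too many missing calls at this SNP
        alt_freq = sum(called) / (2 * len(called))
        maf = min(alt_freq, 1 - alt_freq)
        if maf >= min_maf:
            kept.append(idx)
    return kept


if __name__ == "__main__":
    data = [
        [0, 1, 2, 1],        # common variant, fully called
        [0, 0, 0, 0],        # monomorphic
        [None, None, 1, 2],  # half of the calls missing
        [0, 0, 0, 1],        # rare variant
    ]
    # Changing min_maf alone changes which SNPs survive.
    print(filter_snps(data, min_maf=0.05, max_missing=0.25))
    print(filter_snps(data, min_maf=0.20, max_missing=0.25))
```

Running the filter at two thresholds on the same data shows how a modest change in the MAF cutoff removes the rare variant from the retained set.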

Evaluation of Allele Frequency Estimation Using Pooled Sequencing Data Simulation

The Scientific World JOURNAL ◽

10.1155/2013/895496 ◽

2013 ◽

Vol 2013 ◽

pp. 1-9 ◽

Cited By ~ 9

Author(s):

Yan Guo ◽

David C. Samuels ◽

Jiang Li ◽

Travis Clark ◽

Chung-I Li ◽

...

Keyword(s):

Simulation Model ◽

Allele Frequency ◽

Minor Allele Frequency ◽

Error Rate ◽

Association Studies ◽

Minor Allele ◽

Pool Size ◽

Average Depth ◽

Sequencing Data ◽

Pooled Sequencing

Next-generation sequencing (NGS) technology has provided researchers with opportunities to study the genome in unprecedented detail. In particular, NGS is applied to disease association studies. Unlike genotyping chips, NGS is not limited to a fixed set of SNPs. Prices for NGS are now comparable to the SNP chip, although for large studies the cost can be substantial. Pooling techniques are often used to reduce the overall cost of large-scale studies. In this study, we designed a rigorous simulation model to test the practicability of estimating allele frequency from pooled sequencing data. We took crucial factors into consideration, including pool size, overall depth, average depth per sample, pooling variation, and sampling variation. We used real data to demonstrate and measure reference allele preference in DNAseq data and implemented this bias in our simulation model. We found that pooled sequencing data can introduce high levels of relative error rate (defined as error rate divided by targeted allele frequency) and that the error rate is more severe for low minor allele frequency SNPs than for high minor allele frequency SNPs. In order to overcome the error introduced by pooling, we recommend a large pool size and high average depth per sample.
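The relative error rate defined in this abstract (error divided by the targeted allele frequency) can be illustrated with a deliberately simplified simulation. The sketch below models only two sources of noise, sampling of individuals into the pool and sampling of reads from the pool; it omits the pooling variation and reference allele bias that the study also modelled. All function names and parameter values are assumptions for illustration.

```python
import random


def relative_error(estimated_af, target_af):
    """Absolute error in allele frequency divided by the targeted frequency."""
    return abs(estimated_af - target_af) / target_af


def simulate_pooled_af(target_af, pool_size, depth, rng):
    """Simulate one pooled-sequencing estimate of an allele frequency.

    pool_size: number of diploid individuals in the pool.
    depth: total number of reads covering the site.
    rng: a random.Random instance, passed in for reproducibility.
    """
    # Sampling variation: draw the chromosomes that end up in the pool.
    n_chrom = 2 * pool_size
    alt_chrom = sum(rng.random() < target_af for _ in range(n_chrom))
    pool_af = alt_chrom / n_chrom
    # Sequencing: each read samples one chromosome from the pool.
    alt_reads = sum(rng.random() < pool_af for _ in range(depth))
    return alt_reads / depth


if __name__ == "__main__":
    rng = random.Random(1)
    for target in (0.01, 0.05, 0.25):
        errors = [
            relative_error(simulate_pooled_af(target, 50, 1000, rng), target)
            for _ in range(200)
        ]
        print(target, sum(errors) / len(errors))
```

Averaging the relative error over repeated simulations at several target frequencies gives a rough feel for how error scales with allele frequency under these simplified assumptions.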

unCOVERApp: an interactive graphical application for clinical assessment of sequence coverage at the base-pair level

10.1101/2020.02.10.939769 ◽

2020 ◽

Author(s):

Emanuela Iovino ◽

Marco Seri ◽

Tommaso Pippucci

Keyword(s):

Allele Frequency ◽

Clinical Assessment ◽

Base Pair ◽

Quality Parameters ◽

Sequencing Data ◽

Sequence Coverage ◽

Bioinformatic Tools ◽

Clinical Annotation ◽

Clinical Consequences ◽

Pair Level

Abstract Motivation: Next Generation Sequencing (NGS) is increasingly adopted in clinical practice, largely thanks to concurrent advancements in bioinformatic tools for variant detection and annotation. Despite improvements in available approaches, the need to assess sequencing quality down to the base-pair level still poses challenges for diagnostic accuracy. One of the most popular quality parameters of diagnostic NGS is the percentage of targeted bases characterized by low depth of coverage (DoC). These regions potentially hide a clinically-relevant variant, but no annotation is usually returned for them. However, visualizing low-DoC data with their potential functional and clinical consequences may be useful to prioritize inspection of specific regions before re-sequencing all coverage gaps or making assertions about completeness of the diagnostic test. To meet this need we have developed unCOVERApp, an interactive application for graphical inspection and clinical annotation of low-DoC genomic regions containing genes. Results: unCOVERApp is a suite of graphical and statistical tools to support clinical assessment of low-DoC regions. Its interactive plots allow users to display gene sequence coverage down to the base-pair level, and functional and clinical annotations of sites below a user-defined DoC threshold can be downloaded in a user-friendly spreadsheet format. Moreover, unCOVERApp provides a simple statistical framework to evaluate if DoC is sufficient for the detection of somatic variants, where the usual 20x DoC threshold used for germline variants is not adequate. A maximum credible allele frequency calculator is also available, allowing users to set allele frequency cut-offs based on assumptions about the genetic architecture of the disease instead of applying a general one (e.g. 5%). In conclusion, unCOVERApp is an original tool designed to identify sites of potential clinical interest that may be hidden in diagnostic sequencing data. Availability: unCOVERApp is a freely available application written in R and developed with Shiny packages and available in GitHub.
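One simple way to ask whether a given depth of coverage is sufficient for detecting a somatic variant is a binomial calculation: the probability of seeing at least a minimum number of variant-supporting reads, given the depth and the expected variant allele fraction. The sketch below assumes a binomial model with no sequencing error; the abstract does not state which model the tool uses, and the function name and thresholds here are hypothetical.

```python
from math import comb


def prob_detect(depth, allele_fraction, min_alt_reads):
    """P(X >= min_alt_reads) for X ~ Binomial(depth, allele_fraction).

    depth: number of reads covering the site.
    allele_fraction: expected fraction of reads carrying the variant.
    min_alt_reads: variant-supporting reads required to call the variant
        (an illustrative calling rule).
    """
    p_below = sum(
        comb(depth, k)
        * allele_fraction ** k
        * (1 - allele_fraction) ** (depth - k)
        for k in range(min_alt_reads)
    )
    return 1 - p_below


if __name__ == "__main__":
    # A 10% allele fraction at 20x versus 100x, requiring 3 variant reads.
    for depth in (20, 100):
        print(depth, prob_detect(depth, 0.10, 3))
```

Comparing the two depths shows why a threshold that is adequate for germline heterozygous variants (expected fraction near 50%) can be too shallow for low-fraction somatic variants.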

Conifer: Clonal Tree Inference for Tumor Heterogeneity With Single-cell and Bulk Sequencing Data

10.21203/rs.3.rs-263502/v1 ◽

2021 ◽

Author(s):

Leila Baghaarabani ◽

Sama Goliaei ◽

Mohammad-Hadi Foroughmand-Araabi ◽

Seyed Peyman Shariatpanahi ◽

Bahram Goliaei

Keyword(s):

Single Cell ◽

Allele Frequency ◽

Tumor Heterogeneity ◽

Variant Allele ◽

Evolutionary Relationships ◽

Sequencing Data ◽

Variant Allele Frequency ◽

Similar Frequency ◽

Single Cell Sequencing ◽

Tree Inference

Abstract Background: An important and effective step in cancer treatment is understanding the clonal evolution of cancer tumors. Clones are cell populations with different genotypes, resulting from the differences in the somatic mutations that occur and accumulate during cancer development. An appropriate approach for better understanding a tumor population is determining the variant allele frequency with which the mutation occurs in the entire population. Bulk sequencing data can be used to provide that information, but the frequencies are not informative enough in identifying different clones and their evolutionary relationships. On the other hand, single-cell sequencing data provides valuable information about branching events in the evolution of a cancerous tumor. However, in the single-cell sequencing data, the total population of sequenced cells is naturally much smaller than bulk sequencing so it is not precise enough for calculating cell prevalence. Result: In this study, a new method called Conifer (ClONal tree Inference For hEterogeneity of tumoR) is proposed which combines aggregated variant allele frequency from bulk sequencing data with branch evolution information from single-cell sequencing data, in order to better understand clones and their evolutionary relationships. It is proven that the accuracy of clone identification is increased by using Conifer compared to other existing methods in both real and simulated data. Also, it is shown that the approach of Conifer in using single-cell sequencing data together with bulk sequencing data has reduced the possibility of cloning mutations with similar frequency but belonging to different clones. Conclusions: In this study, we provided an accurate and robust method to identify clones of tumor heterogeneity and their evolutionary history by combining single-cell and bulk sequencing data.