hapCon: Estimating contamination of ancient genomes by copying from reference haplotypes

2021 ◽  
Author(s):  
Yilei Huang ◽  
Harald Ringbauer

Human ancient DNA (aDNA) studies have surged in recent years, revolutionizing the study of the human past. Typically, aDNA is poorly preserved, making such data prone to contamination from other human DNA. Therefore, it is important to rule out substantial contamination before proceeding to downstream analysis. As most aDNA samples can only be sequenced to low coverage (<1x average depth), computational methods are needed that can robustly estimate contamination in the low-coverage regime. However, the ultra-low-coverage regime (0.1x and below) remains challenging for existing approaches. We present a new method to estimate contamination in aDNA for male individuals. It utilizes a Li & Stephens haplotype copying model for the haploid X chromosome, with mismatches modelled as genotyping error or contamination. We assessed an implementation of this new approach, hapCon, on simulated and down-sampled empirical aDNA data. Our results demonstrate that hapCon outperforms a commonly used tool for estimating male X contamination (ANGSD), with substantially lower variance and narrower confidence intervals, especially in the low-coverage regime. We found that hapCon provides useful contamination estimates for coverages as low as 0.1x for SNP capture data (1240k) and 0.02x for whole-genome sequencing (WGS) data, substantially extending the coverage limit of previous male X chromosome-based contamination estimation methods.
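The copying-model idea above can be reduced to a per-site emission term that mixes an endogenous read (copied from a reference haplotype, up to genotyping error) with a contaminant read drawn from population allele frequencies. The Python sketch below is illustrative only; hapCon's actual likelihood and parameterization may differ.

```python
# Minimal sketch of a contamination-aware emission term in a
# Li & Stephens-style copying model (illustrative; not hapCon's code).
def emission_prob(read_allele, copied_allele, contam_allele_freq, c, eps):
    """P(observed read allele | copied haplotype allele), mixing:
      - with prob (1 - c): read comes from the endogenous genome,
        matching the copied reference allele up to error rate eps;
      - with prob c: read comes from a contaminant, drawn from the
        contaminating population's allele frequency.
    read_allele, copied_allele in {0, 1}; contam_allele_freq is the
    frequency of allele 1; c is the contamination fraction.
    """
    p_endog = (1 - eps) if read_allele == copied_allele else eps
    p_contam = contam_allele_freq if read_allele == 1 else 1 - contam_allele_freq
    return (1 - c) * p_endog + c * p_contam
```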

2021 ◽  
Author(s):  
Zhen Wang ◽  
Zhenyang Zhang ◽  
Zitao Chen ◽  
Jiabao Sun ◽  
Caiyun Cao ◽  
...  

Pigs not only function as a major meat source worldwide but also are commonly used as an animal model for studying human complex traits. Large haplotype reference panels have been used to facilitate efficient phasing and imputation of relatively sparse genome-wide microarray chips and low-coverage sequencing data. Using imputed genotypes in downstream analyses, such as GWAS, TWAS, eQTL mapping and genomic prediction (GS), is beneficial for obtaining novel findings. However, there is still a lack of publicly available, high-quality pig reference panels with large sample sizes and high diversity, which greatly limits the application of genotype imputation in pigs. In response, we built the pig Haplotype Reference Panel (PHARP) database. PHARP provides a reference panel of 2,012 pig haplotypes at 34 million SNPs, constructed using whole-genome sequence data from more than 49 studies of 71 pig breeds. It also provides Web-based analytical tools that allow researchers to carry out phasing and imputation consistently and efficiently. PHARP is freely accessible at http://alphaindex.zju.edu.cn/PHARP/index.php. We demonstrate its applicability to commercial pig 50K SNP arrays by accurately imputing 2.6 billion genotypes at a concordance rate of 0.971 in 81 Large White pigs (~ 17x sequencing coverage). We also applied our reference panel to impute low-density SNP chip data to higher density for three GWASs and found novel significantly associated SNPs that might be causal variants.
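The concordance rate reported above is, in essence, the fraction of imputed genotypes that match held-out truth genotypes. A generic sketch of that computation (not PHARP's evaluation code; the 0/1/2 dosage encoding and -1 missing convention are assumptions):

```python
import numpy as np

# Illustrative concordance-rate computation between imputed and truth
# genotype calls (0/1/2 alt-allele dosages; -1 marks a missing call).
def concordance_rate(imputed: np.ndarray, truth: np.ndarray) -> float:
    mask = (imputed >= 0) & (truth >= 0)   # compare only non-missing calls
    return float(np.mean(imputed[mask] == truth[mask]))
```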


2015 ◽  
Author(s):  
Rudy Arthur ◽  
Jared O'Connell ◽  
Ole Schulz-Trieglaff ◽  
Anthony J Cox

Whole-genome low-coverage sequencing has been combined with linkage-disequilibrium (LD) based genotype refinement to accurately and cost-effectively infer genotypes in large cohorts of individuals. Most genotype refinement methods are based on hidden Markov models, which are accurate but computationally expensive. We introduce an algorithm that models LD using a simple multivariate Gaussian distribution. The key feature of our algorithm is its speed: it is hundreds of times faster than other methods on the same data set, and its scaling behaviour is linear in the number of samples. We demonstrate the performance of the method on both low-coverage and high-coverage samples.
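A minimal sketch of the multivariate-Gaussian idea: within an LD window, genotype dosages across samples are modelled as jointly normal, and a noisy dosage at one site is refined to its conditional expectation given the other sites. This is illustrative only, not the authors' implementation; the ridge term and window handling are assumptions.

```python
import numpy as np

# Toy LD-based refinement with a multivariate Gaussian: estimate the
# mean and covariance of dosages in one LD window, then replace a
# sample's dosage at site j by its conditional mean given the rest.
def refine_site(dosages: np.ndarray, j: int, sample: np.ndarray) -> float:
    """dosages: (n_samples, n_sites) training dosages for one window;
    sample: one individual's noisy dosages over the same sites."""
    mu = dosages.mean(axis=0)
    cov = np.cov(dosages, rowvar=False) + 1e-6 * np.eye(dosages.shape[1])
    rest = [k for k in range(dosages.shape[1]) if k != j]
    s_jr = cov[j, rest]                       # cross-covariance, site j vs rest
    s_rr = cov[np.ix_(rest, rest)]            # covariance among remaining sites
    return float(mu[j] + s_jr @ np.linalg.solve(s_rr, sample[rest] - mu[rest]))
```

The speed advantage described above comes from replacing per-sample HMM forward-backward passes with closed-form linear algebra of this kind.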


Author(s):  
Alicia R. Martin ◽  
Elizabeth G. Atkinson ◽  
Sinéad B. Chapman ◽  
Anne Stevenson ◽  
Rocky E. Stroud ◽  
...  

Abstract Background Genetic studies of biomedical phenotypes in underrepresented populations identify disproportionate numbers of novel associations. However, current genomics infrastructure, including most genotyping arrays and sequenced reference panels, best serves populations of European descent. A critical step for facilitating genetic studies in underrepresented populations is to ensure that genetic technologies accurately capture variation in all populations. Here, we quantify the accuracy of low-coverage sequencing in diverse African populations. Results We sequenced the whole genomes of 91 individuals to high coverage (≥20X) from the Neuropsychiatric Genetics of African Population-Psychosis (NeuroGAP-Psychosis) study, in which participants were recruited from Ethiopia, Kenya, South Africa, and Uganda. We empirically tested two data generation strategies, GWAS arrays versus low-coverage sequencing, by calculating the concordance of imputed variants from these technologies with those from deep whole-genome sequencing data. We show that low-coverage sequencing at a depth of ≥4X captures variants of all frequencies more accurately than all commonly used GWAS arrays investigated, and at a comparable cost. Lower depths of sequencing (0.5-1X) performed comparably to commonly used low-density GWAS arrays. Low-coverage sequencing is also sensitive to novel variation, with 4X sequencing detecting 45% of singletons and 95% of common variants identified in high-coverage African whole genomes. Conclusions These results indicate that low-coverage sequencing approaches surmount the problems induced by the ascertainment of common genotyping arrays, including those that capture variation most common in Europeans and Africans. Low-coverage sequencing effectively identifies novel variation (particularly in underrepresented populations) and presents opportunities to enhance variant discovery at a similar cost to traditional approaches.
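The array-versus-sequencing comparison above rests on concordance stratified by allele frequency. A hypothetical sketch (the MAF bin edges are illustrative, not those used in the study):

```python
import numpy as np

# Frequency-stratified concordance between imputed and deep-WGS
# genotypes: concordance is computed separately within MAF bins, so
# rare and common variants can be compared across technologies.
def concordance_by_maf(imputed, truth, maf, bins=(0.0, 0.01, 0.05, 0.5)):
    """imputed, truth: (n_samples, n_sites) genotype matrices;
    maf: per-site minor allele frequency; returns {bin: concordance}."""
    out = {}
    for lo, hi in zip(bins[:-1], bins[1:]):
        sites = (maf > lo) & (maf <= hi)
        out[(lo, hi)] = (float(np.mean(imputed[:, sites] == truth[:, sites]))
                         if sites.any() else float("nan"))
    return out
```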


BMC Genomics ◽  
2021 ◽  
Vol 22 (1) ◽  
Author(s):  
Johannes Smolander ◽  
Sofia Khan ◽  
Kalaimathy Singaravelu ◽  
Leni Kauko ◽  
Riikka J. Lund ◽  
...  

Abstract Background Detection of copy number variations (CNVs) from high-throughput next-generation whole-genome sequencing (WGS) data has become a widely used research method in recent years. However, little is known about the applicability of the developed algorithms to ultra-low-coverage (0.0005–0.8×) data, which is used in various research and clinical applications, such as digital karyotyping and single-cell CNV detection. Results Here, the performance of six popular read-depth based CNV detection algorithms (BIC-seq2, Canvas, CNVnator, FREEC, HMMcopy, and QDNAseq) was studied using ultra-low-coverage WGS data. Real-world array- and karyotyping-kit-based validation was used as a benchmark in the evaluation. Additionally, ultra-low-coverage WGS data was simulated to investigate the ability of the algorithms to identify CNVs in the sex chromosomes and the theoretical minimum coverage at which these tools can accurately function. Our results suggest that while all the methods were able to detect large CNVs, many methods were susceptible to producing false positives when detecting smaller CNVs (< 2 Mbp). There was also significant variability in their ability to identify CNVs in the sex chromosomes. Overall, BIC-seq2 was found to be the best method in terms of statistical performance. However, its significant drawback was its runtime, by far the slowest among the methods (> 3 h), compared with FREEC (~ 3 min), which we considered the second-best method. Conclusions Our comparative analysis demonstrates that CNV detection from ultra-low-coverage WGS data can be highly accurate for detecting large copy number variations whose length is in the millions of base pairs. These findings facilitate applications that utilize ultra-low-coverage CNV detection.
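The read-depth principle shared by the six benchmarked tools can be reduced to a toy sketch: read counts per fixed-size genomic bin are normalized to the sample median (≈1.0 for two copies) and deviations are flagged as gains or losses. Real tools add GC and mappability correction plus segmentation; the thresholds below are illustrative assumptions, not any tool's defaults.

```python
import numpy as np

# Bare-bones read-depth CNV caller (none of the six benchmarked tools):
# normalize per-bin counts to the sample median and flag departures.
def call_bins(bin_counts: np.ndarray, gain=1.25, loss=0.75):
    norm = bin_counts / np.median(bin_counts)   # ~1.0 at two copies
    calls = np.full(norm.shape, "neutral", dtype=object)
    calls[norm >= gain] = "gain"                # tentative duplication
    calls[norm <= loss] = "loss"                # tentative deletion
    return norm, calls
```

At ultra-low coverage the per-bin counts are small, so only events spanning many bins (i.e., large CNVs) rise above the Poisson noise, consistent with the findings above.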


2017 ◽  
Author(s):  
A.N. Blackburn ◽  
M.Z. Kos ◽  
N.B. Blackburn ◽  
J.M. Peralta ◽  
P. Stevens ◽  
...  

Abstract Phasing, the process of predicting haplotypes from genotype data, is an important undertaking in genetics and an ongoing area of research. Phasing methods, and associated software, designed specifically for pedigrees are urgently needed. Here we present a new method for phasing genotypes from whole-genome sequencing data in pedigrees: PULSAR (Phasing Using Lineage Specific Alleles / Rare variants). The method is built upon the idea that alleles that are specific to a single founding chromosome within a pedigree, which we refer to as lineage-specific alleles, are highly informative for identifying haplotypes that are identical-by-descent between individuals within a pedigree. Through extensive simulation we assess the performance of PULSAR in a variety of pedigree sizes and structures, and we explore the effects of genotyping errors and the presence of non-sequenced individuals on its performance. If the genotyping error rate is sufficiently low, PULSAR can phase > 99.9% of heterozygous genotypes with a switch error rate below 1 × 10⁻⁴ in pedigrees where all individuals are sequenced. We demonstrate that the method is highly accurate and consistently outperforms the long-range phasing approach used for comparison in our benchmarking. The method also holds promise for fixing genotype errors or imputing missing genotypes. The software implementation of this method is freely available.
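The lineage-specific-allele idea can be illustrated with a toy sketch (not PULSAR itself): an alternate allele carried heterozygously by exactly one pedigree founder tags a single founding haplotype, so its carriers among descendants share that haplotype identical-by-descent.

```python
# Toy identification of lineage-specific alleles among pedigree founders
# (illustrative; PULSAR's actual criteria and data structures may differ).
def lineage_specific_sites(founder_genotypes):
    """founder_genotypes: dict founder_id -> list of 0/1/2 alt-allele
    counts per site. Returns indices of sites where exactly one founder
    carries the alternate allele, heterozygously, so that the allele
    tags a single founding chromosome."""
    n_sites = len(next(iter(founder_genotypes.values())))
    sites = []
    for j in range(n_sites):
        carriers = [f for f, g in founder_genotypes.items() if g[j] >= 1]
        if len(carriers) == 1 and founder_genotypes[carriers[0]][j] == 1:
            sites.append(j)
    return sites
```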


2017 ◽  
Author(s):  
Nagarajan Paramasivam ◽  
Martin Granzow ◽  
Christina Evers ◽  
Katrin Hinderhofer ◽  
Stefan Wiemann ◽  
...  

Abstract With genome sequencing entering the clinic as a diagnostic tool for studying genetic disorders, there is an increasing need for bioinformatics solutions that enable precise causal variant identification in a timely manner. Background Workflows for the identification of candidate disease-causing variants usually perform the following tasks: i) identification of variants; ii) filtering of variants to remove polymorphisms and technical artifacts; and iii) prioritization of the remaining variants to provide a small set of candidates for further analysis. Methods Here, we present a pipeline designed to identify variants and to prioritize the variants and genes from trio sequencing or pedigree-based sequencing data into different tiers. Results We show how this pipeline was applied in a study of patients with neurodevelopmental disorders of unknown cause, where it helped to identify the causal variants in more than 35% of the cases. Conclusions Classification and prioritization of variants into different tiers helps to select a small set of variants for downstream analysis.
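A hypothetical tiering sketch in the spirit of the filter-then-prioritize workflow described above; the tier names, thresholds, and impact categories are illustrative assumptions, not the authors' scheme.

```python
# Illustrative variant tiering: rare, protein-altering variants that fit
# the expected inheritance pattern rank highest; common or low-impact
# variants fall to lower tiers. Thresholds are hypothetical.
def assign_tier(pop_af, impact, fits_inheritance):
    """pop_af: population allele frequency; impact: 'HIGH' / 'MODERATE' /
    'LOW' (VEP-style categories); fits_inheritance: bool from the
    trio/pedigree segregation check."""
    if pop_af < 0.001 and impact == "HIGH" and fits_inheritance:
        return "tier1"
    if pop_af < 0.01 and impact in ("HIGH", "MODERATE"):
        return "tier2"
    return "tier3"
```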

