hapCon: Estimating contamination of ancient genomes by copying from reference haplotypes

2021 ◽  
Author(s):  
Yilei Huang ◽  
Harald Ringbauer

Human ancient DNA (aDNA) studies have surged in recent years, revolutionizing the study of the human past. Typically, aDNA is poorly preserved, making such data prone to contamination from other human DNA. Therefore, it is important to rule out substantial contamination before proceeding to downstream analysis. As most aDNA samples can only be sequenced to low coverage (<1x average depth), computational methods are needed that can robustly estimate contamination in the low-coverage regime. However, the ultra-low-coverage regime (0.1x and below) remains challenging for existing approaches. We present a new method to estimate contamination in aDNA for male individuals. It utilizes a Li & Stephens haplotype copying model for the haploid X chromosome, with mismatches modelled as genotyping error or contamination. We assessed an implementation of this new approach, hapCon, on simulated and down-sampled empirical aDNA data. Our results demonstrate that hapCon outperforms a commonly used tool for estimating male X contamination (ANGSD), with substantially lower variance and narrower confidence intervals, especially in the low-coverage regime. We found that hapCon provides useful contamination estimates for coverages as low as 0.1x for SNP capture data (1240k) and 0.02x for whole-genome sequencing (WGS) data, substantially extending the coverage limit of previous male X chromosome-based contamination estimation methods.
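The copying-model idea above can be reduced to a per-site emission term that mixes an endogenous read (copied from a reference haplotype, up to genotyping error) with a contaminant read drawn from population allele frequencies. The Python sketch below is illustrative only; hapCon's actual likelihood and parameterization may differ.

```python
# Minimal sketch of a contamination-aware emission term in a
# Li & Stephens-style copying model (illustrative; not hapCon's code).
def emission_prob(read_allele, copied_allele, contam_allele_freq, c, eps):
    """P(observed read allele | copied haplotype allele), mixing:
      - with prob (1 - c): read comes from the endogenous genome,
        matching the copied reference allele up to error rate eps;
      - with prob c: read comes from a contaminant, drawn from the
        contaminating population's allele frequency.
    read_allele, copied_allele in {0, 1}; contam_allele_freq is the
    frequency of allele 1; c is the contamination fraction.
    """
    p_endog = (1 - eps) if read_allele == copied_allele else eps
    p_contam = contam_allele_freq if read_allele == 1 else 1 - contam_allele_freq
    return (1 - c) * p_endog + c * p_contam
```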

2021 ◽  
Author(s):  
Zhen Wang ◽  
Zhenyang Zhang ◽  
Zitao Chen ◽  
Jiabao Sun ◽  
Caiyun Cao ◽  
...  

Pigs not only function as a major meat source worldwide but also are commonly used as an animal model for studying human complex traits. Large haplotype reference panels have been used to facilitate efficient phasing and imputation of relatively sparse genome-wide microarray chips and low-coverage sequencing data. Using imputed genotypes in downstream analyses, such as GWAS, TWAS, eQTL mapping and genomic prediction (GS), is beneficial for obtaining novel findings. However, there is still a lack of publicly available, high-quality pig reference panels with large sample sizes and high diversity, which greatly limits the application of genotype imputation in pigs. In response, we built the pig Haplotype Reference Panel (PHARP) database. PHARP provides a reference panel of 2,012 pig haplotypes at 34 million SNPs, constructed using whole-genome sequence data from more than 49 studies of 71 pig breeds. It also provides Web-based analytical tools that allow researchers to carry out phasing and imputation consistently and efficiently. PHARP is freely accessible at http://alphaindex.zju.edu.cn/PHARP/index.php. We demonstrate its applicability to commercial pig 50K SNP arrays by accurately imputing 2.6 billion genotypes at a concordance rate of 0.971 in 81 Large White pigs (~ 17x sequencing coverage). We also applied our reference panel to impute low-density SNP chip data to higher density for three GWASs and found novel significantly associated SNPs that might be causal variants.
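The concordance rate reported above is, in essence, the fraction of imputed genotypes that match held-out truth genotypes. A generic sketch of that computation (not PHARP's evaluation code; the 0/1/2 dosage encoding and -1 missing convention are assumptions):

```python
import numpy as np

# Illustrative concordance-rate computation between imputed and truth
# genotype calls (0/1/2 alt-allele dosages; -1 marks a missing call).
def concordance_rate(imputed: np.ndarray, truth: np.ndarray) -> float:
    mask = (imputed >= 0) & (truth >= 0)   # compare only non-missing calls
    return float(np.mean(imputed[mask] == truth[mask]))
```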


2015 ◽  
Author(s):  
Rudy Arthur ◽  
Jared O'Connell ◽  
Ole Schulz-Trieglaff ◽  
Anthony J Cox

Whole-genome low-coverage sequencing has been combined with linkage-disequilibrium (LD) based genotype refinement to accurately and cost-effectively infer genotypes in large cohorts of individuals. Most genotype refinement methods are based on hidden Markov models, which are accurate but computationally expensive. We introduce an algorithm that models LD using a simple multivariate Gaussian distribution. The key feature of our algorithm is its speed: it is hundreds of times faster than other methods on the same data set, and its scaling behaviour is linear in the number of samples. We demonstrate the performance of the method on both low-coverage and high-coverage samples.
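A minimal sketch of the multivariate-Gaussian idea: within an LD window, genotype dosages across samples are modelled as jointly normal, and a noisy dosage at one site is refined to its conditional expectation given the other sites. This is illustrative only, not the authors' implementation; the ridge term and window handling are assumptions.

```python
import numpy as np

# Toy LD-based refinement with a multivariate Gaussian: estimate the
# mean and covariance of dosages in one LD window, then replace a
# sample's dosage at site j by its conditional mean given the rest.
def refine_site(dosages: np.ndarray, j: int, sample: np.ndarray) -> float:
    """dosages: (n_samples, n_sites) training dosages for one window;
    sample: one individual's noisy dosages over the same sites."""
    mu = dosages.mean(axis=0)
    cov = np.cov(dosages, rowvar=False) + 1e-6 * np.eye(dosages.shape[1])
    rest = [k for k in range(dosages.shape[1]) if k != j]
    s_jr = cov[j, rest]                       # cross-covariance, site j vs rest
    s_rr = cov[np.ix_(rest, rest)]            # covariance among remaining sites
    return float(mu[j] + s_jr @ np.linalg.solve(s_rr, sample[rest] - mu[rest]))
```

The speed advantage described above comes from replacing per-sample HMM forward-backward passes with closed-form linear algebra of this kind.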


Author(s):  
Alicia R. Martin ◽  
Elizabeth G. Atkinson ◽  
Sinéad B. Chapman ◽  
Anne Stevenson ◽  
Rocky E. Stroud ◽  
...  

Abstract Background Genetic studies of biomedical phenotypes in underrepresented populations identify disproportionate numbers of novel associations. However, current genomics infrastructure, including most genotyping arrays and sequenced reference panels, best serves populations of European descent. A critical step for facilitating genetic studies in underrepresented populations is to ensure that genetic technologies accurately capture variation in all populations. Here, we quantify the accuracy of low-coverage sequencing in diverse African populations. Results We sequenced the whole genomes of 91 individuals to high coverage (≥20X) from the Neuropsychiatric Genetics of African Population-Psychosis (NeuroGAP-Psychosis) study, in which participants were recruited from Ethiopia, Kenya, South Africa, and Uganda. We empirically tested two data generation strategies, GWAS arrays versus low-coverage sequencing, by calculating the concordance of imputed variants from these technologies with those from deep whole-genome sequencing data. We show that low-coverage sequencing at a depth of ≥4X captures variants of all frequencies more accurately than all commonly used GWAS arrays investigated, and at a comparable cost. Lower depths of sequencing (0.5-1X) performed comparably to commonly used low-density GWAS arrays. Low-coverage sequencing is also sensitive to novel variation, with 4X sequencing detecting 45% of singletons and 95% of common variants identified in high-coverage African whole genomes. Conclusions These results indicate that low-coverage sequencing approaches surmount the problems induced by the ascertainment of common genotyping arrays, including those that capture variation most common in Europeans and Africans. Low-coverage sequencing effectively identifies novel variation (particularly in underrepresented populations) and presents opportunities to enhance variant discovery at a similar cost to traditional approaches.
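The array-versus-sequencing comparison above rests on concordance stratified by allele frequency. A hypothetical sketch (the MAF bin edges are illustrative, not those used in the study):

```python
import numpy as np

# Frequency-stratified concordance between imputed and deep-WGS
# genotypes: concordance is computed separately within MAF bins, so
# rare and common variants can be compared across technologies.
def concordance_by_maf(imputed, truth, maf, bins=(0.0, 0.01, 0.05, 0.5)):
    """imputed, truth: (n_samples, n_sites) genotype matrices;
    maf: per-site minor allele frequency; returns {bin: concordance}."""
    out = {}
    for lo, hi in zip(bins[:-1], bins[1:]):
        sites = (maf > lo) & (maf <= hi)
        out[(lo, hi)] = (float(np.mean(imputed[:, sites] == truth[:, sites]))
                         if sites.any() else float("nan"))
    return out
```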


BMC Genomics ◽  
2021 ◽  
Vol 22 (1) ◽  
Author(s):  
Johannes Smolander ◽  
Sofia Khan ◽  
Kalaimathy Singaravelu ◽  
Leni Kauko ◽  
Riikka J. Lund ◽  
...  

Abstract Background Detection of copy number variations (CNVs) from high-throughput next-generation whole-genome sequencing (WGS) data has become a widely used research method in recent years. However, little is known about the applicability of the developed algorithms to ultra-low-coverage (0.0005–0.8×) data, which is used in various research and clinical applications, such as digital karyotyping and single-cell CNV detection. Results Here, the performance of six popular read-depth based CNV detection algorithms (BIC-seq2, Canvas, CNVnator, FREEC, HMMcopy, and QDNAseq) was studied using ultra-low-coverage WGS data. Real-world array- and karyotyping-kit-based validation was used as a benchmark in the evaluation. Additionally, ultra-low-coverage WGS data was simulated to investigate the ability of the algorithms to identify CNVs in the sex chromosomes and the theoretical minimum coverage at which these tools can accurately function. Our results suggest that while all the methods were able to detect large CNVs, many methods were susceptible to producing false positives when detecting smaller CNVs (< 2 Mbp). There was also significant variability in their ability to identify CNVs in the sex chromosomes. Overall, BIC-seq2 was found to be the best method in terms of statistical performance. However, its significant drawback was its runtime, by far the slowest among the methods (> 3 h), compared with FREEC (~ 3 min), which we considered the second-best method. Conclusions Our comparative analysis demonstrates that CNV detection from ultra-low-coverage WGS data can be highly accurate for detecting large copy number variations whose length is in the millions of base pairs. These findings facilitate applications that utilize ultra-low-coverage CNV detection.
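The read-depth principle shared by the six benchmarked tools can be reduced to a toy sketch: read counts per fixed-size genomic bin are normalized to the sample median (≈1.0 for two copies) and deviations are flagged as gains or losses. Real tools add GC and mappability correction plus segmentation; the thresholds below are illustrative assumptions, not any tool's defaults.

```python
import numpy as np

# Bare-bones read-depth CNV caller (none of the six benchmarked tools):
# normalize per-bin counts to the sample median and flag departures.
def call_bins(bin_counts: np.ndarray, gain=1.25, loss=0.75):
    norm = bin_counts / np.median(bin_counts)   # ~1.0 at two copies
    calls = np.full(norm.shape, "neutral", dtype=object)
    calls[norm >= gain] = "gain"                # tentative duplication
    calls[norm <= loss] = "loss"                # tentative deletion
    return norm, calls
```

At ultra-low coverage the per-bin counts are small, so only events spanning many bins (i.e., large CNVs) rise above the Poisson noise, consistent with the findings above.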


2017 ◽  
Author(s):  
A.N. Blackburn ◽  
M.Z. Kos ◽  
N.B. Blackburn ◽  
J.M. Peralta ◽  
P. Stevens ◽  
...  

Abstract Phasing, the process of predicting haplotypes from genotype data, is an important undertaking in genetics and an ongoing area of research. Phasing methods, and associated software, designed specifically for pedigrees are urgently needed. Here we present a new method for phasing genotypes from whole-genome sequencing data in pedigrees: PULSAR (Phasing Using Lineage Specific Alleles / Rare variants). The method is built upon the idea that alleles that are specific to a single founding chromosome within a pedigree, which we refer to as lineage-specific alleles, are highly informative for identifying haplotypes that are identical-by-descent between individuals within a pedigree. Through extensive simulation we assess the performance of PULSAR in a variety of pedigree sizes and structures, and we explore the effects of genotyping errors and the presence of non-sequenced individuals on its performance. If the genotyping error rate is sufficiently low, PULSAR can phase > 99.9% of heterozygous genotypes with a switch error rate below 1 × 10⁻⁴ in pedigrees where all individuals are sequenced. We demonstrate that the method is highly accurate and consistently outperforms the long-range phasing approach used for comparison in our benchmarking. The method also holds promise for fixing genotype errors or imputing missing genotypes. The software implementation of this method is freely available.
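The lineage-specific-allele idea can be illustrated with a toy sketch (not PULSAR itself): an alternate allele carried heterozygously by exactly one pedigree founder tags a single founding haplotype, so its carriers among descendants share that haplotype identical-by-descent.

```python
# Toy identification of lineage-specific alleles among pedigree founders
# (illustrative; PULSAR's actual criteria and data structures may differ).
def lineage_specific_sites(founder_genotypes):
    """founder_genotypes: dict founder_id -> list of 0/1/2 alt-allele
    counts per site. Returns indices of sites where exactly one founder
    carries the alternate allele, heterozygously, so that the allele
    tags a single founding chromosome."""
    n_sites = len(next(iter(founder_genotypes.values())))
    sites = []
    for j in range(n_sites):
        carriers = [f for f, g in founder_genotypes.items() if g[j] >= 1]
        if len(carriers) == 1 and founder_genotypes[carriers[0]][j] == 1:
            sites.append(j)
    return sites
```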


2017 ◽  
Author(s):  
Nagarajan Paramasivam ◽  
Martin Granzow ◽  
Christina Evers ◽  
Katrin Hinderhofer ◽  
Stefan Wiemann ◽  
...  

Abstract With genome sequencing entering the clinic as a diagnostic tool for studying genetic disorders, there is an increasing need for bioinformatics solutions that enable precise causal variant identification in a timely manner. Background Workflows for the identification of candidate disease-causing variants usually perform the following tasks: i) identification of variants; ii) filtering of variants to remove polymorphisms and technical artifacts; and iii) prioritization of the remaining variants to provide a small set of candidates for further analysis. Methods Here, we present a pipeline designed to identify variants and to prioritize the variants and genes from trio sequencing or pedigree-based sequencing data into different tiers. Results We show how this pipeline was applied in a study of patients with neurodevelopmental disorders of unknown cause, where it helped to identify the causal variants in more than 35% of the cases. Conclusions Classification and prioritization of variants into different tiers helps to select a small set of variants for downstream analysis.
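A hypothetical tiering sketch in the spirit of the filter-then-prioritize workflow described above; the tier names, thresholds, and impact categories are illustrative assumptions, not the authors' scheme.

```python
# Illustrative variant tiering: rare, protein-altering variants that fit
# the expected inheritance pattern rank highest; common or low-impact
# variants fall to lower tiers. Thresholds are hypothetical.
def assign_tier(pop_af, impact, fits_inheritance):
    """pop_af: population allele frequency; impact: 'HIGH' / 'MODERATE' /
    'LOW' (VEP-style categories); fits_inheritance: bool from the
    trio/pedigree segregation check."""
    if pop_af < 0.001 and impact == "HIGH" and fits_inheritance:
        return "tier1"
    if pop_af < 0.01 and impact in ("HIGH", "MODERATE"):
        return "tier2"
    return "tier3"
```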

