Estimating error models for whole genome sequencing using mixtures of Dirichlet-multinomial distributions

2015 ◽  
Author(s):  
Steven H Wu ◽  
Rachel S Schwartz ◽  
David J Winter ◽  
Don Conrad ◽  
Reed A Cartwright

Motivation: Accurate identification of genotypes is critical in identifying de novo mutations, linking mutations with disease, and determining mutation rates. Because de novo mutations are rare, even low levels of genotyping error can cause a large fraction of false-positive de novo mutations. Biological and technical processes that adversely affect genotyping include copy-number variation, paralogous sequences, library preparation, sequencing error, and reference-mapping biases, among others. Results: We modeled the read depth for all data as a mixture of Dirichlet-multinomial distributions, resulting in significant improvements over previously used models. In most cases the best model comprised two distributions. The major-component distribution is similar to a binomial distribution with low error and low reference bias. The minor-component distribution is overdispersed with higher error and reference bias. We also found that sites fitting the minor component are enriched for copy number variants and low-complexity regions. We expect that this approach to modeling the distribution of NGS data will lead to improved genotyping. For example, this approach provides an expected distribution of reads that can be incorporated into a model to estimate de novo mutations using reads across a pedigree.
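
As a rough illustration of the model class named in this abstract, the sketch below evaluates the log-probability of one site's read counts under a mixture of Dirichlet-multinomial components. It is not the authors' implementation: the function names, the example weights, and the concentration parameters are assumptions chosen for illustration, and fitting the parameters (for example by EM) is left out.

```python
import math

def dm_log_pmf(counts, alpha):
    """Log-probability of read counts under one Dirichlet-multinomial.

    counts: reads per allele at one site, e.g. (ref, alt).
    alpha:  positive concentration parameters, one per allele.
    """
    n = sum(counts)
    a = sum(alpha)
    out = math.lgamma(n + 1) + math.lgamma(a) - math.lgamma(n + a)
    for x, al in zip(counts, alpha):
        out += math.lgamma(x + al) - math.lgamma(x + 1) - math.lgamma(al)
    return out

def mixture_log_likelihood(counts, weights, alphas):
    """Log-likelihood of one site under a mixture of components.

    weights: mixing proportions (assumed to sum to 1).
    alphas:  one concentration vector per component.
    """
    terms = [math.log(w) + dm_log_pmf(counts, al)
             for w, al in zip(weights, alphas)]
    m = max(terms)  # log-sum-exp for numerical stability
    return m + math.log(sum(math.exp(t - m) for t in terms))

# Hypothetical two-component mixture: a tight, near-binomial major
# component and an overdispersed, reference-biased minor component.
weights = [0.9, 0.1]
alphas = [(500.0, 500.0), (3.0, 2.0)]
site_ll = mixture_log_likelihood((14, 16), weights, alphas)
```

Large concentration parameters make a component behave almost like a binomial, while small ones produce the overdispersion the abstract attributes to the minor component.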

2016 ◽  
Author(s):  
Vladimir B. Seplyarskiy ◽  
Maria A. Andrianova ◽  
Georgii A. Bazykin

APOBEC3A/B cytidine deaminase is responsible for the majority of cancerous mutations in a large fraction of cancer samples. However, its role in heritable mutagenesis remains very poorly understood. Recent studies have demonstrated that, both in yeast and in human cancerous cells, most APOBEC3A/B-induced mutations occur on the lagging strand during replication. Here, we use data on rare human polymorphisms, interspecies divergence, and de novo mutations to study germline mutagenesis, and analyze mutations at nucleotide contexts prone to attack by APOBEC3A/B. We show that such mutations occur preferentially on the lagging strand. Moreover, we demonstrate that APOBEC3A/B-like mutations tend to produce strand-coordinated clusters, which are also biased towards the lagging strand. Finally, we show that the mutation rate is increased 3’ of C→G mutations to a greater extent than 3’ of C→T mutations, suggesting pervasive translesion bypass of the APOBEC3A/B-induced damage. Our study demonstrates that 20% of C→T and C→G mutations segregating as polymorphisms in the human population are attributable to APOBEC3A/B activity.


2016 ◽  
Author(s):  
Haeyoung Jeong ◽  
Jae-Goo Pan ◽  
Seung-Hwan Park

The nonhybrid hierarchical assembly of PacBio long reads is becoming the most preferred method for obtaining genomes for microbial isolates. On the other hand, among the massive numbers of Illumina sequencing reads produced, there is a slim chance of re-evaluating failed microbial genome assemblies (high contig number, large total contig size, and/or the presence of low-depth contigs). We generated Illumina-type test datasets with various levels of sequencing error, pretreatment (trimming and error correction), repetitive sequences, contamination, and ploidy from both simulated and real sequencing data and applied k-mer abundance analysis to quickly detect possible diagnostic signatures of poor assemblies. Contamination was the only factor leading to poor assemblies for the test dataset derived from haploid microbial genomes, resulting in an extraordinary peak within the low-frequency k-mer range. When thirteen Illumina sequencing reads of microbes belonging to the genera Bacillus or Paenibacillus from a single multiplexed run were subjected to a k-mer abundance analysis, all three samples leading to poor assemblies showed peculiar patterns of contamination. Read depth distribution along the contig length indicated that all problematic assemblies suffered from too many contigs with low average read coverage, where 1% to 15% of total reads were mapped to low-coverage contigs. We found that subsampling or filtering out reads having rare k-mers could efficiently remove low-level contaminants and greatly improve the de novo assemblies. An analysis of 16S rRNA genes recruited from reads or contigs and the application of read classification tools originally designed for metagenome analyses can help identify the source of a contamination.
The unexpected presence of proteobacterial reads across multiple samples, which had no relevance to our lab environment, implies that such prevalent contamination might have occurred after the DNA preparation step, probably at the place where sequencing service was provided.
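
The k-mer abundance analysis and rare-k-mer filtering described in this abstract can be sketched in a few lines. This is a toy illustration, not the authors' pipeline: the function names, the tiny reads, and the `min_count` threshold are assumptions, and real tools would also merge each k-mer with its reverse complement and handle far larger inputs.

```python
from collections import Counter

def count_kmers(reads, k):
    """Count every k-mer occurrence across all reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts

def abundance_spectrum(kmer_counts):
    """Map abundance -> number of distinct k-mers seen that many times."""
    return Counter(kmer_counts.values())

def drop_reads_with_rare_kmers(reads, kmer_counts, k, min_count):
    """Keep only reads whose k-mers all occur at least min_count times."""
    kept = []
    for read in reads:
        kmers = [read[i:i + k] for i in range(len(read) - k + 1)]
        if kmers and all(kmer_counts[km] >= min_count for km in kmers):
            kept.append(read)
    return kept

# Hypothetical reads: two from the target genome, one low-level contaminant.
reads = ["ACGTAC", "ACGTAC", "TTTTGG"]
counts = count_kmers(reads, k=4)
spectrum = abundance_spectrum(counts)
clean = drop_reads_with_rare_kmers(reads, counts, k=4, min_count=2)
```

In a spectrum like this, a contaminant shows up as extra distinct k-mers at low abundance, which is the kind of signature the abstract reports in the low-frequency range.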


Sequencing ◽  
2010 ◽  
Vol 2010 ◽  
pp. 1-12 ◽  
Author(s):  
Alexander J. Nederbragt ◽  
Trine Ballestad Rounge ◽  
Kyrre L. Kausrud ◽  
Kjetill S. Jakobsen

Contigs assembled from 454 reads from bacterial genomes demonstrate a range of read depths, with a number of contigs having a depth that is far higher than can be expected. For reference genome sequence datasets, there is a high correlation between the contig-specific read depth and the number of copies present in the genome. We developed a sequence of applied statistical analyses, which suggest that the number of copies present can be reliably estimated from the read depth distribution in de novo genome assemblies. Read depths of contigs of de novo cyanobacterial genome assemblies were determined, and several high-read-depth contigs were identified. These contigs were shown to mainly contain genes that are known to be present in multiple copies in bacterial genomes. For these assemblies, a correlation between read depth and copy number was experimentally demonstrated using real-time PCR. Copy number estimates, obtained using the statistical analysis developed in this work, are presented. Per-contig read depth analysis of assemblies based on 454 reads therefore enables de novo detection of genomic repeats and estimation of the copy number of these repeats. Additionally, our analysis efficiently identified contigs stemming from sample contamination, allowing for their removal from the assembly.
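
The relationship between per-contig read depth and copy number can be illustrated with a deliberately naive estimator: divide each contig's depth by the median depth, taken here as the single-copy baseline. The abstract does not spell out its statistical procedure, so this ratio is a stand-in rather than the published method, and the contig names and depths are hypothetical.

```python
from statistics import median

def estimate_copy_numbers(contig_depths):
    """Estimate copy number as depth relative to the median contig depth.

    contig_depths: mapping of contig name -> mean read depth.
    Assumes most contigs are single-copy, so the median approximates
    the depth of a single-copy sequence.
    """
    baseline = median(contig_depths.values())
    return {name: depth / baseline for name, depth in contig_depths.items()}

# Hypothetical depths: four ordinary contigs and one collapsed repeat.
depths = {"c1": 20, "c2": 22, "c3": 18, "c4": 60, "c5": 21}
copies = estimate_copy_numbers(depths)
repeat_copies = round(copies["c4"])
```

Contigs with a ratio far below 1 are also worth inspecting, since the abstract notes that depth analysis flagged contigs arising from contamination.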


2016 ◽  
Author(s):  
Wenhan Chen ◽  
Alan J. Robertson ◽  
Devika Ganesamoorthy ◽  
Lachlan J.M. Coin

Accurate identification of copy number alterations is an essential step in understanding the events driving tumor progression. While a variety of algorithms have been developed to use high-throughput sequencing data to profile copy number changes, no tool is able to reliably characterize ploidy and genotype absolute copy number from tumor samples that contain less than 40% tumor cells. To increase our power to resolve the copy number profile from low-cellularity tumor samples, we developed a novel approach which pre-phases heterozygous germline SNPs in order to replace the commonly used ‘B-allele frequency’ with a more powerful ‘parental-haplotype frequency’. We apply our tool, sCNAphase, to characterize the copy number and loss-of-heterozygosity profiles of four publicly available breast cancer cell lines. Comparisons to previous spectral karyotyping and microarray studies revealed that sCNAphase reliably identified overall ploidy as well as the individual copy number mutations from each cell line. Analysis of artificial cell-line mixtures demonstrated the capacity of this method to determine the level of tumor cellularity, consistently identify sCNAs, and characterize ploidy in samples with as little as 10% tumor cells. This novel methodology has the potential to bring sCNA profiling to low-cellularity tumors, a form of cancer that cannot be accurately studied by current methods.
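
To make the contrast between the two statistics concrete, the sketch below computes a per-SNP B-allele frequency and a pooled frequency for one parental haplotype across phased SNPs. It only illustrates the general idea of pooling reads by phase and is not sCNAphase's algorithm; the tuple layout, function names, and read counts are assumptions.

```python
def b_allele_frequency(ref_reads, alt_reads):
    """Fraction of reads at one SNP that carry the alternate allele."""
    return alt_reads / (ref_reads + alt_reads)

def parental_haplotype_frequency(snps):
    """Pooled fraction of reads assigned to haplotype A over phased SNPs.

    snps: iterable of (ref_reads, alt_reads, alt_on_hap_a), where
    alt_on_hap_a says whether the alternate allele lies on haplotype A.
    """
    hap_a = 0
    total = 0
    for ref_reads, alt_reads, alt_on_hap_a in snps:
        hap_a += alt_reads if alt_on_hap_a else ref_reads
        total += ref_reads + alt_reads
    return hap_a / total

# Hypothetical phased SNPs within one genomic segment.
segment = [(6, 4, True), (7, 3, False)]
bafs = [b_allele_frequency(r, a) for r, a, _ in segment]
phf = parental_haplotype_frequency(segment)
```

Pooling reads across phased SNPs gives one estimate backed by many more reads than any single SNP provides, which is one plausible reading of why the abstract calls the haplotype frequency more powerful.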


eLife ◽  
2017 ◽  
Vol 6 ◽  
Author(s):  
Chungang Feng ◽  
Mats Pettersson ◽  
Sangeet Lamichhaney ◽  
Carl-Johan Rubin ◽  
Nima Rafati ◽  
...  

The Atlantic herring is one of the most abundant vertebrates on earth but its nucleotide diversity is moderate (π = 0.3%), only three-fold higher than in humans. Here, we present a pedigree-based estimation of the mutation rate in this species. Based on whole-genome sequencing of four parents and 12 offspring, the estimated mutation rate is 2.0 × 10⁻⁹ per base per generation. We observed a high degree of parental mosaicism, indicating that a large fraction of these de novo mutations occurred during early germ cell development. The estimated mutation rate, the lowest among vertebrates analyzed to date, partially explains the discrepancy between the rather low nucleotide diversity in herring and its huge census population size. But a species like the herring will never reach its expected nucleotide diversity because of fluctuations in population size over the millions of years it takes to build up high nucleotide diversity.
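
A pedigree-based rate of this kind is commonly computed by dividing the number of de novo mutations by the number of haploid genome transmissions surveyed. The sketch below shows that arithmetic only. The abstract does not report the mutation count or the number of callable sites, so the inputs here are hypothetical values chosen for easy checking, not the herring data, and corrections for false negatives are ignored.

```python
def per_generation_mutation_rate(n_de_novo, n_offspring, callable_sites):
    """Mutations per base per generation from a pedigree.

    n_de_novo:      de novo mutations found across all offspring.
    n_offspring:    number of offspring sequenced.
    callable_sites: bases per haploid genome where calls are reliable.
    The factor 2 counts both transmitted haploid genomes per offspring.
    """
    if n_offspring <= 0 or callable_sites <= 0:
        raise ValueError("n_offspring and callable_sites must be positive")
    return n_de_novo / (2 * n_offspring * callable_sites)

# Hypothetical inputs, not taken from the study.
rate = per_generation_mutation_rate(
    n_de_novo=10, n_offspring=5, callable_sites=1_000_000_000
)
```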


2017 ◽  
Author(s):  
Hákon Jónsson ◽  
Patrick Sulem ◽  
Gudny A. Arnadottir ◽  
Gunnar Pálsson ◽  
Hannes P. Eggertsson ◽  
...  

De novo mutations (DNMs) cause a large fraction of severe rare diseases of childhood. DNMs that occur in early embryos may result in mosaicism of both somatic and germ cells. Such early mutations may be transmitted to more than one offspring and cause recurrence of serious disease. We scanned 1,007 sibling pairs from 251 families and identified 885 DNMs shared by siblings (ssDNMs) at 451 genomic sites. We estimated the probability of DNM recurrence based on presence in the blood of the parent, sharing by other siblings, parent-of-origin, mutation type, and genomic position. We detected 52.1% of ssDNMs in the parental blood. The probability of a DNM being shared goes down by 2.28% per year for paternal DNMs and 1.82% for maternal DNMs. Shared paternal DNMs are more likely to be T>C mutations than maternal ones, but less likely to be C>T mutations. Depending on DNM properties, the probability of recurrence in a younger sibling ranges from 0.013% to 29.6%. We have launched an online DNM recurrence probability calculator for use in genetic counselling in cases of rare genetic diseases.

