ANALYSIS OF CONTEXT-DEPENDENT ERRORS FOR ILLUMINA SEQUENCING

The new generation of short-read sequencing technologies requires reliable measures of data quality. Such measures are especially important for variant calling. In the particular case of SNP calling, however, a great number of false-positive SNPs (FP-SNPs) may be obtained, so one needs to distinguish putative SNPs from sequencing or other errors. We found that distinguishing an FP-SNP depends not only on the probability of a sequencing error (i.e. the quality value) but also on the conditional probability of "correcting" this error (the "second best call" probability, conditional on that of the first call). Surprisingly, around 80% of mismatches can be "corrected" with this second call. Another way to reduce the rate of FP-SNPs is to identify DNA motifs that seem to be prone to sequencing errors and to attach a corresponding conditional quality value to these motifs. We have developed several measures to distinguish between sequencing errors and candidate SNPs, based on a base call's nucleotide context and its mismatch type. In addition, we suggest a simple method to correct the majority of mismatches, based on the conditional probability of their "second best" intensity call. We attach to each mismatch a corresponding second-call confidence (quality value) of being corrected.
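
As a rough illustration of the second-call idea, the sketch below converts a Phred quality value to an error probability and estimates a conditional second-call probability from four channel intensities. The intensity-share formula, the function names and the `min_cond` threshold are assumptions made for this example; the abstract does not specify how the conditional probability is actually computed.

```python
def phred_to_error_prob(q: float) -> float:
    """Convert a Phred quality value Q to an error probability, 10^(-Q/10)."""
    return 10 ** (-q / 10)


def second_call(intensities: dict) -> tuple:
    """Return (best base, second-best base, P(second | best is wrong)).

    Assumption: the conditional probability is approximated by the
    second-best intensity's share of all non-best intensity.
    """
    ranked = sorted(intensities, key=intensities.get, reverse=True)
    best, second = ranked[0], ranked[1]
    rest = sum(intensities[base] for base in ranked[1:])
    cond = intensities[second] / rest if rest > 0 else 0.0
    return best, second, cond


def correct_mismatch(ref_base: str, intensities: dict, min_cond: float = 0.5) -> tuple:
    """Replace a mismatching best call by the second call when the second
    call agrees with the reference and its conditional probability is high
    enough (hypothetical threshold). Returns (chosen base, conditional prob)."""
    best, second, cond = second_call(intensities)
    if best != ref_base and second == ref_base and cond >= min_cond:
        return second, cond
    return best, cond


# Example: the best call "A" mismatches the reference "C",
# but the second-best call "C" carries most of the remaining signal.
signal = {"A": 900.0, "C": 600.0, "G": 50.0, "T": 50.0}
corrected_base, confidence = correct_mismatch("C", signal)
```

Under this assumption, a mismatch whose second call matches the reference is treated as a likely sequencing error rather than a candidate SNP; a real implementation would calibrate the conditional probability against observed data.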

PB-Motif—A Method for Identifying Gene/Pseudogene Rearrangements With Long Reads: An Application to CYP21A2 Genotyping

Frontiers in Genetics ◽ 10.3389/fgene.2021.716586 ◽ 2021 ◽ Vol 12

Author(s): Zachary Stephens ◽ Dragana Milosevic ◽ Benjamin Kipp ◽ Stefan Grebe ◽ Ravishankar K. Iyer ◽ ...

Keyword(s): Phase Variation ◽ Variant Calling ◽ Error Rates ◽ Clinical Samples ◽ Sequencing Error ◽ Carrier Status ◽ Sequencing Errors ◽ Sequencing Technologies ◽ Long Reads ◽ Genomic Regions

Long-read sequencing technologies have the potential to accurately detect and phase variation in genomic regions that are difficult to fully characterize with conventional short-read methods. These difficult-to-sequence regions include several clinically relevant genes with highly homologous pseudogenes, many of which are prone to gene conversions or other types of complex structural rearrangements. We present PB-Motif, a new method for identifying rearrangements between two highly homologous genomic regions using PacBio long reads. PB-Motif leverages clustering and filtering techniques to efficiently report rearrangements in the presence of sequencing errors and other systematic artifacts. Supporting reads for each high-confidence rearrangement can then be used for copy number estimation and phased variant calling. First, we demonstrate PB-Motif's accuracy on simulated sequence rearrangements of PMS2 and its pseudogene PMS2CL, using simulated reads that sweep over a range of sequencing error rates. We then apply PB-Motif to 26 clinical samples, characterizing CYP21A2 and its pseudogene CYP21A1P as part of a diagnostic assay for congenital adrenal hyperplasia. We successfully identify damaging variation and patient carrier status concordant with the clinical diagnosis obtained from multiplex ligation-dependent probe amplification (MLPA) and Sanger sequencing. The source code is available at: github.com/zstephens/pb-motif.

Increased yields of duplex sequencing data by a series of quality control tools

NAR Genomics and Bioinformatics ◽ 10.1093/nargab/lqab002 ◽ 2021 ◽ Vol 3 (1)

Author(s): Gundula Povysil ◽ Monika Heinzl ◽ Renato Salazar ◽ Nicholas Stoler ◽ Anton Nekrutenko ◽ ...

Keyword(s): Low Frequency ◽ Variant Calling ◽ Data Loss ◽ Sequencing Data ◽ Bioinformatics Pipeline ◽ Consensus Sequences ◽ Sequencing Errors ◽ Data Output ◽ Reverse Strand ◽ Duplex Sequencing

Duplex sequencing is currently the most reliable method to identify ultra-low frequency DNA variants by grouping sequence reads derived from the same DNA molecule into families with information on the forward and reverse strand. However, only a small proportion of reads are assembled into duplex consensus sequences (DCS), and reads with potentially valuable information are discarded at different steps of the bioinformatics pipeline, especially reads without a family. We developed a bioinformatics toolset that analyses the tag and family composition in order to understand data loss and implement modifications that maximize the data output for variant calling. Specifically, our tools show that tags contain polymerase chain reaction and sequencing errors that contribute to data loss and lower DCS yields. Our tools also identified chimeras, which likely reflect barcode collisions. Finally, we developed a tool that re-examines variant calls from raw reads and provides summary data that categorize the confidence level of a variant call using a tier-based system. With this tool, we can include reads without a family and check the reliability of the call, which substantially increases the sequencing depth for variant calling, a particularly important advantage for low-input samples or low-coverage regions.
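
The family-grouping step described above can be sketched roughly as follows. This is a minimal illustration, not the authors' toolset: the `(tag, strand, sequence)` read format, the `min_family_size` threshold, the masking of strand disagreements with `N`, and the assumption that reverse-strand reads have already been brought into the forward orientation are all choices made for this example.

```python
from collections import Counter, defaultdict


def consensus(seqs):
    """Column-wise majority vote over equal-length sequences."""
    return "".join(Counter(col).most_common(1)[0][0] for col in zip(*seqs))


def duplex_consensus(reads, min_family_size=2):
    """Group reads by tag, build one consensus per strand, then combine.

    reads: iterable of (tag, strand, sequence), strand in {"fwd", "rev"}.
    Returns (dcs, discarded): duplex consensus per tag, and the tags whose
    families are too small on either strand (the source of data loss).
    """
    families = defaultdict(lambda: {"fwd": [], "rev": []})
    for tag, strand, seq in reads:
        families[tag][strand].append(seq)

    dcs, discarded = {}, []
    for tag, fam in families.items():
        if min(len(fam["fwd"]), len(fam["rev"])) < min_family_size:
            discarded.append(tag)
            continue
        fwd, rev = consensus(fam["fwd"]), consensus(fam["rev"])
        # Keep a base only where both strand consensuses agree.
        dcs[tag] = "".join(a if a == b else "N" for a, b in zip(fwd, rev))
    return dcs, discarded


reads = [
    ("AAT", "fwd", "ACGT"), ("AAT", "fwd", "ACGT"), ("AAT", "fwd", "ACGA"),
    ("AAT", "rev", "ACGT"), ("AAT", "rev", "ACGT"),
    ("GGA", "fwd", "ACGT"), ("GGA", "fwd", "ACGT"),
    ("GGA", "rev", "ACTT"), ("GGA", "rev", "ACTT"),
    ("CCG", "fwd", "TTTT"),  # read without a complete family
]
dcs, discarded = duplex_consensus(reads)
```

In this toy input the read tagged `CCG` is discarded because it has no complete family, which is the kind of loss the abstract describes.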

Hapo-G, haplotype-aware polishing of genome assemblies with accurate reads

NAR Genomics and Bioinformatics ◽ 10.1093/nargab/lqab034 ◽ 2021 ◽ Vol 3 (2)

Author(s): Jean-Marc Aury ◽ Benjamin Istace

Keyword(s): Single Molecule ◽ Direct Consequence ◽ High Quality ◽ Sequencing Errors ◽ Coding Regions ◽ Sequencing Technologies ◽ Long Reads ◽ Oxford Nanopore ◽ Long Read ◽ Genome Assemblies

Single-molecule sequencing technologies have recently been commercialized by Pacific Biosciences and Oxford Nanopore with the promise of sequencing long DNA fragments (on the order of kilobases to megabases) and then, using efficient algorithms, providing high-quality assemblies in terms of contiguity and completeness of repetitive regions. However, the error rate of long-read technologies is higher than that of short-read technologies. This has a direct consequence on the base quality of genome assemblies, particularly in coding regions, where sequencing errors can disrupt the coding frame of genes. In the case of diploid genomes, the consensus of a given gene can be a mixture of the two haplotypes, which can lead to premature stop codons. Several methods have been developed to polish genome assemblies using short reads; generally, they inspect nucleotides one by one and provide a correction for each nucleotide of the input assembly. As a result, these algorithms are not able to properly process diploid genomes, and they typically switch from one haplotype to another. Herein we propose Hapo-G (Haplotype-Aware Polishing Of Genomes), a new algorithm capable of incorporating phasing information from high-quality reads (short or long) to polish genome assemblies, in particular assemblies of diploid and heterozygous genomes.
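
To make the point about disrupted coding frames concrete, the short sketch below shows how a single-base deletion can shift the reading frame and introduce a premature stop codon. The toy coding sequence and the helper name `first_stop_codon` are invented for this illustration and are unrelated to the Hapo-G implementation.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def first_stop_codon(cds):
    """Return the index of the first in-frame stop codon, or None."""
    for i in range(0, len(cds) - 2, 3):
        if cds[i:i + 3] in STOP_CODONS:
            return i // 3
    return None


# Toy coding sequence: ATG GCT GAG CTG ACC AAA TAA
original = "ATGGCTGAGCTGACCAAATAA"

# Simulate a one-base deletion error at position 5 (0-based).
with_deletion = original[:5] + original[6:]

stop_before = first_stop_codon(original)       # stop at the last codon
stop_after = first_stop_codon(with_deletion)   # frame shift exposes TGA early
```

Here the deletion turns the frame into ATG GCG AGC TGA, so translation would stop after three codons instead of six.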

Analyzing Low-Level mtDNA Heteroplasmy—Pitfalls and Challenges from Bench to Benchmarking

International Journal of Molecular Sciences ◽ 10.3390/ijms22020935 ◽ 2021 ◽ Vol 22 (2) ◽ pp. 935

Author(s): Federica Fazzini ◽ Liane Fendt ◽ Sebastian Schönherr ◽ Lukas Forer ◽ Bernd Schöpf ◽ ...

Keyword(s): Variant Calling ◽ Illumina MiSeq ◽ Bioinformatic Analysis ◽ Low Level ◽ Sequencing Technologies ◽ Laboratory Setup ◽ DNA Mixture ◽ The Individual ◽ Highly Sensitive Detection ◽ Sensitivity Specificity

Massively parallel sequencing technologies promise highly sensitive detection of low-level mutations, especially in mitochondrial DNA (mtDNA) studies. However, the processes from DNA extraction and library construction to bioinformatic analysis include several varying tasks, and there is no validated recommendation for the comprehensive procedure. In this study, we examined potential pitfalls affecting the sequencing results, based on two-person mtDNA mixtures. To this end, we compared three DNA polymerases and six different variant callers in five mixtures with variant allele frequencies between 50% and 0.5%, generated with two different amplification protocols. In total, 48 samples were sequenced on Illumina MiSeq. Low-level variant calling at the 1% variant level and below was assessed by comparing trimming and PCR duplicate removal as well as six different variant callers. The results indicate that sensitivity, specificity, and precision depend strongly on the investigated polymerase but also vary with the analysis tools. Our data highlight the advantage of prior standardization and validation of the individual laboratory setup with a DNA mixture model. Finally, we provide an artificial heteroplasmy benchmark dataset that can help improve somatic variant callers or pipelines, which may be of great interest for research related to cancer and aging.
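
For readers less familiar with the three metrics named above, the following sketch shows how sensitivity, specificity and precision could be computed for one variant caller against a mixture with known variant positions. The function name, the representation of calls as sets of positions, and the toy numbers are assumptions for illustration only; they are not taken from the study.

```python
def caller_metrics(called, truth, candidates):
    """Compare called variant positions against a known truth set.

    called:     positions reported by the variant caller
    truth:      positions known to differ between the mixed samples
    candidates: all positions that were evaluated (e.g. the whole mtDNA)
    """
    called, truth, candidates = set(called), set(truth), set(candidates)
    tp = len(called & truth)               # true variants that were called
    fp = len(called - truth)               # calls with no true variant
    fn = len(truth - called)               # true variants that were missed
    tn = len(candidates - called - truth)  # correctly uncalled positions
    nan = float("nan")
    return {
        "sensitivity": tp / (tp + fn) if tp + fn else nan,
        "specificity": tn / (tn + fp) if tn + fp else nan,
        "precision": tp / (tp + fp) if tp + fp else nan,
    }


# Toy example with made-up positions.
metrics = caller_metrics(
    called={73, 150, 263, 310},
    truth={73, 150, 263, 750},
    candidates=range(1, 1001),
)
```

With a mixture model the truth set is known by construction, which is what makes such a comparison across polymerases and callers possible.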

LeafGo: Leaf to Genome, a quick workflow to produce high-quality De novo genomes with Third Generation Sequencing technology

10.1101/2021.01.25.428044 ◽ 2021

Author(s): Patrick Driguez ◽ Salim Bougouffa ◽ Karen Carty ◽ Alexander Putra ◽ Kamel Jabbari ◽ ...

Keyword(s): De Novo ◽ Rapid Development ◽ Plant Genome ◽ Plant Genomics ◽ High Quality ◽ High Molecular Weight DNA ◽ Tissue Samples ◽ Sequencing Technologies ◽ The Cost ◽ New Generation

Recent years have witnessed a rapid development of sequencing technologies. Fundamental differences and limitations among the various platforms affect the time, cost and accuracy of sequencing whole genomes. Here we designed a complete de novo plant genome generation workflow that starts from plant tissue samples and produces high-quality draft genomes with relatively modest laboratory and bioinformatic resources within seven days. To optimize our workflow, we selected different plant species, from which we extracted high molecular weight DNA to make PacBio and ONT libraries for sequencing on the Sequel I, Sequel II and GridION platforms. We assembled high-quality draft genomes of two different Eucalyptus species, E. rudis and E. camaldulensis, to chromosome level without using additional scaffolding technologies. For the rapid production of de novo genome assemblies of plant species, we showed that our DNA extraction protocol, followed by PacBio high-fidelity sequencing and assembly with new-generation assemblers such as hifiasm, produces excellent results. Our findings will be a valuable benchmark for groups planning wet- and dry-lab plant genomics research and for high-throughput plant genomics initiatives.

Decona: From demultiplexing to consensus for Nanopore amplicon data

ARPHA Conference Abstracts ◽ 10.3897/aca.4.e65029 ◽ 2021 ◽ Vol 4

Author(s): Saskia Oosterbroek ◽ Karlijn Doorenspleet ◽ Reindert Nijland ◽ Lara Jansen

Keyword(s): Sequence Data ◽ Variant Calling ◽ Environmental DNA ◽ Laptop Computer ◽ Consensus Sequences ◽ Sequencing Errors ◽ Blast Output ◽ Command Line Tool ◽ Microbial Symbionts ◽ User Friendly

Sequencing of long amplicons is one of the major benefits of Nanopore technologies, as it allows for reads much longer than those from Illumina. One of the major challenges in the analysis of these long Nanopore reads is the relatively high error rate. Sequencing errors are generally corrected by consensus generation and polishing. This is still a challenge for mixed samples such as metabarcoding environmental DNA, bulk DNA, mixed amplicon PCRs and contaminated samples, because sequence data have to be clustered before consensus generation. To this end, we developed Decona (https://github.com/Saskia-Oosterbroek/decona), a command line tool that creates consensus sequences from mixed (metabarcoding) samples using a single command. Decona uses the CD-HIT algorithm to cluster reads after demultiplexing (qcat) and filtering (NanoFilt). The sequences in each cluster are subsequently aligned (Minimap2), consensus sequences are generated (Racon) and finally polished (Medaka). Variant calling of the clusters (Medaka) is optional. With the integration of the BLAST+ application, Decona not only generates consensus sequences but also produces BLAST output if desired. The program can be used on a laptop computer, making it suitable for use under field conditions. Amplicon data ranging from 300 to 7500 nucleotides were successfully processed by Decona, creating consensus sequences reaching over 99.9% read identity. This included fish datasets (environmental DNA from filtered water) from a curated aquarium, vertebrate datasets that were contaminated with human sequences, and sponge datasets in which sponge sequences had to be separated from their countless microbial symbionts. Decona considerably simplifies and speeds up post-sequencing processes, providing consensus sequences and BLAST output through a single command. Classifying consensus sequences instead of raw sequences improves classification accuracy and drastically decreases the number of sequences that need to be classified. Overall, it is a user-friendly option for researchers with limited knowledge of script-based data processing.
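
The reason mixed samples must be clustered before consensus generation can be illustrated with a greedy, representative-based clustering pass in the spirit of CD-HIT. The sketch below is a simplified stand-in and not Decona's code: it uses `difflib.SequenceMatcher` from the Python standard library as a crude identity measure, and the function name and the `min_identity` threshold are assumptions for this example.

```python
from difflib import SequenceMatcher


def greedy_cluster(reads, min_identity=0.9):
    """Greedy incremental clustering.

    Reads are processed from longest to shortest. Each read joins the first
    cluster whose representative it matches at or above `min_identity`;
    otherwise it becomes the representative of a new cluster.
    Returns a list of (representative, members) pairs.
    """
    clusters = []
    for read in sorted(reads, key=len, reverse=True):
        for representative, members in clusters:
            matcher = SequenceMatcher(None, representative, read, autojunk=False)
            if matcher.ratio() >= min_identity:
                members.append(read)
                break
        else:
            # No existing representative is similar enough.
            clusters.append((read, [read]))
    return clusters


# Two toy "species", each with one read carrying a single error.
reads = [
    "ACGTACGTACGTACGTACGT",
    "ACGTACGTACGTACGTACGA",
    "TTTTGGGGCCCCAAAATTTT",
    "TTTTGGGGCCCCAAAATTTA",
]
clusters = greedy_cluster(reads)
```

Each resulting cluster could then be aligned and collapsed into its own consensus sequence; without the clustering step, a single consensus would blend reads from different organisms.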

Accurate Filtering of Privacy-Sensitive Information in Raw Genomic Data

10.1101/292185 ◽ 2018

Author(s): Jérémie Decouchant ◽ Maria Fernandes ◽ Marcus Völp ◽ Francisco M Couto ◽ Paulo Esteves-Veríssimo

Keyword(s): High Performance ◽ Genomic Data ◽ Sensitive Information ◽ Sensitive Data ◽ Variable Regions ◽ Fine Grained ◽ Sequencing Errors ◽ Sequencing Technologies ◽ Human Genomes ◽ Long Reads

Sequencing thousands of human genomes has enabled breakthroughs in many areas, among them precision medicine, the study of rare diseases, and forensics. However, mass collection of such sensitive data entails enormous risks if it is not protected to the highest standards. In this article, we argue that post-alignment privacy is not enough and that data should be automatically protected as early as possible in the genomics workflow, ideally immediately after the data are produced. We show that a previous approach for filtering short reads cannot be extended to long reads, and we present a novel filtering approach that classifies raw genomic data (i.e., data whose location and content are not yet determined) into privacy-sensitive (i.e., more affected by a successful privacy attack) and non-privacy-sensitive information. Such a classification allows the fine-grained and automated adjustment of protective measures to mitigate the possible consequences of exposure, in particular when relying on public clouds. We present the first filter that can be applied to reads of any length, making it usable with any recent or future sequencing technology. The filter is accurate, in the sense that it detects all known sensitive nucleotides except those located in highly variable regions (fewer than 10 nucleotides remain undetected per genome instead of 100,000 in previous works). It has far fewer false positives than previously known methods (10% instead of 60%) and can detect sensitive nucleotides despite sequencing errors (86% detected instead of 56% with 2% of mutations). Finally, practical experiments demonstrate high performance, both in terms of throughput and memory consumption.

Variant calling and quality control of large-scale human genome sequencing data

Emerging Topics in Life Sciences ◽ 10.1042/etls20190007 ◽ 2019 ◽ Vol 3 (4) ◽ pp. 399-409 ◽ Cited By ~ 1

Author(s): Brandon Jew ◽ Jae Hoon Sul

Keyword(s): Quality Control ◽ Genome Sequencing ◽ Genetic Variants ◽ Large Scale ◽ Variant Calling ◽ Sequencing Data ◽ Computational Approaches ◽ Sequencing Errors ◽ Human Genome Sequencing ◽ Number Of Individuals

Next-generation sequencing has allowed genetic studies to collect genome sequencing data from a large number of individuals. However, raw sequencing data are not usually interpretable due to fragmentation of the genome and technical biases; therefore, analysis of these data requires many computational approaches. First, for each sequenced individual, sequencing data are aligned and further processed to account for technical biases. Then, variant calling is performed to obtain information on the positions of genetic variants and their corresponding genotypes. Quality control (QC) is applied to identify individuals and genetic variants with sequencing errors. These procedures are necessary to generate accurate variant calls from sequencing data, and many computational approaches have been developed for these tasks. This review will focus on current widely used approaches for variant calling and QC.

Best practices for variant calling in clinical sequencing

Genome Medicine ◽ 10.1186/s13073-020-00791-w ◽ 2020 ◽ Vol 12 (1)

Author(s): Daniel C. Koboldt

Keyword(s): Best Practices ◽ Single Molecule ◽ Best Practice ◽ Variant Calling ◽ Clinical Samples ◽ Clinical Genetic ◽ Inherited Disorders ◽ Clinical Sequencing ◽ Sequencing Technologies ◽ Downstream Analysis

Next-generation sequencing technologies have enabled a dramatic expansion of clinical genetic testing both for inherited conditions and diseases such as cancer. Accurate variant calling in NGS data is a critical step upon which virtually all downstream analysis and interpretation processes rely. Just as NGS technologies have evolved considerably over the past 10 years, so too have the software tools and approaches for detecting sequence variants in clinical samples. In this review, I discuss the current best practices for variant calling in clinical sequencing studies, with a particular emphasis on trio sequencing for inherited disorders and somatic mutation detection in cancer patients. I describe the relative strengths and weaknesses of panel, exome, and whole-genome sequencing for variant detection. Recommended tools and strategies for calling variants of different classes are also provided, along with guidance on variant review, validation, and benchmarking to ensure optimal performance. Although NGS technologies are continually evolving, and new capabilities (such as long-read single-molecule sequencing) are emerging, the “best practice” principles in this review should be relevant to clinical variant calling in the long term.

Eukaryotic systematics: a user's guide for cell biologists and parasitologists

Parasitology ◽ 10.1017/s0031182010001708 ◽ 2011 ◽ Vol 138 (13) ◽ pp. 1638-1663 ◽ Cited By ~ 64

Author(s): GISELLE WALKER ◽ RICHARD G. DORRELL ◽ ALEXANDER SCHLACHT ◽ JOEL B. DACKS

Keyword(s): Cellular Level ◽ Point Of View ◽ Eukaryotic Diversity ◽ Sequencing Technologies ◽ Human Habitat ◽ On Line ◽ Supplementary Material ◽ New Generation ◽ Evolution Of Photosynthesis ◽ Evolutionary Point

Single-celled parasites like Entamoeba, Trypanosoma, Phytophthora and Plasmodium wreak untold havoc on human habitat and health. Understanding the position of the various protistan pathogens in the larger context of eukaryotic diversity informs our study of how these parasites operate on a cellular level, as well as how they have evolved. Here, we review the literature that has brought our understanding of eukaryotic relationships from an idea of parasites as primitive cells to a crystallized view of diversity that encompasses 6 major divisions, or supergroups, of eukaryotes. We provide an updated taxonomic scheme (for 2011), based on extensive genomic, ultrastructural and phylogenetic evidence, with three differing levels of taxonomic detail for ease of referencing and accessibility (see supplementary material at Cambridge Journals On-line). Two of the most pressing issues in cellular evolution, the root of the eukaryotic tree and the evolution of photosynthesis in complex algae, are also discussed, along with ideas about what the new generation of genome sequencing technologies may contribute to the field of eukaryotic systematics. We hope that, armed with this user's guide, cell biologists and parasitologists will be encouraged to take an increasingly evolutionary point of view in the battle against parasites that represent real dangers to our livelihoods and lives.
