PURC v2.0: a program for improved sequence inference for polyploid phylogenetics and other manifestations of the multiple-copy problem

2021 ◽  
Author(s):  
Peter W Schafran ◽  
Fay-Wei W Li ◽  
Carl Rothfels

Inferring the true biological sequences from amplicon mixtures remains a difficult bioinformatic problem. The traditional approach is to cluster sequencing reads by similarity thresholds and treat the consensus sequence of each cluster as an "operational taxonomic unit" (OTU). Recently, this approach has been improved upon by model-based methods that correct PCR and sequencing errors in order to infer "amplicon sequence variants" (ASVs). To date, ASV approaches have been used primarily in metagenomics, but they are also useful for identifying allelic or paralogous variants and for determining homeologs in polyploid organisms. To facilitate the use of ASV methods among polyploidy researchers, we incorporated ASV inference alongside OTU clustering in PURC v2.0, a major update to PURC (Pipeline for Untangling Reticulate Complexes). In addition to preserving the original PURC functions, PURC v2.0 allows users to process PacBio CCS/HiFi reads through DADA2 to generate and annotate ASVs for multiplexed data, with outputs including separate alignments for each locus ready for phylogenetic inference. PURC v2.0 also features faster demultiplexing than the original version and has been updated to be compatible with Python 3. In this chapter we present results indicating that PURC v2.0 (using the ASV approach) is more likely to infer the correct biological sequences than the earlier OTU-based PURC, and we describe how to prepare sequencing data, run PURC v2.0 under several different modes, and interpret the output. We expect that PURC v2.0 will provide biologists with a method for generating multi-locus "moderate data" datasets that are large enough to be phylogenetically informative and small enough for manual curation.
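To make the contrast concrete, here is a minimal sketch of the traditional threshold-based OTU clustering that ASV inference improves upon. This is an illustration only, not PURC's actual clustering code (PURC relies on dedicated clustering tools); the identity function and 97% threshold are simplifying assumptions:

```python
from difflib import SequenceMatcher

def identity(a: str, b: str) -> float:
    """Rough pairwise identity between two sequences (illustrative only)."""
    return SequenceMatcher(None, a, b).ratio()

def greedy_otu_cluster(reads, threshold=0.97):
    """Greedy OTU clustering: each read joins the first cluster whose
    seed it matches at >= threshold identity, otherwise it seeds a new
    cluster. The consensus of each cluster is then treated as the OTU."""
    clusters = []
    for read in sorted(reads, key=len, reverse=True):
        for cluster in clusters:
            if identity(read, cluster["seed"]) >= threshold:
                cluster["members"].append(read)
                break
        else:
            clusters.append({"seed": read, "members": [read]})
    return clusters
```

Because any reads within the threshold collapse into one cluster, genuinely distinct homeologs that differ by only a few bases can be merged, which is exactly the failure mode the model-based ASV approach addresses.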

2021 ◽  
Vol 3 (1) ◽  
Author(s):  
Gundula Povysil ◽  
Monika Heinzl ◽  
Renato Salazar ◽  
Nicholas Stoler ◽  
Anton Nekrutenko ◽  
...  

Abstract Duplex sequencing is currently the most reliable method to identify ultra-low-frequency DNA variants. Its accuracy comes from grouping sequence reads derived from the same DNA molecule into families carrying information on both the forward and reverse strands. However, only a small proportion of reads are assembled into duplex consensus sequences (DCS), and reads with potentially valuable information are discarded at different steps of the bioinformatics pipeline, especially reads without a family. We developed a bioinformatics toolset that analyses the tag and family composition with the aim of understanding data loss and implementing modifications that maximize the data output for variant calling. Specifically, our tools show that tags contain polymerase chain reaction and sequencing errors that contribute to data loss and lower DCS yields. Our tools also identified chimeras, which likely reflect barcode collisions. Finally, we developed a tool that re-examines variant calls from raw reads and provides summary data that categorizes the confidence level of each variant call with a tier-based system. With this tool, we can include reads without a family and check the reliability of the call, which substantially increases the sequencing depth for variant calling, a particularly important advantage for low-input samples or low-coverage regions.
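As a sketch of the family structure this kind of pipeline works with, the fragment below groups reads by their duplex tag and pulls out the single-read families that are normally discarded. The field names ("tag" and "seq") are hypothetical placeholders, not the toolset's actual data model:

```python
from collections import defaultdict

def build_families(reads):
    """Group reads sharing a duplex tag into families; each family
    should descend from a single original DNA molecule."""
    families = defaultdict(list)
    for read in reads:
        families[read["tag"]].append(read["seq"])
    return families

def singleton_families(families):
    """Families with a single read: unusable for consensus building
    and normally discarded. A PCR or sequencing error inside a tag
    creates exactly this kind of spurious one-read family."""
    return {tag: seqs for tag, seqs in families.items() if len(seqs) == 1}
```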


2019 ◽  
Author(s):  
Christina Huan Shi ◽  
Kevin Y. Yip

Abstract K-mer counting has many applications in sequencing data processing and analysis. However, sequencing errors can produce many false k-mers that substantially increase the memory requirement during counting. We propose a fast k-mer counting method, CQF-deNoise, which has a novel component for dynamically identifying and removing false k-mers while preserving counting accuracy. Compared with four state-of-the-art k-mer counting methods, CQF-deNoise consumed 49-76% less memory than the second-best method but still ran competitively fast. The k-mer counts from CQF-deNoise produced cell clusters from single-cell RNA-seq data highly consistent with CellRanger, but required only 5% of the running time at the same memory consumption, suggesting that CQF-deNoise can be used to preview cell clusters for early detection of potential data problems, before running a much more time-consuming full analysis pipeline.
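The core idea of separating true from false k-mers can be sketched in a few lines. CQF-deNoise does this dynamically inside a counting quotient filter to save memory during counting; the naive post-hoc version below, with an assumed count threshold, only illustrates why error k-mers are removable at all (most occur just once or twice):

```python
from collections import Counter

def count_kmers(reads, k=21):
    """Count every k-mer occurring in the reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts

def remove_noise(counts, min_count=2):
    """Drop low-abundance k-mers: a single base-call error creates up
    to k novel k-mers, almost all of which occur only once."""
    return {kmer: c for kmer, c in counts.items() if c >= min_count}
```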


2019 ◽  
Vol 3 (4) ◽  
pp. 399-409 ◽  
Author(s):  
Brandon Jew ◽  
Jae Hoon Sul

Abstract Next-generation sequencing has allowed genetic studies to collect genome sequencing data from a large number of individuals. However, raw sequencing data are not usually interpretable due to fragmentation of the genome and technical biases; therefore, analysis of these data requires many computational approaches. First, for each sequenced individual, sequencing data are aligned and further processed to account for technical biases. Then, variant calling is performed to obtain information on the positions of genetic variants and their corresponding genotypes. Quality control (QC) is applied to identify individuals and genetic variants affected by sequencing errors. These procedures are necessary to generate accurate variant calls from sequencing data, and many computational approaches have been developed for these tasks. This review focuses on current, widely used approaches for variant calling and QC.
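As an illustration of the kind of per-variant QC step the review discusses, the sketch below applies depth, quality, and missingness filters. The field names and thresholds are hypothetical; real pipelines tune these to the study design:

```python
def passes_qc(variant, min_depth=10, min_qual=30, max_missing=0.05):
    """Keep a variant only if it has adequate read depth, a high
    call quality, and genotypes present in nearly all individuals."""
    missing_rate = variant["n_missing"] / variant["n_samples"]
    return (variant["mean_depth"] >= min_depth
            and variant["qual"] >= min_qual
            and missing_rate <= max_missing)

# Hypothetical usage on a parsed variant record:
record = {"mean_depth": 24.1, "qual": 87.0, "n_missing": 3, "n_samples": 500}
assert passes_qc(record)
```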


2020 ◽  
Vol 21 (1) ◽  
Author(s):  
Aranka Steyaert ◽  
Pieter Audenaert ◽  
Jan Fostier

Abstract Background De Bruijn graphs are key data structures for the analysis of next-generation sequencing data. They efficiently represent the overlap between reads and hence also the underlying genome sequence. However, sequencing errors and repeated subsequences render the identification of the true underlying sequence difficult. A key step in this process is the inference of the multiplicities of nodes and arcs in the graph. These multiplicities correspond to the number of times each k-mer (resp. (k+1)-mer) implied by a node (resp. arc) is present in the genomic sequence. Determining multiplicities thus reveals the repeat structure and the presence of sequencing errors. Multiplicities of nodes/arcs in the de Bruijn graph are reflected in their coverage; however, coverage variability and coverage biases render their determination ambiguous. Current methods to determine node/arc multiplicities base their decisions solely on the information in nodes and arcs individually, under-utilising the information present in the sequencing data. Results To improve the accuracy with which node and arc multiplicities in a de Bruijn graph are inferred, we developed a conditional random field (CRF) model that efficiently combines the coverage information within each node/arc individually with the information of surrounding nodes and arcs. Multiplicities are thus collectively assigned in a more consistent manner. Conclusions We demonstrate that the CRF model yields significant improvements in accuracy and a more robust expectation-maximisation parameter estimation. True k-mers can be distinguished from erroneous k-mers with a higher F1 score than with existing methods. A C++11 implementation is available at https://github.com/biointec/detox under the GNU AGPL v3.0 license.
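For orientation, the per-node baseline that the CRF improves on can be sketched as an independent rounding decision on coverage. The assumed coverage model and variable names are illustrative; detox's actual model additionally conditions each decision on neighbouring nodes and arcs through CRF factors:

```python
def naive_multiplicity(node_coverage, unique_kmer_coverage):
    """Independent per-node estimate: round the node's average
    coverage to the nearest multiple of the coverage expected for a
    unique (multiplicity-1) k-mer. A result of 0 flags the node as a
    probable sequencing error."""
    return round(node_coverage / unique_kmer_coverage)

# With roughly 30x coverage per unique k-mer:
assert naive_multiplicity(3.0, 30.0) == 0    # likely error node
assert naive_multiplicity(62.5, 30.0) == 2   # likely two-copy repeat
```

Because each node is decided in isolation, a node whose coverage falls between two multiples is essentially a coin flip; combining it with the multiplicities of its neighbours, as the CRF does, resolves many of these ambiguous cases consistently.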


1984 ◽  
Vol 4 (6) ◽  
pp. 1115-1124 ◽  
Author(s):  
K J Brunke ◽  
J G Anthony ◽  
E J Sternberg ◽  
D P Weeks

The 5' coding and promoter regions of the four coordinately regulated tubulin genes of Chlamydomonas reinhardi have been mapped and sequenced. DNA sequencing data show that the predicted N-terminal amino acid sequences of Chlamydomonas alpha- and beta-tubulins closely match those of tubulins of other eucaryotes. Within the alpha 1- and alpha 2-tubulin gene set and the beta 1- and beta 2-tubulin gene set, both nucleotide sequence and intron placement are highly conserved. Transcription initiation sites have been located by primer extension analysis at 140, 141, 159, and 132 base pairs upstream of the translation initiator codon for the alpha 1-, alpha 2-, beta 1-, and beta 2-tubulin genes, respectively. Among the structures with potential regulatory significance, the most striking is a 16-base-pair consensus sequence [GCTC(G/C)AAGGC(G/T)(G/C)--(C/A)(C/A)G] which is found in multiple copies immediately upstream of the TATA box in each of the four genes. An unexpected discovery is the presence of pseudopromoter regions in two of the transcribed tubulin genes. One pseudopromoter region is located 400 base pairs upstream of the authentic alpha 2-tubulin gene promoter, whereas the other is located within the transcribed 5' noncoding region of the beta 1-tubulin gene.


2016 ◽  
Vol 6 (1) ◽  
Author(s):  
Santosh Anand ◽  
Eleonora Mangano ◽  
Nadia Barizzone ◽  
Roberta Bordoni ◽  
Melissa Sorosina ◽  
...  

Abstract Sequencing large numbers of individuals, as is often needed for population genetics studies, is still economically challenging despite the falling costs of Next Generation Sequencing (NGS). Pool-seq is an alternative cost- and time-effective option in which DNA from several individuals is pooled for sequencing. However, pooling of DNA creates new problems and challenges for accurate variant calling and allele frequency (AF) estimation. In particular, sequencing errors are confounded with alleles present at low frequency in the pools, possibly giving rise to false-positive variants. We sequenced 996 individuals in 83 pools (12 individuals/pool) in a targeted re-sequencing experiment. We show that Pool-seq AFs are robust and reliable by comparing them with public variant databases and with in-house SNP-genotyping data for the individual subjects of the pools. Furthermore, we propose a simple filtering guideline for the removal of spurious variants based on the Kolmogorov-Smirnov statistical test. We experimentally validated our filters by comparing Pool-seq to individual sequencing data, showing that the filters remove most of the false variants while retaining the majority of true variants. The proposed guideline is fairly generic in nature and could easily be applied in other Pool-seq experiments.
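The abstract does not spell out which distributions the authors' Kolmogorov-Smirnov filter compares, so the sketch below shows one generic KS-based artifact filter as an assumption: it tests whether the base qualities of alt-supporting reads differ from those of ref-supporting reads, a common signature of spurious calls:

```python
from scipy.stats import ks_2samp

def ks_artifact_filter(alt_base_quals, ref_base_quals, alpha=0.05):
    """Flag a candidate variant as a likely artifact when the base
    qualities of reads supporting the alternate allele come from a
    different distribution than those supporting the reference
    allele. Returns True when the variant should be kept."""
    statistic, pvalue = ks_2samp(alt_base_quals, ref_base_quals)
    return pvalue >= alpha
```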


2018 ◽  
Author(s):  
Nicholas Stoler ◽  
Barbara Arbeithuber ◽  
Gundula Povysil ◽  
Monika Heinzl ◽  
Renato Salazar ◽  
...  

Abstract Duplex sequencing is the most accurate approach for the identification of sequence variants present at very low frequencies. Its power comes from pooling together multiple descendants of both strands of the original DNA molecules, which allows true nucleotide substitutions to be distinguished from PCR amplification and sequencing artifacts. This strategy comes at a cost: sequencing the same molecule multiple times increases dynamic range but significantly diminishes coverage, making whole-genome duplex sequencing prohibitively expensive. Furthermore, every duplex experiment produces a substantial proportion of singleton reads that cannot be used in the analysis and are effectively thrown away. In this paper we demonstrate that a significant fraction of these reads contains PCR or sequencing errors within the duplex tags. Correcting such errors allows these reads to be "reunited" with their respective families, increasing the output of the method and making it more cost-effective. Additionally, we combine this error correction strategy with a number of algorithmic improvements in a new version of the duplex analysis software, Du Novo 2.0, readily available through Galaxy, Bioconda, and as source code.
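A minimal sketch of tag-based rescue, under the simplifying assumptions that all tags have equal length and that a singleton is reassigned only when exactly one established family tag lies within one mismatch. Du Novo's actual correction is more sophisticated, so this only illustrates the underlying idea:

```python
def hamming(a: str, b: str) -> int:
    """Number of mismatching positions between two equal-length tags."""
    return sum(x != y for x, y in zip(a, b))

def rescue_singletons(singleton_tags, family_tags, max_dist=1):
    """Map each singleton tag to an established family tag within
    max_dist mismatches, provided the match is unambiguous. Rescued
    reads rejoin their family instead of being discarded."""
    rescued = {}
    for tag in singleton_tags:
        hits = [f for f in family_tags if hamming(tag, f) <= max_dist]
        if len(hits) == 1:
            rescued[tag] = hits[0]
    return rescued
```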


2018 ◽  
Author(s):  
Huan Truong

A number of problems in the fields of bioinformatics, systems biology, and computational biology require abstracting physical entities into mathematical or computational models. In such studies, the computational paradigms often involve algorithms that can be solved by the Central Processing Unit (CPU). Historically, those algorithms have benefited from the advancement of computing power in the serial processing capabilities of individual CPU cores. However, that growth has slowed in recent years, as scaling out CPUs has been shown to be both cost-prohibitive and insecure. To overcome this problem, parallel computing approaches that employ the Graphics Processing Unit (GPU) have gained attention as complements to or replacements for traditional CPU approaches. The premise of this research is to investigate the applicability of various parallel computing platforms to several problems in the detection and analysis of homology in biological sequences. I hypothesize that by exploiting the sheer amount of computational power and sequencing data available, it is possible to deduce information from raw sequences without supplying prior knowledge of the underlying biology. I have developed tools to perform analysis at scales that are traditionally unattainable with general-purpose CPU platforms. I developed a method to accelerate sequence alignment on the GPU, and I used it to investigate whether the Operational Taxonomic Unit (OTU) classification problem can be improved with such computational power. I also developed a method to accelerate pairwise k-mer comparison on the GPU, and used it to build PolyHomology, a framework that scaffolds shared sequence motifs across large numbers of genomes to illuminate the structure of the regulatory network in yeasts. The results suggest that this approach to heterogeneous computing can help answer questions in biology and is a viable path to new discoveries, both now and in the future.


2021 ◽  
Author(s):  
Ana Jung ◽  
Samuel D Chorlton

NanoCLUST has enabled species-level taxonomic classification from noisy nanopore 16S sequencing data for BugSeq's users and the broader nanopore sequencing community. We noticed a high misclassification rate for NanoCLUST-derived consensus 16S sequences due to its use of BLAST top-hit taxonomy assignment. We replaced the consensus sequence classifier of NanoCLUST with QIIME2's VSEARCH-based classifier to enable greater accuracy. Using mock microbial community and clinical 16S sequencing data, we show that this replacement significantly improves nanopore 16S accuracy (by over 5% in recall and 19% in precision), and we make this new tool (BugSeq 16S) freely available for academic use at BugSeq.com/free.
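The difference between top-hit and consensus assignment is easy to illustrate. The sketch below is not QIIME2's implementation; it is a toy majority-consensus classifier over semicolon-delimited taxonomy strings, with an assumed 51% majority threshold, showing why pooling multiple hits gives a more conservative call than a single BLAST top hit:

```python
from collections import Counter

def consensus_taxonomy(hit_taxonomies, min_fraction=0.51):
    """Walk down the taxonomic ranks of all database hits and keep a
    rank only while a single taxon holds a majority; stop at the
    first rank where the hits disagree."""
    consensus = []
    for rank_labels in zip(*(t.split(";") for t in hit_taxonomies)):
        label, count = Counter(rank_labels).most_common(1)[0]
        if count / len(rank_labels) < min_fraction:
            break
        consensus.append(label)
    return ";".join(consensus) if consensus else "Unassigned"

hits = ["Bacteria;Proteobacteria;Gammaproteobacteria",
        "Bacteria;Proteobacteria;Alphaproteobacteria"]
# A top-hit classifier would report the full first lineage; the
# consensus stops at the rank where the hits conflict:
print(consensus_taxonomy(hits))  # Bacteria;Proteobacteria
```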


2021 ◽  
Author(s):  
Jose Francisco Sanchez-Herrero ◽  
Raquel Pluvinet ◽  
Antonio Luna-de Haro ◽  
Lauro Sumoy

Abstract Background Next generation sequencing has allowed the discovery of miRNA isoforms, termed isomirs. Some isomirs are derived from imprecise processing of pre-miRNA precursors, leading to length variants. Additional variability is introduced by non-templated addition of bases at the ends or by editing of internal bases, resulting in base differences relative to the template DNA sequence. We hypothesized that some component of the isomir variation reported so far could be due to systematic technical noise rather than real biological variation. Results We have developed the XICRA pipeline to analyze small RNA sequencing data at the isomir level. We exploited its ability to use single or merged reads to compare isomir results derived from paired-end (PE) reads with those from single reads (SR), in order to address whether the sequence differences relative to canonical miRNAs found in isomirs are true biological variation or the result of sequencing errors. We detected non-negligible systematic differences between SR and PE data, which primarily affect putative internally edited isomirs and, at a much lower frequency, isomirs with terminal length changes. This is relevant for the identification of true isomirs in small RNA sequencing datasets. Conclusions We conclude that potential artifacts derived from sequencing errors and/or data processing could result in an overestimation of the abundance and diversity of miRNA isoforms. Efforts to annotate the isomirnome should take this into account.
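To make the isomir categories concrete, here is a toy classifier for a read against its canonical miRNA. It is a deliberate simplification (XICRA's annotation distinguishes 5' and 3' variants, non-templated additions, and more); only the two classes discussed above are sketched:

```python
def classify_isomir(read: str, canonical: str) -> str:
    """Crude isomir call: canonical match, a 3' length variant
    (trimming or templated extension), or an internal edit
    (same length with at least one substitution)."""
    if read == canonical:
        return "canonical"
    if read.startswith(canonical) or canonical.startswith(read):
        return "3p_length_variant"
    if len(read) == len(canonical):
        return "internal_edit"
    return "other"

assert classify_isomir("UGAGGUAGUAGGUUGUAUAG",
                       "UGAGGUAGUAGGUUGUAUAGUU") == "3p_length_variant"
```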

