Decona: From demultiplexing to consensus for Nanopore amplicon data

ARPHA Conference Abstracts ◽

10.3897/aca.4.e65029 ◽

2021 ◽

Vol 4 ◽

Author(s):

Saskia Oosterbroek ◽

Karlijn Doorenspleet ◽

Reindert Nijland ◽

Lara Jansen

Keyword(s):

Sequence Data ◽

Variant Calling ◽

Environmental DNA ◽

Laptop Computer ◽

Consensus Sequences ◽

Sequencing Errors ◽

Blast Output ◽

Command Line Tool ◽

Microbial Symbionts ◽

User Friendly

Sequencing of long amplicons is one of the major benefits of Nanopore technologies, as it allows for reads much longer than those produced by Illumina platforms. One of the major challenges in the analysis of these long Nanopore reads is the relatively high error rate. Sequencing errors are generally corrected by consensus generation and polishing. This is still a challenge for mixed samples such as metabarcoding environmental DNA, bulk DNA, mixed-amplicon PCRs and contaminated samples, because sequence data have to be clustered before consensus generation. To this end, we developed Decona (https://github.com/Saskia-Oosterbroek/decona), a command line tool that creates consensus sequences from mixed (metabarcoding) samples using a single command. Decona uses the CD-HIT algorithm to cluster reads after demultiplexing (qcat) and filtering (NanoFilt). The sequences in each cluster are subsequently aligned (Minimap2), consensus sequences are generated (Racon) and finally polished (Medaka). Variant calling of the clusters (Medaka) is optional. With the integration of the BLAST+ application, Decona not only generates consensus sequences but also produces BLAST output if desired. The program can be used on a laptop computer, making it suitable for use under field conditions. Amplicon data ranging from 300 to 7500 nucleotides were successfully processed by Decona, creating consensus sequences reaching over 99.9% read identity. These included fish datasets (environmental DNA from filtered water) from a curated aquarium, vertebrate datasets that were contaminated with human sequences, and sponge datasets in which sponge sequences had to be separated from their countless microbial symbionts. Decona considerably simplifies and speeds up post-sequencing processes, providing consensus sequences and BLAST output through a single command. Classifying consensus sequences instead of raw sequences improves classification accuracy and drastically decreases the number of sequences that need to be classified. Overall, it is a user-friendly option for researchers with limited knowledge of script-based data processing.
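
The cluster-then-consensus idea in the Decona abstract can be illustrated with a toy Python sketch. This is not Decona's implementation: the real pipeline relies on CD-HIT, Minimap2, Racon and Medaka, whereas the sketch assumes equal-length, already aligned reads, a position-wise identity measure and a hypothetical 0.8 identity threshold. All function names are made up for illustration.

```python
from collections import Counter


def identity(a: str, b: str) -> float:
    """Fraction of matching positions; a toy stand-in for alignment identity."""
    if len(a) != len(b):
        return 0.0
    return sum(x == y for x, y in zip(a, b)) / len(a)


def greedy_cluster(reads, threshold=0.8):
    """Greedy incremental clustering: each read joins the first cluster whose
    representative (its first member) is at least `threshold` identical,
    otherwise it starts a new cluster."""
    clusters = []
    for read in reads:
        for cluster in clusters:
            if identity(read, cluster[0]) >= threshold:
                cluster.append(read)
                break
        else:
            clusters.append([read])
    return clusters


def majority_consensus(cluster):
    """Column-wise majority vote over equal-length reads."""
    return "".join(Counter(col).most_common(1)[0][0] for col in zip(*cluster))


# Hypothetical mixed sample: two templates, each read with at most one error.
reads = [
    "ACGTACGTAC",
    "ACGTACGTAC",
    "ACGAACGTAC",  # one substitution error
    "TTGGCCAATT",
    "TTGGCCAATT",
    "TTGGCCTATT",  # one substitution error
]
consensus = [majority_consensus(c) for c in greedy_cluster(reads)]
print(consensus)
```

Clustering first keeps reads from different templates from being averaged together, which is why mixed samples need this step before consensus generation.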

Increased yields of duplex sequencing data by a series of quality control tools

NAR Genomics and Bioinformatics ◽

10.1093/nargab/lqab002 ◽

2021 ◽

Vol 3 (1) ◽

Author(s):

Gundula Povysil ◽

Monika Heinzl ◽

Renato Salazar ◽

Nicholas Stoler ◽

Anton Nekrutenko ◽

...

Keyword(s):

Low Frequency ◽

Variant Calling ◽

Data Loss ◽

Sequencing Data ◽

Bioinformatics Pipeline ◽

Consensus Sequences ◽

Sequencing Errors ◽

Data Output ◽

Reverse Strand ◽

Duplex Sequencing

Abstract Duplex sequencing is currently the most reliable method to identify ultra-low frequency DNA variants by grouping sequence reads derived from the same DNA molecule into families with information on the forward and reverse strand. However, only a small proportion of reads are assembled into duplex consensus sequences (DCS), and reads with potentially valuable information are discarded at different steps of the bioinformatics pipeline, especially reads without a family. We developed a bioinformatics toolset that analyses the tag and family composition in order to understand data loss and implement modifications that maximize the data output for variant calling. Specifically, our tools show that tags contain polymerase chain reaction and sequencing errors that contribute to data loss and lower DCS yields. Our tools also identified chimeras, which likely reflect barcode collisions. Finally, we developed a tool that re-examines variant calls from raw reads and provides summary data that categorize the confidence level of a variant call in a tier-based system. With this tool, we can include reads without a family and check the reliability of the call, which substantially increases the sequencing depth for variant calling, a particularly important advantage for low-input samples or low-coverage regions.
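
The family grouping described in this abstract can be sketched in a few lines of Python. This is a minimal illustration, not the authors' toolset: the read representation (tag, strand label "ab" or "ba", sequence), the `min_family_size` parameter and the use of "N" for strand disagreement are all assumptions made for the example.

```python
from collections import Counter, defaultdict


def majority(seqs):
    """Single-strand consensus: column-wise majority vote over equal-length reads."""
    return "".join(Counter(col).most_common(1)[0][0] for col in zip(*seqs))


def duplex_consensus(reads, min_family_size=2):
    """Group reads by tag and strand, then build a duplex consensus (DCS)
    only for tags that have a large enough family on both strands."""
    families = defaultdict(lambda: defaultdict(list))
    for tag, strand, seq in reads:
        families[tag][strand].append(seq)

    dcs, discarded = {}, []
    for tag, strands in families.items():
        fwd, rev = strands.get("ab", []), strands.get("ba", [])
        if len(fwd) < min_family_size or len(rev) < min_family_size:
            discarded.append(tag)  # this is where data loss occurs
            continue
        sscs_fwd, sscs_rev = majority(fwd), majority(rev)
        # Keep a base only where both strand consensuses agree.
        dcs[tag] = "".join(a if a == b else "N" for a, b in zip(sscs_fwd, sscs_rev))
    return dcs, discarded


# Hypothetical reads: tag T1 has both strands, tag T2 has only one.
reads = [
    ("T1", "ab", "ACGT"), ("T1", "ab", "ACGT"), ("T1", "ab", "ACTT"),
    ("T1", "ba", "ACGT"), ("T1", "ba", "ACGT"), ("T1", "ba", "ACGT"),
    ("T2", "ab", "GGCC"), ("T2", "ab", "GGCC"),
]
dcs, discarded = duplex_consensus(reads)
print(dcs, discarded)
```

In this toy input the T2 reads are dropped because no reverse-strand family exists, which mirrors the kind of data loss the abstract sets out to quantify.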

Bacteria are everywhere, even in your COI data: The art of getting to know the unknown unknowns and shine light on the dark matter!

ARPHA Conference Abstracts ◽

10.3897/aca.4.e64966 ◽

2021 ◽

Vol 4 ◽

Author(s):

Haris Zafeiropoulos ◽

Laura Gargan ◽

Christina Pavloudi ◽

Evangelos Pafilis ◽

Jens Carlsson

Keyword(s):

Dark Matter ◽

Phylogenetic Tree ◽

Sequence Data ◽

Marker Gene ◽

Environmental DNA ◽

Tree Of Life ◽

Taxonomic Assignment ◽

Species Identity ◽

Consensus Sequences ◽

COI Sequences

Environmental DNA (eDNA) metabarcoding has been commonly used in recent years (Jeunen et al. 2019) for the identification of the species composition of environmental samples. By making use of genetic markers anchored in conserved gene regions, universally present across the species of large taxonomic groups, eDNA metabarcoding exploits both extra- and intra-cellular DNA fragments for biodiversity assessment. However, there is not a truly “universal” marker gene that is capable of amplifying all species across different taxa (Kress et al. 2015). The mitochondrial cytochrome C oxidase subunit I gene (COI) has many of the desirable properties of a “universal” marker and has been widely used for assessing species identity in Eukaryotes, especially metazoans (Andújar et al. 2018). However, a great number of COI Operational Taxonomic Units (OTUs) and/or Amplicon Sequence Variants (ASVs) retrieved from such studies do not match reference sequences and are often referred to as “dark matter” (Deagle et al. 2014). The aim of this study was to discover the origins and identities of these COI dark matter sequences. We built a reference phylogenetic tree that included as much COI sequence information across the tree of life as possible. An overview of the steps followed is presented in Fig. 1a. Briefly, the Midori reference 2 database was used to retrieve eukaryote sequences (183,330 species). In addition, the API of the BOLD database was used as the source for the corresponding Bacteria (559 genera) and Archaea (41 genera) sequences. Consensus sequences at the family level were constructed from each of these three initial COI datasets. The COI-oriented reference phylogenetic tree of life was then built using 1,240 consensus sequences, with more than 80% of those coming from eukaryotic taxa. Phylogeny-based taxonomic assignment was then used to place query sequences.
Three subsets of OTUs from marine and freshwater samples, retrieved during in-house metabarcoding experiments, were placed in the reference tree (Fig. 1b): a) all sequences, b) sequences assigned to Eukaryotes and c) unassigned sequences. It is clear that a large proportion of sequences targeting the COI region of Eukaryotes actually represents bacterial branches in the phylogenetic tree (Fig. 1b). We conclude that COI metabarcoding studies targeting Eukaryotes may come with a great bias derived from amplification and sequencing of bacterial taxa, depending on the primer pair used. However, for the time being, publicly available bacterial COI sequences are far too few to represent the bacterial variability; thus, a reliable taxonomic identification of them is not possible. We suggest that bacterial COI sequences should be included in the reference databases used for the taxonomy assignment of OTUs/ASVs in COI-based eukaryote metabarcoding studies, so that amplified bacterial sequences can be identified and excluded as non-target sequences. Further, the approach presented here allows researchers to better understand the unknown unknowns and shed light on the dark matter of their metabarcoding sequence data.

A reference bacterial genome dataset generated on the MinION™ portable single-molecule nanopore sequencer

10.1101/009613 ◽

2014 ◽

Author(s):

Josh Quick ◽

Aaron Quinlan ◽

Nicholas Loman

Keyword(s):

Single Molecule ◽

De Novo ◽

Sequence Data ◽

Bacterial Genome ◽

Model Organism ◽

Variant Calling ◽

Laptop Computer ◽

Early Access ◽

DNA Strands ◽

K 12

Background: The MinION™ is a new, portable single-molecule sequencer developed by Oxford Nanopore Technologies. It measures four inches in length and is powered from the USB 3.0 port of a laptop computer. By measuring the change in current produced when DNA strands translocate through and interact with a charged protein nanopore, the device is able to deduce the underlying nucleotide sequence. Findings: We present a read dataset from whole-genome shotgun sequencing of the model organism Escherichia coli K-12 substr. MG1655 generated on a MinION™ device during the early-access MinION Access Program (MAP). Two sequencing runs of the MinION™ are presented, one generated using R7 chemistry (released in July 2014) and one using R7.3 (released in September 2014). Conclusions: Base-called sequence data are provided to demonstrate the nature of data produced by the MinION™ platform and to encourage the development of customised methods for alignment, consensus and variant calling, de novo assembly and scaffolding. FAST5 files containing event data within the HDF5 container format are provided to assist with the development of improved base-calling methods. Datasets are provided through the GigaDB database at http://gigadb.org/dataset/100102

SiNPle: Fast and Sensitive Variant Calling for Deep Sequencing Data

Genes ◽

10.3390/genes10080561 ◽

2019 ◽

Vol 10 (8) ◽

pp. 561 ◽

Cited By ~ 5

Author(s):

Luca Ferretti ◽

Chandana Tennakoon ◽

Adrian Silesian ◽

Graham Freimanis ◽

Paolo Ribeca

Keyword(s):

Deep Sequencing ◽

High Throughput Sequencing ◽

Sequence Data ◽

Analytical Formula ◽

Real Life ◽

Variant Calling ◽

Error Rates ◽

High Coverage ◽

Sequencing Errors ◽

Very High

Current high-throughput sequencing technologies can generate sequence data and provide information on the genetic composition of samples at very high coverage. Deep sequencing approaches enable the detection of rare variants in heterogeneous samples, such as viral quasi-species, but also have the undesired effect of amplifying sequencing errors and artefacts. Distinguishing real variants from such noise is not straightforward. Variant callers that can handle pooled samples can struggle at extremely high read depths, while at lower depths sensitivity is often sacrificed for specificity. In this paper, we propose SiNPle (Simplified Inference of Novel Polymorphisms from Large coveragE), fast and effective software for variant calling. SiNPle is based on a simplified Bayesian approach to compute the posterior probability that a variant is not generated by sequencing errors or PCR artefacts. The Bayesian model takes into consideration individual base qualities as well as their distribution, the baseline error rates during both the sequencing and the PCR stage, the prior distribution of variant frequencies and their strandedness. Our approach leads to an approximate but extremely fast computation of posterior probabilities even for very high coverage data, since the expression for the posterior distribution is a simple analytical formula in terms of summary statistics for the variants appearing at each site in the genome. These statistics can be used to filter out putative SNPs and indels according to the required level of sensitivity. We tested SiNPle on several simulated and real-life viral datasets to show that it is faster and more sensitive than existing methods. The source code for SiNPle is freely available to download and compile, or as a Conda/Bioconda package.
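
To make the idea of a posterior probability that a variant is real rather than noise more concrete, here is a deliberately simplified Python sketch. It is not SiNPle's model: it ignores base qualities, strandedness and PCR errors, compares only two hypotheses (a true variant at an assumed frequency versus pure sequencing error), and every parameter value below is a hypothetical choice.

```python
import math


def log_binom_kernel(k, n, p):
    """Log binomial likelihood without the binomial coefficient, which is
    the same under both hypotheses and cancels in the posterior."""
    return k * math.log(p) + (n - k) * math.log1p(-p)


def posterior_variant(k, n, freq, error_rate, prior=0.01):
    """Posterior probability that k alternative reads out of n come from a
    true variant at frequency `freq` rather than from errors at `error_rate`."""
    log_var = math.log(prior) + log_binom_kernel(k, n, freq)
    log_err = math.log(1.0 - prior) + log_binom_kernel(k, n, error_rate)
    diff = log_err - log_var
    if diff > 700:  # avoid overflow in exp(); the posterior is effectively 0
        return 0.0
    return 1.0 / (1.0 + math.exp(diff))


# 200 alternative reads at depth 10,000 are far more than errors alone predict.
print(posterior_variant(200, 10_000, freq=0.02, error_rate=0.001))
# 10 alternative reads at the same depth match the assumed error rate.
print(posterior_variant(10, 10_000, freq=0.02, error_rate=0.001))
```

Working in log space keeps the calculation stable at very high coverage, where the raw likelihoods would underflow.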

Introducing mothur: Open-Source, Platform-Independent, Community-Supported Software for Describing and Comparing Microbial Communities

Applied and Environmental Microbiology ◽

10.1128/aem.01541-09 ◽

2009 ◽

Vol 75 (23) ◽

pp. 7537-7541 ◽

Cited By ~ 11597

Author(s):

Patrick D. Schloss ◽

Sarah L. Westcott ◽

Thomas Ryabin ◽

Justine R. Hall ◽

Martin Hartmann ◽

...

Keyword(s):

16S rRNA ◽

Software Package ◽

Sequence Data ◽

rRNA Gene ◽

Sequencing Data ◽

Laptop Computer ◽

Operational Taxonomic Units ◽

β Diversity ◽

Single Piece

ABSTRACT mothur aims to be a comprehensive software package that allows users to use a single piece of software to analyze community sequence data. It builds upon previous tools to provide a flexible and powerful software package for analyzing sequencing data. As a case study, we used mothur to trim, screen, and align sequences; calculate distances; assign sequences to operational taxonomic units; and describe the α and β diversity of eight marine samples previously characterized by pyrosequencing of 16S rRNA gene fragments. This analysis of more than 222,000 sequences was completed in less than 2 h with a laptop computer.

SHI7 Is a Self-Learning Pipeline for Multipurpose Short-Read DNA Quality Control

mSystems ◽

10.1128/msystems.00202-17 ◽

2018 ◽

Vol 3 (3) ◽

Cited By ~ 15

Author(s):

Gabriel A. Al-Ghalith ◽

Benjamin Hillmann ◽

Kaiwei Ang ◽

Robin Shields-Cutler ◽

Dan Knights

Keyword(s):

Quality Control ◽

DNA Sequences ◽

Sequence Data ◽

Background Knowledge ◽

Sequencing Technology ◽

Data Set ◽

Short Read ◽

DNA Quality ◽

Public Data ◽

User Friendly

ABSTRACT Next-generation sequencing technology is of great importance for many biological disciplines; however, due to technical and biological limitations, the short DNA sequences produced by modern sequencers require numerous quality control (QC) measures to reduce errors, remove technical contaminants, or merge paired-end reads together into longer or higher-quality contigs. Many tools for each step exist, but choosing the appropriate methods and usage parameters can be challenging because the parameterization of each step depends on the particularities of the sequencing technology used, the type of samples being analyzed, and the stochasticity of the instrumentation and sample preparation. Furthermore, end users may not know all of the relevant information about how their data were generated, such as the expected overlap for paired-end sequences or the type of adaptors used, to make informed choices. This increasing complexity and nuance demand a pipeline that combines existing steps together in a user-friendly way and, when possible, learns reasonable quality parameters from the data automatically. We propose a user-friendly quality control pipeline called SHI7 (canonically pronounced “shizen”), which aims to simplify quality control of short-read data for the end user by predicting presence and/or type of common sequencing adaptors, what quality scores to trim, whether the data set is shotgun or amplicon sequencing, whether reads are paired end or single end, and whether pairs are stitchable, including the expected amount of pair overlap. We hope that SHI7 will make it easier for all researchers, expert and novice alike, to follow reasonable practices for short-read data quality control.

IMPORTANCE Quality control of high-throughput DNA sequencing data is an important but sometimes laborious task requiring background knowledge of the sequencing protocol used (such as adaptor type, sequencing technology, insert size/stitchability, paired-endedness, etc.). Quality control protocols typically require applying this background knowledge to selecting and executing numerous quality control steps with the appropriate parameters, which is especially difficult when working with public data or data from collaborators who use different protocols. We have created a streamlined quality control pipeline intended to substantially simplify the process of DNA quality control from raw machine output files to actionable sequence data. In contrast to other methods, our proposed pipeline is easy to install and use and attempts to learn the necessary parameters from the data automatically with a single command.

Characterising Foot-and-Mouth Disease Virus in Clinical Samples Using Nanopore Sequencing

Frontiers in Veterinary Science ◽

10.3389/fvets.2021.656256 ◽

2021 ◽

Vol 8 ◽

Author(s):

Emma Brown ◽

Graham Freimanis ◽

Andrew E. Shaw ◽

Daniel L. Horton ◽

Simon Gubbins ◽

...

Keyword(s):

Cell Culture ◽

Sequence Data ◽

Foot And Mouth Disease ◽

Mouth Disease ◽

Cell Culture Supernatant ◽

Strain Identification ◽

Consensus Sequences ◽

FMDV Serotypes ◽

Mouth Disease Virus ◽

Foot And Mouth

The sequencing of viral genomes provides important data for the prevention and control of foot-and-mouth disease (FMD) outbreaks. Sequence data can be used for strain identification, outbreak tracing, and aiding the selection of the most appropriate vaccine for the circulating strains. At present, sequencing of FMD virus (FMDV) relies upon the time-consuming transport of samples to well-resourced laboratories. The Oxford Nanopore Technologies MinION portable sequencer has the potential to allow sequencing in remote, decentralised laboratories closer to the outbreak location. In this study, we investigated the utility of the MinION to generate sequence data of sufficient quantity and quality for the characterisation of FMDV serotypes O, A and Asia 1. Prior to sequencing, a universal two-step RT-PCR was used to amplify parts of the 5′UTR, as well as the leader, capsid and parts of the 2A encoding regions of FMDV RNA extracted from three sample matrices: cell culture supernatant, tongue epithelial suspension and oral swabs. The resulting consensus sequences were compared with reference sequences generated on the Illumina MiSeq platform. Consensus sequences with an accuracy of 100% were achieved within 10 and 30 min from the start of the sequencing run when using RNA extracted from cell culture supernatants and tongue epithelial suspensions, respectively. In contrast, sequencing from swabs required up to 2.5 h. Together, these results demonstrate that the MinION sequencer can be used to accurately and rapidly characterise serotypes A, O and Asia 1 of FMDV using amplicons amplified from a variety of different sample matrices.

Characterization of an unusually conserved AluI highly reiterated DNA sequence family from the honeybee, Apis mellifera.

Genetics ◽

10.1093/genetics/134.4.1195 ◽

1993 ◽

Vol 134 (4) ◽

pp. 1195-1204

Author(s):

S Tarès ◽

J M Cornuet ◽

P Abad

Keyword(s):

Apis Mellifera ◽

DNA Sequence ◽

DNA Sequences ◽

Sequence Data ◽

Sequence Divergence ◽

Repeated Sequence ◽

Consensus Sequences ◽

DNA Sequence Data ◽

Repeat Class ◽

Honeybee Subspecies

Abstract An AluI family of highly reiterated nontranscribed sequences has been found in the genome of the honeybee Apis mellifera. This repeated sequence is shown to be present at approximately 23,000 copies per haploid genome, constituting about 2% of the total genomic DNA. The nucleotide sequence of 10 monomers was determined. The consensus sequence is 176 nucleotides long and has an A + T content of 58%. There are clusters of both direct and inverted repeats. Internal subrepeating units ranging from 11 to 17 nucleotides are observed, suggesting that it could have evolved from a shorter sequence. DNA sequence data reveal that this repeat class is unusually homogeneous compared to other classes of invertebrate highly reiterated DNA sequences. The average pairwise sequence divergence between the repeats is 2.5%. In spite of this unusual homogeneity, divergence has been found in the repeated sequence hybridization ladder between four different honeybee subspecies. Therefore, the AluI highly reiterated sequences provide a new probe for fingerprinting in A. m. mellifera.
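
The copy number, monomer length and genome fraction quoted in this abstract can be cross-checked with simple arithmetic. The Python sketch below only uses the three figures given above; the implied haploid genome size is a back-of-the-envelope consequence of those approximate values, not a number reported in the abstract.

```python
# Figures quoted in the abstract (copy number and percentage are approximate).
copies_per_haploid_genome = 23_000
monomer_length_nt = 176
genome_percent = 2  # "about 2% of the total genomic DNA"

# Total DNA occupied by the repeat family.
repeat_total_nt = copies_per_haploid_genome * monomer_length_nt

# Haploid genome size implied if that total is 2% of the genome.
implied_genome_nt = repeat_total_nt * 100 // genome_percent

print(f"{repeat_total_nt:,}")    # 4,048,000
print(f"{implied_genome_nt:,}")  # 202,400,000
```

So the repeat family spans roughly 4 Mb, which is consistent with the stated 2% only if the haploid genome is on the order of 200 Mb.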

Gattaca: Base pair resolution mutation tracking for somatic evolution studies using agent-based models

10.1101/2021.11.08.467784 ◽

2021 ◽

Author(s):

Ryan O Schenck ◽

Gabriel Brosula ◽

Jeffrey West ◽

Simon Leedham ◽

Darryl Shibata ◽

...

Keyword(s):

Base Pair ◽

In Silico ◽

Sequence Data ◽

Agent Based Modeling ◽

Sequence Coverage ◽

Agent Based ◽

Coverage Error ◽

Somatic Evolution ◽

User Friendly ◽

Mutation Spectra

Gattaca provides the first base-pair resolution artificial genomes for tracking somatic mutations within agent-based modeling. Through the incorporation of human reference genomes, mutational context, and sequence coverage/error information, Gattaca is able to realistically provide sequence data for in-silico evolution studies that are comparable with data from human somatic evolution studies. This user-friendly method, incorporated into each in-silico cell, allows us to fully capture somatic mutation spectra and evolution.

Stability of SARS-CoV-2 Phylogenies

10.1101/2020.06.08.141127 ◽

2020 ◽

Cited By ~ 3

Author(s):

Yatish Turakhia ◽

Bryan Thornlow ◽

Landen Gozashti ◽

Angie S. Hinrichs ◽

Jason D. Fernandes ◽

...

Keyword(s):

Binding Sites ◽

Sequence Data ◽

Scientific Discovery ◽

Lineage Tracing ◽

Protein Coding ◽

Sequencing Errors ◽

Scientific Inference ◽

Recurrent Mutations ◽

Sequence Quality ◽

Essential Sequence

Abstract The SARS-CoV-2 pandemic has led to unprecedented, nearly real-time genetic tracing due to the rapid community sequencing response. Researchers immediately leveraged these data to infer the evolutionary relationships among viral samples and to study key biological questions, including whether host viral genome editing and recombination are features of SARS-CoV-2 evolution. This global sequencing effort is inherently decentralized and must rely on data collected by many labs using a wide variety of molecular and bioinformatic techniques. There is thus a strong possibility that systematic errors associated with lab-specific practices affect some sequences in the repositories. We find that some recurrent mutations in reported SARS-CoV-2 genome sequences have been observed predominantly or exclusively by single labs, co-localize with commonly used primer binding sites and are more likely to affect the protein coding sequences than other similarly recurrent mutations. We show that their inclusion can affect phylogenetic inference on scales relevant to local lineage tracing, and make it appear as though there has been an excess of recurrent mutation and/or recombination among viral lineages. We suggest how samples can be screened and problematic mutations removed. We also develop tools for comparing and visualizing differences among phylogenies and we show that consistent clade- and tree-based comparisons can be made between phylogenies produced by different groups. These will facilitate evolutionary inferences and comparisons among phylogenies produced for a wide array of purposes. Building on the SARS-CoV-2 Genome Browser at UCSC, we present a toolkit to compare, analyze and combine SARS-CoV-2 phylogenies, find and remove potential sequencing errors and establish a widely shared, stable clade structure for a more accurate scientific inference and discourse.

Foreword We wish to thank all groups that responded rapidly by producing these invaluable and essential sequence data. Their contributions have enabled an unprecedented, lightning-fast process of scientific discovery, truly an incredible benefit for humanity and for the scientific community. We emphasize that most lab groups with whom we associate specific suspicious alleles are also those who have produced the most sequence data at a time when it was urgently needed. We commend their efforts. We have already contacted each group and many have updated their sequences. Our goal with this work is not to highlight potential errors, but to understand the impacts of these and other kinds of highly recurrent mutations so as to identify commonalities among the suspicious examples that can improve sequence quality and analysis going forward.
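
The screening idea in the SARS-CoV-2 abstract, flagging mutations reported predominantly or exclusively by a single lab, can be sketched as follows. This is a rough illustration and not the authors' toolkit: it counts per-sample observations as a crude proxy for recurrence on a tree, and the mutation names, lab names and both thresholds are hypothetical.

```python
from collections import Counter, defaultdict


def flag_lab_specific(observations, min_count=3, min_lab_share=0.8):
    """Return {mutation: lab} for mutations seen at least `min_count` times
    where one lab contributes at least `min_lab_share` of the observations."""
    by_mutation = defaultdict(Counter)
    for mutation, lab in observations:
        by_mutation[mutation][lab] += 1

    flagged = {}
    for mutation, labs in by_mutation.items():
        total = sum(labs.values())
        top_lab, top_count = labs.most_common(1)[0]
        if total >= min_count and top_count / total >= min_lab_share:
            flagged[mutation] = top_lab
    return flagged


# Hypothetical (mutation, submitting lab) pairs, one per sample.
observations = [
    ("C100T", "lab_A"), ("C100T", "lab_A"), ("C100T", "lab_A"), ("C100T", "lab_A"),
    ("G200A", "lab_A"), ("G200A", "lab_B"), ("G200A", "lab_C"), ("G200A", "lab_D"),
    ("T300C", "lab_B"),
]
flagged = flag_lab_specific(observations)
print(flagged)
```

A flagged mutation is only a candidate for closer inspection, for example checking whether it falls near a primer binding site as the abstract describes.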