Long-read DNA metabarcoding of ribosomal RNA in the analysis of fungi from aquatic environments

2018 ◽  
Author(s):  
Felix Heeger ◽  
Elizabeth C. Bourne ◽  
Christiane Baschien ◽  
Andrey Yurkov ◽  
Boyke Bunk ◽  
...  

ABSTRACT DNA metabarcoding is now widely used to study prokaryotic and eukaryotic microbial diversity. Technological constraints have limited most studies to marker lengths of ca. 300-600 bp. Longer sequencing reads of several thousand bp are now possible with third-generation sequencing. The increased marker lengths provide greater taxonomic resolution and enable the use of phylogenetic methods of classification, but longer reads may be subject to higher rates of sequencing error and chimera formation. In addition, most well-established bioinformatics tools for DNA metabarcoding were originally designed for short reads and are therefore not suitable. Here we used Pacific Biosciences circular consensus sequencing (CCS) to DNA-metabarcode environmental samples using a ca. 4,500 bp marker that included most of the eukaryote ribosomal SSU and LSU rRNA genes and the ITS spacer region. We developed a long-read analysis pipeline that reduced error rates to levels comparable to short-read platforms. Validation using fungal isolates and a mock community indicated that our pipeline detected 98% of chimeras de novo, i.e., even in the absence of reference sequences. We recovered 947 OTUs from water and sediment samples in a natural lake, 848 of which could be classified to phylum, 486 to family, 397 to genus and 330 to species. By allowing for the simultaneous use of three global databases (Unite, SILVA, RDP LSU), long-read DNA metabarcoding provided better taxonomic resolution than any single marker. We foresee the use of long reads enabling the cross-validation of reference sequences and the synthesis of ribosomal RNA gene databases. The universal nature of the rRNA operon and our recovery of >100 non-fungal OTUs indicate that long-read DNA metabarcoding holds promise for the study of eukaryotic diversity more broadly.
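The de novo chimera detection mentioned above can be illustrated with a toy version of the common two-parent test: a read is suspect when its two halves each match a different, more abundant "parent" sequence better than any single parent matches the whole read. This sketch is not the authors' pipeline; the function names, the simple half-split, and the `margin` parameter are illustrative assumptions.

```python
def identity(a: str, b: str) -> float:
    """Fraction of matching positions between two equal-length sequences."""
    return sum(x == y for x, y in zip(a, b)) / len(a)

def looks_chimeric(read: str, parents: list[str], margin: float = 0.05) -> bool:
    """Flag a read as a putative chimera when its left and right halves each
    match a *different* parent noticeably better than any single parent
    matches the full read (toy UCHIME-style heuristic, gapless alignment)."""
    mid = len(read) // 2
    best_full = max(identity(read, p) for p in parents)
    left_best = max(parents, key=lambda p: identity(read[:mid], p[:mid]))
    right_best = max(parents, key=lambda p: identity(read[mid:], p[mid:]))
    if left_best is right_best:
        return False  # both halves prefer the same parent: not a hybrid
    hybrid = (identity(read[:mid], left_best[:mid]) +
              identity(read[mid:], right_best[mid:])) / 2
    return hybrid > best_full + margin
```

Real tools additionally weight by parent abundance and use banded alignments, but the hybrid-beats-single-parent comparison is the core of reference-free detection.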

2021 ◽  
Author(s):  
Farnaz Fouladi ◽  
Jacqueline B Young ◽  
Anthony A Fodor

16S rRNA gene sequencing is a common and cost-effective technique for characterization of microbial communities. Recent bioinformatics methods enable high-resolution detection of sequence variants of only one nucleotide difference. In this manuscript, we utilize a very fast HashMap-based approach to detect sequence variants in six publicly available 16S rRNA gene datasets. We then use the normal distribution combined with LOESS regression to estimate background error rates as a function of sequencing depth for individual clusters of sequences. This method is computationally efficient and produces inference that yields sets of variants that are conservative and well supported by reference databases. We argue that this approach to inference is fast, simple, scalable to large datasets, and provides a high-resolution set of sequence variants which are less likely to be the result of sequencing error.
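The hash-map counting step described above can be sketched with a plain Python dict. The background threshold below is a crude single normal approximation, not the per-cluster LOESS fit the authors describe; the function name and the `err_rate` default are illustrative assumptions.

```python
from collections import Counter

def call_variants(reads: list[str], err_rate: float = 0.005,
                  z: float = 3.0) -> dict[str, int]:
    """Count exact sequence variants with a hash map, then keep variants
    whose abundance exceeds a normal-approximation error background:
    with per-read error probability p and depth n, error counts are
    roughly N(np, sqrt(np(1-p))); calls must exceed mean + z*sd.
    (Illustrative cutoff only; the paper models the background per
    cluster as a function of depth using LOESS regression.)"""
    counts = Counter(reads)          # HashMap: sequence -> abundance
    n = len(reads)
    mu = n * err_rate
    sd = (n * err_rate * (1 - err_rate)) ** 0.5
    cutoff = mu + z * sd
    return {seq: c for seq, c in counts.items() if c > cutoff}
```

Because the counting is a single pass over the reads with O(1) updates, this scales linearly with dataset size, which is the efficiency argument made in the abstract.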


Microbiome ◽  
2021 ◽  
Vol 9 (1) ◽  
Author(s):  
Yusuke Okazaki ◽  
Shohei Fujinaga ◽  
Michaela M. Salcher ◽  
Cristiana Callieri ◽  
Atsushi Tanaka ◽  
...  

Abstract
Background: Freshwater ecosystems are inhabited by members of cosmopolitan bacterioplankton lineages despite the disconnected nature of these habitats. The lineages are delineated based on >97% 16S rRNA gene sequence similarity, but their intra-lineage microdiversity and phylogeography, which are key to understanding the eco-evolutionary processes behind their ubiquity, remain unresolved. Here, we applied long-read amplicon sequencing targeting nearly full-length 16S rRNA genes and the adjacent ribosomal internal transcribed spacer sequences to reveal the intra-lineage diversities of pelagic bacterioplankton assemblages in 11 deep freshwater lakes in Japan and Europe.
Results: Our single-nucleotide-resolved analysis, which was validated using shotgun metagenomic sequencing, uncovered 7-101 amplicon sequence variants for each of the 11 predominant bacterial lineages and demonstrated sympatric, allopatric, and temporal microdiversities that could not be resolved through conventional approaches. Clusters of samples with similar intra-lineage population compositions were identified, which consistently supported genetic isolation between Japan and Europe. At a regional scale (up to hundreds of kilometers), dispersal between lakes was unlikely to be a limiting factor, and environmental factors or genetic drift were potential determinants of population composition. The extent of microdiversification varied among lineages, suggesting that highly diversified lineages (e.g., Iluma-A2 and acI-A1) achieve their ubiquity by containing a consortium of genotypes specific to each habitat, while less diversified lineages (e.g., CL500-11) may be ubiquitous due to a small number of widespread genotypes. The lowest extent of intra-lineage diversification was observed in the dominant hypolimnion-specific lineage (CL500-11), suggesting that its dispersal among lakes is not limited despite the hypolimnion being a more isolated habitat than the epilimnion.
Conclusions: Our novel approach complemented the limited resolution of short-read amplicon sequencing and the limited sensitivity of the metagenome assembly-based approach, and highlighted the complex ecological processes underlying the ubiquity of freshwater bacterioplankton lineages. To fully exploit the performance of the method, its relatively low read throughput is the major bottleneck to be overcome.


1989 ◽  
Vol 9 (10) ◽  
pp. 4416-4421
Author(s):  
W S Grayburn ◽  
E U Selker

5S rRNA genes of Neurospora crassa are generally dispersed in the genome and are unmethylated. The xi-eta region of Oak Ridge strains represents an informative exception. Most of the cytosines in this region, which consists of a diverged tandem duplication of a 0.8-kilobase-pair segment including a 5S rRNA gene, appear to be methylated (E. U. Selker and J. N. Stevens, Proc. Natl. Acad. Sci. USA 82:8114-8118, 1985). Previous work demonstrated that the xi-eta region functions as a portable signal for de novo DNA methylation (E. U. Selker and J. N. Stevens, Mol. Cell. Biol. 7:1032-1038, 1987; E. U. Selker, B. C. Jensen, and G. A. Richardson, Science 238:48-53, 1987). To identify the structural basis of this property, we have isolated and characterized an unmethylated allele of the xi-eta region from N. crassa Abbott 4. The Abbott 4 allele includes a single 5S rRNA gene, theta, which is different from all previously identified Neurospora 5S rRNA genes. Sequence analysis suggests that the xi-eta region arose from the theta region by duplication of a 794-base-pair segment followed by 267 G:C to A:T mutations in the duplicated DNA. The distribution of these mutations is not random. We propose that the RIP process of N. crassa (E. U. Selker, E. B. Cambareri, B. C. Jensen, and K. R. Haack, Cell 51:741-752, 1987; E. U. Selker and P. W. Garrett, Proc. Natl. Acad. Sci. USA 85:6870-6874, 1988; E. B. Cambareri, B. C. Jensen, E. Schabtach, and E. U. Selker, Science 244:1571-1575, 1989) is responsible for the numerous transition mutations and DNA methylation in the xi-eta region. A long homopurine-homopyrimidine stretch immediately following the duplicated segment is 9 base pairs longer in the Oak Ridge allele than in the Abbott 4 allele. Triplex DNA, known to occur in homopurine-homopyrimidine sequences, may have mediated the tandem duplication.
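The RIP signature described above, an excess of C-to-T and G-to-A transitions between duplicated copies, can be tallied with a short helper. This is an illustrative sketch over a gapless toy alignment, not the authors' analysis; the function name is hypothetical.

```python
# G:C -> A:T transitions appear as C->T on one strand or G->A on the other.
RIP_PAIRS = {("C", "T"), ("G", "A")}

def count_rip_transitions(ancestral: str, mutated: str) -> int:
    """Count positions where the duplicated copy differs from the
    ancestral copy by a C->T or G->A change, the hallmark of RIP.
    Assumes the two copies are pre-aligned and of equal length."""
    return sum((a, b) in RIP_PAIRS
               for a, b in zip(ancestral.upper(), mutated.upper()))
```

Applied to the theta segment and each arm of the xi-eta duplication, a count of this kind is how the 267 transition mutations cited above would be enumerated.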


Author(s):  
David Porubsky ◽  
Peter Ebert ◽  
Peter A. Audano ◽  
Mitchell R. Vollger ◽  
...  

Abstract Human genomes are typically assembled as consensus sequences that lack information on parental haplotypes. Here we describe a reference-free workflow for diploid de novo genome assembly that combines the chromosome-wide phasing and scaffolding capabilities of single-cell strand sequencing [1,2] with continuous long-read or high-fidelity [3] sequencing data. Employing this strategy, we produced a completely phased de novo genome assembly for each haplotype of an individual of Puerto Rican descent (HG00733) in the absence of parental data. The assemblies are accurate (quality value > 40) and highly contiguous (contig N50 > 23 Mbp) with low switch error rates (0.17%), providing fully phased single-nucleotide variants, indels and structural variants. A comparison of Oxford Nanopore Technologies and Pacific Biosciences phased assemblies identified 154 regions that are preferential sites of contig breaks, irrespective of sequencing technology or phasing algorithms.
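Contig N50, the contiguity metric quoted above (and again in several of the assembly abstracts below), has a simple definition worth making concrete: the length L such that contigs of length >= L cover at least half of the total assembly. This is the standard computation, not code from the paper.

```python
def n50(contig_lengths: list[int]) -> int:
    """Return the contig N50: sort contigs longest-first and walk down
    until the running total reaches half of the assembly length; the
    contig length at that point is the N50."""
    total = sum(contig_lengths)
    running = 0
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    return 0  # empty assembly
```

For example, an assembly of contigs of 40, 30, 20 and 10 units has N50 = 30, since the 40- and 30-unit contigs together already cover half of the 100-unit total.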


Author(s):  
Sampath Perumal ◽  
Chu Shin Koh ◽  
Lingling Jin ◽  
Miles Buchwaldt ◽  
Erin Higgins ◽  
...  

Abstract High-quality nanopore genome assemblies were generated for two genotypes (Ni100 and CN115125) of Brassica nigra, a member of the agronomically important Brassica genus. The N50 contig lengths for the two assemblies were 17.1 Mb (58 contigs) and 0.29 Mb (963 contigs), respectively, reflecting recent improvements in the technology. Comparison with a de novo short-read assembly for Ni100 corroborated genome integrity and quantified sequence-related error rates (0.002%). The contiguity and coverage allowed unprecedented access to low-complexity regions of the genome. Analysis of pericentromeric regions and coincident hypo-methylation enabled localization of active centromeres and identified a novel centromere-associated ALE class I element which appears to have proliferated through relatively recent nested transposition events (<1 million years ago). Computational abstraction was used to define a post-triplication Brassica-specific ancestral genome and to calculate the extensive rearrangements that define the genomic distance separating B. nigra from its diploid relatives.


2021 ◽  
Author(s):  
Barış Ekim ◽  
Bonnie Berger ◽  
Rayan Chikhi

DNA sequencing data continues to progress towards longer reads with increasingly lower sequencing error rates. We focus on the problem of assembling such reads into genomes, which poses challenges in terms of accuracy and computational resources when using cutting-edge assembly approaches, e.g. those based on overlapping reads using minimizer sketches. Here, we introduce the concept of minimizer-space sequencing data analysis, where minimizers, rather than DNA nucleotides, are the atomic tokens of the alphabet. By projecting DNA sequences into ordered lists of minimizers, our key idea is to enumerate what we call k-min-mers: k-mers over a larger alphabet consisting of minimizer tokens. Our approach, mdBG or minimizer-dBG, achieves orders-of-magnitude improvements in both speed and memory usage over existing methods without much loss of accuracy. We demonstrate three use cases of mdBG: human genome assembly, metagenome assembly, and the representation of large pangenomes. For assembly, we implemented mdBG in software we call rust-mdbg, resulting in ultra-fast, low-memory and highly contiguous assembly of PacBio HiFi reads. A human genome is assembled in under 10 minutes using 8 cores and 10 GB RAM, and 60 Gbp of metagenome reads are assembled in 4 minutes using 1 GB RAM. For pangenome graphs, we represent, for the first time, a collection of 661,405 bacterial genomes as an mdBG and successfully search it (in minimizer-space) for anti-microbial resistance (AMR) genes. We expect our advances to be essential to sequence analysis, given the rise of long-read sequencing in genomics, metagenomics and pangenomics.
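The minimizer-space idea can be sketched in a few lines: select window minimizers along a sequence, then read off k-min-mers as k-mers over that minimizer alphabet. This is a deliberately simplified illustration (lexicographic window minimizers with consecutive duplicates collapsed); mdBG itself uses hashed, density-based universe minimizers, and the function names here are not from the paper.

```python
def minimizers(seq: str, w: int, m: int) -> list[str]:
    """Window minimizers: the lexicographically smallest m-mer in each
    window of w consecutive m-mers, collapsing repeats from overlapping
    windows (a toy stand-in for hashed minimizer selection)."""
    out = []
    for i in range(len(seq) - w - m + 2):
        window = [seq[j:j + m] for j in range(i, i + w)]
        smallest = min(window)
        if not out or smallest != out[-1]:
            out.append(smallest)
    return out

def k_min_mers(seq: str, w: int, m: int, k: int) -> list[tuple[str, ...]]:
    """k-min-mers: every run of k consecutive minimizers, i.e. k-mers
    over the minimizer alphabet. These are the nodes of the mdBG."""
    ms = minimizers(seq, w, m)
    return [tuple(ms[i:i + k]) for i in range(len(ms) - k + 1)]
```

Because each minimizer typically stands in for many nucleotides, the token stream (and hence the de Bruijn graph built over it) is far shorter than the input, which is where the speed and memory gains come from.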


2018 ◽  
Author(s):  
Aimee L van der Reis ◽  
Olivier Laroche ◽  
Andrew G Jeffs ◽  
Shane D Lavery

Deep-sea lobsters are highly valued as seafood and provide the basis of important commercial fisheries in many parts of the world. Despite their economic significance, relatively little is known about their natural diets. Microscopic analyses of foregut content in some species have suffered from low taxonomic resolution, with many of the dietary items difficult to reliably identify because their tissue is easily digested. DNA metabarcoding has the potential to provide greater taxonomic resolution of the diet of the New Zealand scampi (Metanephrops challengeri) through the identification of gut contents, but a number of methodological concerns need to be overcome first to ensure optimal DNA metabarcoding results. In this study, a range of methodological parameters were tested to determine the optimal protocols for DNA metabarcoding and provide a first view of the M. challengeri diet. Several PCR protocols were tested, using two universal primer pairs targeting the 18S rRNA and COI genes, on DNA extracted from both frozen and ethanol-preserved samples of both foregut and hindgut digesta. The selection of appropriate DNA polymerases, buffers and methods for reducing PCR inhibitors (including the use of BSA) was found to be critical. Amplification from frozen and ethanol-preserved gut contents appeared similarly dependable, but metabarcoding outcomes indicated that the ethanol-preserved samples produced better results for the COI gene. The COI gene was found to be more effective than the 18S rRNA gene for identifying large eukaryotic taxa from the digesta; however, it was less successfully amplified. The 18S rRNA gene was more easily amplified but identified mostly smaller marine organisms such as plankton and parasites. This preliminary analysis of the M. challengeri diet identified a range of species (13,541 reads identified as diet), including the ghost shark (Hydrolagus novaezealandiae), silver warehou (Seriolella punctata), tall sea pen (Funiculina quadrangularis) and the salp (Ihlea racovitza), suggesting that this species has a varied diet with a high reliance on scavenging a diverse range of pelagic and benthic species from the seafloor.


2015 ◽  
Author(s):  
Sarah L Westcott ◽  
Patrick Schloss

Background. 16S rRNA gene sequences are routinely assigned to operational taxonomic units (OTUs) that are then used to analyze complex microbial communities. A number of methods have been employed to carry out the assignment of 16S rRNA gene sequences to OTUs, leading to confusion over which method is optimal. A recent study suggested that a clustering method should be selected based on its ability to generate stable OTU assignments that do not change as additional sequences are added to the dataset. In contrast, we contend that the quality of the OTU assignments, i.e. the ability of the method to properly represent the distances between the sequences, is more important.
Methods. Our analysis implemented six de novo clustering algorithms (single linkage, complete linkage, average linkage, abundance-based greedy clustering, distance-based greedy clustering, and Swarm) as well as the open- and closed-reference methods. Using two previously published datasets, we used the Matthews correlation coefficient (MCC) to assess the stability and quality of OTU assignments.
Results. The stability of OTU assignments did not reflect the quality of the assignments. Depending on the dataset being analyzed, the average linkage and the distance- and abundance-based greedy clustering methods generated OTUs that were more likely to represent the actual distances between sequences than the open- and closed-reference methods. We also demonstrated that for the greedy algorithms VSEARCH produced assignments comparable to those produced by USEARCH, making VSEARCH a viable free and open-source alternative to USEARCH. Further interrogation of the reference-based methods indicated that when USEARCH or VSEARCH was used to identify the closest reference, the OTU assignments were sensitive to the order of the reference sequences because the reference sequences can be identical over the region being considered. More troubling was the observation that while both USEARCH and VSEARCH have a high level of sensitivity in detecting reference sequences, the specificity of those matches was poor relative to the true best match.
Discussion. Our analysis calls into question the quality and stability of OTU assignments generated by the open- and closed-reference methods as implemented in the current version of QIIME. This study demonstrates that de novo methods are the optimal approach for assigning sequences to OTUs and that the quality of these assignments needs to be assessed for multiple methods to identify the optimal clustering method for a particular dataset.
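The pair-counting MCC used above to score OTU quality can be made concrete: every pair of sequences is a "positive" if its distance is within the cutoff and a "prediction" if both sequences landed in the same OTU. This is an illustrative reimplementation of the standard definition, not the authors' code; the `distances` and `otus` input structures are assumed toy representations.

```python
import math
import itertools

def otu_mcc(distances: dict[frozenset, float], otus: dict[str, int],
            cutoff: float = 0.03) -> float:
    """Matthews correlation coefficient over all sequence pairs:
    TP = close pair clustered together, FN = close pair split,
    FP = distant pair clustered together, TN = distant pair split."""
    tp = tn = fp = fn = 0
    for a, b in itertools.combinations(sorted(otus), 2):
        close = distances[frozenset((a, b))] <= cutoff
        together = otus[a] == otus[b]
        if close and together:
            tp += 1
        elif close:
            fn += 1
        elif together:
            fp += 1
        else:
            tn += 1
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0
```

An MCC of 1.0 means the clustering perfectly mirrors the pairwise distances at the chosen cutoff; values near 0 mean the OTU boundaries are essentially uncorrelated with sequence similarity.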


2021 ◽  
Author(s):  
Yuta Kinoshita ◽  
Hidekazu Niwa ◽  
Eri Uchida-Fujii ◽  
Toshio Nukada

Abstract Microbial communities are commonly studied using amplicon sequencing of part of the 16S rRNA gene. Sequencing the full-length 16S rRNA gene can provide higher taxonomic resolution and accuracy. To obtain even higher taxonomic resolution with as few false positives as possible, we assessed a method using long-amplicon sequencing targeting the rRNA operon combined with the CCMetagen pipeline. Taxonomic assignment had >90% accuracy at the species level in a mock sample and at the family level in equine fecal samples, generating taxonomic compositions similar to those obtained by shotgun sequencing. The rRNA operon amplicon sequencing of equine fecal samples underestimated the compositional percentages of bacterial strains containing unlinked rRNA genes by a third to almost a half, but unlinked rRNA genes had a limited effect on the overall results. The rRNA operon amplicon sequencing with the A519F + U2428R primer set was able to detect archaea, whereas full-length 16S rRNA sequencing with 27F + 1492R could not. Therefore, we conclude that amplicon sequencing targeting the rRNA operon captures more detailed variation in bacterial and archaeal microbiota.


2018 ◽  
Author(s):  
Giulia Guidi ◽  
Marquita Ellis ◽  
Daniel Rokhsar ◽  
Katherine Yelick ◽  
Aydın Buluç

Abstract Recent advances in long-read sequencing enable the characterization of genome structure and its intra- and inter-species variation at a resolution that was previously impossible. Detecting overlaps between reads is integral to many long-read genomics pipelines, such as de novo genome assembly. While longer reads simplify genome assembly and improve the contiguity of the reconstruction, current long-read technologies come with high error rates. We present the Berkeley Long-Read to Long-Read Aligner and Overlapper (BELLA), a novel algorithm for computing overlaps and alignments via sparse matrix-matrix multiplication that balances the goals of recall and precision, performing well on both. We present a probabilistic model that demonstrates the feasibility of using short k-mers for detecting candidate overlaps. We then introduce a notion of reliable k-mers based on this probabilistic model. Combining reliable k-mers with our binning mechanism eliminates both the k-mer set explosion that would otherwise occur with highly erroneous reads and the spurious overlaps arising from k-mers originating in repetitive regions. Finally, we present a new method based on Chernoff bounds for separating true overlaps from false positives using a combination of alignment techniques and probabilistic modeling. On both real and synthetic data, BELLA performs among the best in terms of F1 score, showing a performance stability that is often missing in competing software; its F1 score is consistently within 1.7% of the top entry. Notably, we show improved de novo assembly results on synthetic data when coupling BELLA with the Miniasm assembler.
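The candidate-overlap step, conceptually the sparse product of the read-by-k-mer matrix with its own transpose, can be sketched with an inverted index: two reads become a candidate pair whenever they share a k-mer. The `max_occ` cutoff below is a crude stand-in for BELLA's reliable-k-mer filter against repeat-induced spurious overlaps; all names in this sketch are illustrative, not from the BELLA codebase.

```python
from collections import defaultdict
from itertools import combinations

def overlap_candidates(reads: dict[str, str], k: int,
                       max_occ: int = 10) -> set[frozenset]:
    """Candidate overlaps via shared k-mers. Building the k-mer -> reads
    index and pairing reads within each posting list is equivalent to
    forming the nonzeros of A * A^T, where A is the sparse read-by-k-mer
    matrix. k-mers seen in more than max_occ reads are dropped, a toy
    version of filtering out unreliable, repeat-derived k-mers."""
    index = defaultdict(set)
    for name, seq in reads.items():
        for i in range(len(seq) - k + 1):
            index[seq[i:i + k]].add(name)
    pairs = set()
    for names in index.values():
        if 1 < len(names) <= max_occ:
            for a, b in combinations(sorted(names), 2):
                pairs.add(frozenset((a, b)))
    return pairs
```

In the full algorithm each candidate pair would then be verified by alignment, with the Chernoff-bound criterion deciding whether the shared-k-mer evidence is strong enough to accept the overlap.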

