Rapid metagenomic workflow using annotated 16S RNA dataset

Mapping Intimacies ◽

10.37044/osf.io/gbt8p ◽

2021 ◽

Author(s):

Naoya Oec ◽

Hidemasa Bono

Keyword(s):

Data Analysis ◽

Life Science ◽

Metagenomic Data ◽

Use Case ◽

Sequencing Technology ◽

Metagenome Analysis ◽

Long Reads ◽

16S RNA ◽

Long Read ◽

Very High

Thanks to the dramatic progress in DNA sequencing technology, it is now possible to decipher sequences in a mixed state. Subsequent data analysis has therefore become important, and the demand for metagenomic analysis is very high. Existing metagenomic data analysis workflows for 16S amplicon sequences have mainly focused on sequences from short-read sequencers, and researchers cannot apply those workflows to sequences from long-read sequencers. A practical metagenome workflow for long-read sequencers is therefore needed. In a domestic version of the BioHackathon called BH21.8, held in Aomori, Japan (23-27 August 2021), we first discussed a reproducible workflow for metagenome analysis. We then designed a rapid metagenomic workflow using an annotated 16S RNA dataset (Ref16S), together with a practical use case for that workflow. Finally, we discussed how to maintain Ref16S and requested that the Life Science Database Archive at JST NBDC archive the dataset. After a stimulating discussion at BH21.8, we were able to clarify the current issues in metagenomic data analysis. We also successfully constructed a rapid workflow for those data, especially from long reads, using the newly constructed Ref16S.


Fast and sensitive mapping of error-prone nanopore sequencing reads with GraphMap

10.1101/020719 ◽

2015 ◽

Cited By ~ 1

Author(s):

Ivan Sovic ◽

Mile Sikic ◽

Andreas Wilm ◽

Shannon Nicole Fenlon ◽

Swaine Chen ◽

...

Keyword(s):

Human Genome ◽

Variant Calling ◽

Error Rates ◽

Nanopore Sequencing ◽

Structural Variants ◽

Specific Identification ◽

Long Reads ◽

Long Read ◽

Specific Error ◽

Very High

Exploiting the power of nanopore sequencing requires the development of new bioinformatics approaches to deal with its specific error characteristics. We present the first nanopore read mapper (GraphMap) that uses a read-funneling paradigm to robustly handle variable error rates and fast graph traversal to align long reads with speed and very high precision (>95%). Evaluation on MinION sequencing datasets against short and long-read mappers indicates that GraphMap increases mapping sensitivity by at least 15-80%. GraphMap alignments are the first to demonstrate consensus calling with <1 error in 100,000 bases, variant calling on the human genome with 76% improvement in sensitivity over the next best mapper (BWA-MEM), precise detection of structural variants from 100bp to 4kbp in length and species and strain-specific identification of pathogens using MinION reads. GraphMap is available open source under the MIT license at https://github.com/isovic/graphmap.


QAlign: Aligning nanopore reads accurately using current-level modeling

10.1101/862813 ◽

2019 ◽

Author(s):

Dhaivat Joshi ◽

Shunfu Mao ◽

Sreeram Kannan ◽

Suhas Diggavi

Keyword(s):

Reference Genome ◽

Genomic Analysis ◽

Vital Role ◽

High Error Rate ◽

Sequencing Technology ◽

Long Reads ◽

A Genome ◽

Long Read ◽

Nanopore Sequencer ◽

Sequencing Process

Motivation: Efficient and accurate alignment of DNA/RNA sequence reads to each other or to a reference genome/transcriptome is an important problem in genomic analysis. Nanopore sequencing has emerged as a major sequencing technology and many long-read aligners have been designed for aligning nanopore reads. However, the high error rate makes accurate and efficient alignment difficult. Utilizing the noise and error characteristics inherent in the sequencing process properly can play a vital role in constructing a robust aligner. In this paper, we design QAlign, a pre-processor that can be used with any long-read aligner for aligning long reads to a genome/transcriptome or to other long reads. The key idea in QAlign is to convert the nucleotide reads into discretized current levels that capture the error modes of the nanopore sequencer before running it through a sequence aligner. Results: We show that QAlign is able to improve alignment rates from around 80% up to 90% with nanopore reads when aligning to the genome. We also show that QAlign improves the average overlap quality by 9.2%, 2.5% and 10.8% in three real datasets for read-to-read alignment. Read-to-transcriptome alignment rates are improved from 51.6% to 75.4% and 82.6% to 90% in two real datasets. Availability: https://github.com/joshidhaivat/QAlign.git
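
The key idea in the abstract above, converting a nucleotide read into discretized current levels, can be illustrated with a toy sketch. Everything specific here is an assumption made for illustration: the k-mer length `K`, the `TOY_CURRENT` table (not a real pore model), the thresholds, and the function name `quantize`. It is not QAlign's actual implementation; see the linked repository for that.

```python
from itertools import product

K = 2  # toy k-mer length (assumption); real pore models use longer k-mers

# Hypothetical current table: deterministic toy values, NOT a real pore model.
BASE_WEIGHT = {"A": 0.0, "C": 1.0, "G": 2.0, "T": 3.0}
TOY_CURRENT = {
    "".join(kmer): sum(BASE_WEIGHT[b] * (i + 1) for i, b in enumerate(kmer))
    for kmer in product("ACGT", repeat=K)
}


def quantize(read, thresholds=(3.0, 6.0)):
    """Map each k-mer of `read` to a discrete level index.

    The level is the number of thresholds the k-mer's toy current value
    meets or exceeds, so two thresholds give levels 0, 1 and 2.
    """
    levels = []
    for i in range(len(read) - K + 1):
        current = TOY_CURRENT[read[i:i + K]]
        levels.append(sum(current >= t for t in thresholds))
    return levels


print(quantize("ACGT"))  # [0, 1, 2]
```

In this sketch the read and the reference would both be passed through `quantize`, and the resulting level sequences handed to an ordinary sequence aligner in place of the nucleotide strings.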


Assembling reads improves taxonomic classification of species

10.21203/rs.3.rs-22309/v1 ◽

2020 ◽

Author(s):

Quang Tran ◽

Vinhthuy Phan

Keyword(s):

Classification Performance ◽

Performance Characteristics ◽

Metagenomic Data ◽

Species Classification ◽

Short Read ◽

Short Reads ◽

Sequencing Errors ◽

Trade Offs ◽

Long Reads ◽

Long Read

Background: Most current metagenomic classifiers and profilers employ short reads to classify, bin and profile microbial genomes that are present in metagenomic samples. Many of these methods adopt techniques that aim to identify unique genomic regions of genomes so as to differentiate them. Because of this, short-read lengths might be suboptimal. Longer read lengths might improve the performance of classification and profiling. However, longer reads produced by current technology tend to have a higher rate of sequencing errors compared to short reads. It is not clear whether the trade-off between longer length and higher sequencing errors will increase or decrease classification and profiling performance. Results: We compared the performance of popular metagenomic classifiers on short reads and on longer reads assembled from the same short reads. When using a number of popular assemblers to assemble long reads from the short reads, we discovered that most classifiers made fewer predictions with longer reads and that they achieved higher classification performance on synthetic metagenomic data. Specifically, across most classifiers, we observed a significant increase in precision, while recall remained the same, resulting in higher overall classification performance. On real metagenomic data, we observed a similar trend: classifiers made fewer predictions. This suggests that, with longer reads, they might show the same performance characteristics of higher precision at unchanged recall. Conclusions: This finding has two main implications. First, it suggests that classifying species in metagenomic environments can be achieved with higher overall performance simply by assembling short reads. Second, it suggests that it might be a good idea to consider utilizing long-read technologies in species classification for metagenomic applications. Current long-read technologies tend to have higher sequencing errors and are more expensive compared to short-read technologies. The trade-offs between the pros and cons should be investigated.


A new method for long-read sequencing of animal mitochondrial genomes: application to the identification of equine mitochondrial DNA variants

10.1101/2019.12.20.884486 ◽

2019 ◽

Author(s):

Sophie Dhorne-Pollet ◽

Eric Barrey ◽

Nicolas Pollet

Keyword(s):

Mitochondrial DNA ◽

Nuclear DNA ◽

Multiple Displacement Amplification ◽

Mitochondrial Genomes ◽

Sequencing Technology ◽

Long Reads ◽

Selective Elimination ◽

Long Read ◽

Variant Analysis ◽

Mitochondrial DNA Variants

Background: We present here an approach to sequence whole mitochondrial genomes using nanopore long-read sequencing. Our method relies on the selective elimination of nuclear DNA using an exonuclease treatment and on the amplification of circular mitochondrial DNA using a multiple displacement amplification step. Results: We optimized each preparative step to obtain a 100 million-fold enrichment of horse mitochondrial DNA relative to nuclear DNA. We sequenced this amplified mitochondrial DNA using nanopore sequencing technology and obtained mitochondrial DNA reads that represented up to half of the sequencing output. The sequence reads had a mean length of 2.3 kb and provided even coverage of the mitochondrial genome. Long reads spanning half or more of the whole mtDNA provided a coverage that varied between 118X and 488X. Finally, we identified SNPs with a precision of 98.1%, a recall of 85.2% and an F1-score of 0.912. Conclusions: Our analyses show that our method to amplify mtDNA and to sequence it using the nanopore technology is usable for mitochondrial DNA variant analysis. With minor modifications, this approach could easily be applied to other large circular DNA molecules.
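
The three SNP-calling figures reported above are related by the standard definition of the F1-score as the harmonic mean of precision and recall. A minimal check, assuming that standard definition is the one used (the function name `f1_score` is chosen here for illustration):

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall, both given as fractions in [0, 1]."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Values reported in the abstract, converted from percentages to fractions.
print(round(f1_score(0.981, 0.852), 3))  # 0.912
```

The result agrees with the reported F1-score of 0.912.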


Long-reads are revolutionizing 20 years of insect genome sequencing

10.1101/2021.02.14.431146 ◽

2021 ◽

Author(s):

Scott Hotaling ◽

John S. Sproul ◽

Jacqueline Heckenhauer ◽

Ashlyn Powell ◽

Amanda M. Larracuente ◽

...

Keyword(s):

Drosophila Melanogaster ◽

Nuclear Genome ◽

Sequencing Technology ◽

Assembly Quality ◽

Insect Genome ◽

Long Reads ◽

Long Read ◽

Field Perspective ◽

Genome Assemblies ◽

The Impact

The first insect genome (Drosophila melanogaster) was published two decades ago. Today, nuclear genome assemblies are available for a staggering 601 different insects representing 20 orders. Here, we analyzed the best assembly for each insect and provide a “state of the field” perspective, emphasizing taxonomic representation, assembly quality, gene completeness, and sequencing technology. We show that while genomic efforts have been biased towards specific groups (e.g., Diptera), assemblies are generally contiguous with gene regions intact. Most notable, however, has been the impact of long-read sequencing; assemblies that incorporate long-reads are ∼48x more contiguous than those that do not.


Rapid multi-locus sequence typing direct from uncorrected long reads using Krocus

10.1101/259150 ◽

2018 ◽

Author(s):

Andrew J. Page ◽

Jacqueline A. Keane

Keyword(s):

Error Rates ◽

Multi Locus Sequence Typing ◽

Short Read Sequencing ◽

Sequencing Technologies ◽

Long Reads ◽

Oxford Nanopore ◽

Standard Tool ◽

Long Read ◽

Sequence Types ◽

Very High

Genome sequencing is rapidly being adopted in reference labs and hospitals for bacterial outbreak investigation and diagnostics where time is critical. Seven-gene multi-locus sequence typing is a standard tool for broadly classifying samples into sequence types, which in many cases allows a sample to be ruled in or out of an outbreak, or general characteristics of a bacterial strain to be inferred. Long read sequencing technologies, such as those from PacBio or Oxford Nanopore, can produce read data within minutes of an experiment starting, unlike short read sequencing technologies, which require many hours or days. However, the error rates of raw uncorrected long read data are very high. We present Krocus, which can predict a sequence type directly from uncorrected long reads, and which was designed to consume read data as it is produced, providing results in minutes. It is the only tool which can do this from uncorrected long reads. We tested Krocus on over 600 samples sequenced using long read sequencing technologies from PacBio and Oxford Nanopore. It provides sequence types on average within 90 seconds, with a sensitivity of 94% and specificity of 97%, directly from uncorrected raw sequence reads. The software is written in Python and is available under the open source license GNU GPL version 3.


Near-complete Lokiarchaeota genomes from complex environmental samples using long and short read metagenomic analyses

10.1101/2019.12.17.879148 ◽

2019 ◽

Cited By ~ 3

Author(s):

Eva F. Caceres ◽

William H. Lewis ◽

Felix Homa ◽

Tom Martin ◽

Andreas Schramm ◽

...

Keyword(s):

Large Scale ◽

Phylogenetic Analyses ◽

Metagenomic Data ◽

Endosomal Sorting ◽

Sequencing Technologies ◽

Long Reads ◽

Oxford Nanopore ◽

Complete Genomes ◽

Culture Independent ◽

Long Read

Asgard archaea is a recently proposed superphylum currently comprised of five recognised phyla: Lokiarchaeota, Thorarchaeota, Odinarchaeota, Heimdallarchaeota and Helarchaeota. Members of this group have been identified based on culture-independent approaches, with several metagenome-assembled genomes (MAGs) reconstructed to date. However, most of these genomes consist of several relatively small contigs, and, until recently, no complete Asgard archaea genome was available. Large scale phylogenetic analyses suggest that Asgard archaea represent the closest archaeal relatives of eukaryotes. In addition, members of this superphylum encode proteins that were originally thought to be specific to eukaryotes, including components of the trafficking machinery, cytoskeleton and endosomal sorting complexes required for transport (ESCRT). Yet, these findings have been questioned on the basis that the genome sequences that underpin them were assembled from metagenomic data, and could have been subject to contamination and other assembly artefacts. Even though several lines of evidence indicate that the previously reported findings were not affected by these issues, access to high-quality and preferably fully closed Asgard archaea genomes is needed to definitively close this debate. Current long-read sequencing technologies such as Oxford Nanopore allow the generation of long reads in a high-throughput manner, making them suitable for use in metagenomics. Although the use of long reads is still limited in this field, recent analyses have shown that it is feasible to obtain complete or near-complete genomes of abundant members of mock communities and metagenomes of various levels of complexity. Here, we show that long read metagenomics can be successfully applied to obtain near-complete genomes of low-abundance members of complex communities from sediment samples. We were able to reconstruct six MAGs from different Lokiarchaeota lineages that show high completeness and low fragmentation, with one of them being a near-complete genome consisting of only three contigs. Our analyses confirm that the eukaryote-like features previously associated with Lokiarchaeota are not the result of contamination or assembly artefacts, and can indeed be found in the newly reconstructed genomes.


Long-Read Sequencing of the Zebrafish Genome Reorganizes Genomic Architecture

10.1101/2021.08.27.457855 ◽

2021 ◽

Author(s):

Yelena Chernyavskaya ◽

Xiaofei Zhang ◽

Jinze Liu ◽

Jessica S. Blackburn

Keyword(s):

Low Complexity ◽

Zebrafish Genome ◽

Nanopore Sequencing ◽

Sequencing Technology ◽

Short Read ◽

Short Read Sequencing ◽

Genomic Landscape ◽

Long Reads ◽

Long Read ◽

Sequencing Platforms

Nanopore sequencing technology has revolutionized the field of genome biology with its ability to generate extra-long reads that can resolve regions of the genome that were previously inaccessible to short-read sequencing platforms. Although long-read sequencing has been used to resolve several vertebrate genomes, a nanopore-based zebrafish assembly has not yet been released. Over 50% of the zebrafish genome consists of difficult to map, highly repetitive, low complexity elements that pose inherent problems for short-read sequencers and assemblers. We used nanopore sequencing to improve upon and resolve the issues plaguing the current zebrafish reference assembly (GRCz11). Our long-read assembly improved the current resolution of the reference genome by identifying 1,697 novel insertions and deletions over 1Kb in length and placing 106 previously unlocalized scaffolds. We also discovered additional sites of retrotransposon integration previously unreported in GRCz11 and observed their expression in adult zebrafish under physiologic conditions, implying they have active mobility in the zebrafish genome and contribute to the ever-changing genomic landscape.


QAlign: aligning nanopore reads accurately using current-level modeling

Bioinformatics ◽

10.1093/bioinformatics/btaa875 ◽

2020 ◽

Author(s):

Dhaivat Joshi ◽

Shunfu Mao ◽

Sreeram Kannan ◽

Suhas Diggavi

Keyword(s):

Reference Genome ◽

Genomic Analysis ◽

Vital Role ◽

Supplementary Information ◽

Sequencing Technology ◽

Long Reads ◽

A Genome ◽

Long Read ◽

Nanopore Sequencer ◽

Sequencing Process

Motivation: Efficient and accurate alignment of DNA/RNA sequence reads to each other or to a reference genome/transcriptome is an important problem in genomic analysis. Nanopore sequencing has emerged as a major sequencing technology and many long-read aligners have been designed for aligning nanopore reads. However, the high error rate makes accurate and efficient alignment difficult. Utilizing the noise and error characteristics inherent in the sequencing process properly can play a vital role in constructing a robust aligner. In this article, we design QAlign, a pre-processor that can be used with any long-read aligner for aligning long reads to a genome/transcriptome or to other long reads. The key idea in QAlign is to convert the nucleotide reads into discretized current levels that capture the error modes of the nanopore sequencer before running it through a sequence aligner. Results: We show that QAlign is able to improve alignment rates from around 80% up to 90% with nanopore reads when aligning to the genome. We also show that QAlign improves the average overlap quality by 9.2%, 2.5% and 10.8% in three real datasets for read-to-read alignment. Read-to-transcriptome alignment rates are improved from 51.6% to 75.4% and 82.6% to 90% in two real datasets. Availability and implementation: https://github.com/joshidhaivat/QAlign.git. Supplementary information: Supplementary data are available at Bioinformatics online.


Transcriptome assembly from long-read RNA-seq alignments with StringTie2

10.1101/694554 ◽

2019 ◽

Author(s):

Sam Kovaka ◽

Aleksey V. Zimin ◽

Geo M. Pertea ◽

Roham Razaghi ◽

Steven L. Salzberg ◽

...

Keyword(s):

Single Molecule ◽

Transcriptome Assembly ◽

RNA Seq ◽

High Error Rate ◽

Sequencing Technology ◽

Ability To Work ◽

Single Molecule Sequencing ◽

Long Reads ◽

Long Read

RNA sequencing using the latest single-molecule sequencing instruments produces reads that are thousands of nucleotides long. The ability to assemble these long reads can greatly improve the sensitivity of long-read analyses. Here we present StringTie2, a reference-guided transcriptome assembler that works with both short and long reads. StringTie2 includes new computational methods to handle the high error rate of long-read sequencing technology, which previous assemblers could not tolerate. It also offers the ability to work with full-length super-reads assembled from short reads, which further improves the quality of assemblies. On 33 short-read datasets from humans and two plant species, StringTie2 is 47.3% more precise and 3.9% more sensitive than Scallop. On multiple long read datasets, StringTie2 on average correctly assembles 8.3 and 2.6 times as many transcripts as FLAIR and Traphlor, respectively, with substantially higher precision. StringTie2 is also faster and has a smaller memory footprint than all comparable tools.
