rMETL: sensitive mobile element insertion detection with long read realignment

Tao Jiang; Bo Liu; Junyi Li; Yadong Wang

doi:10.1093/bioinformatics/btz106

Bioinformatics ◽

10.1093/bioinformatics/btz106 ◽

2019 ◽

Vol 35 (18) ◽

pp. 3484-3486 ◽

Cited By ~ 3

Author(s):

Tao Jiang ◽

Bo Liu ◽

Junyi Li ◽

Yadong Wang

Keyword(s):

Rapid Development ◽

Mobile Element ◽

Error Rates ◽

Supplementary Information ◽

Sequencing Error ◽

Complex Signals ◽

Element Insertion ◽

Sequencing Technologies ◽

Long Read ◽

Mobile Element Insertion

Abstract Summary: Mobile element insertion (MEI) is a major category of structural variations (SVs). The rapid development of long-read sequencing technologies provides the opportunity to detect MEIs sensitively. However, the signals of MEIs implied by noisy long reads are highly complex due to the repetitiveness of mobile elements as well as the high sequencing error rates. Herein, we propose the Realignment-based Mobile Element insertion detection Tool for Long read (rMETL). Benchmarking results on simulated and real datasets demonstrate that rMETL can handle these complex signals and discover MEIs sensitively. It is suited to producing high-quality MEI callsets in many genomics studies. Availability and implementation: rMETL is available from https://github.com/hitbc/rMETL. Supplementary information: Supplementary data are available at Bioinformatics online.

rMETL: sensitive mobile element insertion detection with long read realignment

10.1101/421560 ◽

2018 ◽

Author(s):

Tao Jiang ◽

Bo Liu ◽

Yadong Wang

Keyword(s):

Rapid Development ◽

Mobile Element ◽

Supplementary Information ◽

High Quality ◽

Element Insertion ◽

Sequencing Errors ◽

Detection Tool ◽

Long Reads ◽

Long Read ◽

Mobile Element Insertion

Abstract Summary: Mobile element insertion (MEI) is a major category of structural variations (SVs). The rapid development of long-read sequencing provides the opportunity to discover MEIs sensitively. However, the signals of MEIs implied by noisy long reads are highly complex, due to the repetitiveness of mobile elements as well as serious sequencing errors. Herein, we propose the Realignment-based Mobile Element insertion detection Tool for Long read (rMETL). rMETL takes advantage of its novel chimeric read re-alignment approach to handle complex MEI signals. Benchmarking results on simulated and real datasets demonstrated that rMETL can discover MEIs more sensitively while also preventing false positives. It is suited to producing high-quality MEI callsets in many genomics studies. Availability and Implementation: rMETL is available from https://github.com/hitbc/rMETL. Contact: [email protected] Supplementary information: Supplementary data are available at Bioinformatics online.

Long-read amplicon denoising

10.1101/383794 ◽

2018 ◽

Cited By ~ 3

Author(s):

Venkatesh Kumar ◽

Thomas Vollbrecht ◽

Mark Chernyshev ◽

Sanjay Mohan ◽

Brian Hanst ◽

...

Keyword(s):

Ground Truth ◽

Amplicon Sequencing ◽

Error Rates ◽

Sequencing Error ◽

Short Read Sequencing ◽

Sequencing Technologies ◽

Medium Length ◽

Long Reads ◽

Long Read ◽

Error Profiles

Long-read next-generation amplicon sequencing shows promise for studying complete genes or genomes from complex and diverse populations. Current long-read sequencing technologies have challenging error profiles, hindering data processing and incorporation into downstream analyses. Here we consider the problem of how to reconstruct, free of sequencing error, the true sequence variants and their associated frequencies. Called “amplicon denoising”, this problem has been extensively studied for short-read sequencing technologies, but current solutions do not appear to generalize well to long reads with high indel error rates. We introduce two methods: one that runs nearly instantly and is very accurate for medium-length reads (here ~2.6 kb) and high template coverage, and another, slower method that is more robust when reads are very long or coverage is lower. On one real dataset with ground truth, and on a number of simulated datasets, we compare our two approaches to each other and to existing algorithms. We outperform all tested methods in accuracy, with competitive run times even for our slower method. Fast Amplicon Denoising (FAD) and Robust Amplicon Denoising (RAD) are implemented purely in the Julia scientific computing language, and are hereby released along with a complete toolkit of functions that allow long-read amplicon sequence analysis pipelines to be constructed in pure Julia. Further, we make available a webserver to dramatically simplify the processing of long-read PacBio sequences.

NextPolish: a fast and efficient genome polishing tool for long-read assembly

Bioinformatics ◽

10.1093/bioinformatics/btz891 ◽

2019 ◽

Vol 36 (7) ◽

pp. 2253-2255 ◽

Cited By ~ 11

Author(s):

Jiang Hu ◽

Junpeng Fan ◽

Zongyi Sun ◽

Shanlin Liu

Keyword(s):

Error Rates ◽

Supplementary Information ◽

Sequencing Technologies ◽

Large Numbers ◽

Long Reads ◽

Long Read ◽

Genome Assemblies ◽

Polishing Tool ◽

Sequence Errors ◽

Plant Arabidopsis Thaliana

Abstract Motivation: Although long-read sequencing technologies can produce genomes with long contiguity, they suffer from high error rates. Thus, we developed NextPolish, a tool that efficiently corrects sequence errors in genomes assembled with long reads. This new tool consists of two interlinked modules that are designed to score and count k-mers from high-quality short reads, and to polish genome assemblies containing large numbers of base errors. Results: When evaluated for speed and efficiency using a human genome and a plant (Arabidopsis thaliana) genome, NextPolish outperformed Pilon by correcting sequence errors faster and with higher correction accuracy. Availability and implementation: NextPolish is implemented in C and Python. The source code is available from https://github.com/Nextomics/NextPolish. Supplementary information: Supplementary data are available at Bioinformatics online.
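The general idea of scoring and counting k-mers from short reads to correct a draft assembly can be illustrated with a toy sketch. This is not NextPolish's actual algorithm, which the abstract does not describe in detail; the function names `count_kmers`, `kmer_support` and `polish_base`, and the rule "pick the base whose covering k-mers are most frequent in the short reads", are assumptions made purely for illustration.

```python
from collections import Counter


def count_kmers(reads, k):
    """Count every k-mer occurring in a collection of short reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def kmer_support(seq, pos, counts, k):
    """Sum the short-read counts of all k-mers of seq that cover position pos."""
    start = max(0, pos - k + 1)
    end = min(pos, len(seq) - k)
    return sum(counts[seq[i:i + k]] for i in range(start, end + 1))


def polish_base(seq, pos, counts, k):
    """Hypothetical correction rule: try each base at pos and keep the one
    with the highest k-mer support; ties favour the original base."""
    def score(base):
        candidate = seq[:pos] + base + seq[pos + 1:]
        return (kmer_support(candidate, pos, counts, k), base == seq[pos])

    best = max("ACGT", key=score)
    return seq[:pos] + best + seq[pos + 1:]


if __name__ == "__main__":
    short_reads = ["ACGTACGT"] * 5      # accurate short reads
    draft = "ACGAACGT"                  # draft assembly with one base error
    counts = count_kmers(short_reads, k=4)
    print(polish_base(draft, 3, counts, k=4))
```

A real polisher also has to handle insertions and deletions, reverse-complement k-mers and alignment of reads to the draft, none of which this sketch attempts.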

MetaBCC-LR: metagenomics binning by coverage and composition for long reads

Bioinformatics ◽

10.1093/bioinformatics/btaa441 ◽

2020 ◽

Vol 36 (Supplement_1) ◽

pp. i3-i11

Author(s):

Anuradha Wickramarachchi ◽

Vijini Mallawaarachchi ◽

Vaibhav Rajan ◽

Yu Lin

Keyword(s):

Error Rates ◽

Supplementary Information ◽

Metagenomic Data ◽

Sequencing Technologies ◽

Input Size ◽

Long Reads ◽

Wide Range ◽

Long Read ◽

Oligonucleotide Composition ◽

Species Specific

Abstract Motivation: Metagenomics studies have provided key insights into the composition and structure of microbial communities found in different environments. Among the techniques used to analyse metagenomic data, binning is considered a crucial step to characterize the different species of micro-organisms present. The use of short-read data in most binning tools poses several limitations, such as insufficient species-specific signal, and the emergence of long-read sequencing technologies offers us opportunities to surmount them. However, most current metagenomic binning tools have been developed for short reads. The few tools that can process long reads either do not scale with increasing input size or require a database with reference genomes that are often unknown. In this article, we present MetaBCC-LR, a scalable reference-free binning method which clusters long reads directly based on their k-mer coverage histograms and oligonucleotide composition. Results: We evaluate MetaBCC-LR on multiple simulated and real metagenomic long-read datasets with varying coverages and error rates. Our experiments demonstrate that MetaBCC-LR substantially outperforms state-of-the-art reference-free binning tools, achieving ∼13% improvement in F1-score and ∼30% improvement in ARI compared to the best previous tools. Moreover, we show that using MetaBCC-LR before long-read assembly helps to enhance the assembly quality while significantly reducing the assembly cost in terms of time and memory usage. The efficiency and accuracy of MetaBCC-LR pave the way for more effective long-read-based metagenomics analyses to support a wide range of applications. Availability and implementation: The source code is freely available at: https://github.com/anuradhawick/MetaBCC-LR. Supplementary information: Supplementary data are available at Bioinformatics online.
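The two per-read features named in the abstract, oligonucleotide composition and a k-mer coverage histogram, can be sketched as follows. This is a minimal illustration, not the MetaBCC-LR implementation: the function names, the choice of k, the bin width and the number of bins are all assumptions, and reverse complements are not merged as a production tool might do.

```python
from collections import Counter
from itertools import product


def count_kmers(reads, k):
    """Count k-mers across the whole read set (a proxy for coverage)."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def composition_vector(read, k=3):
    """Normalised frequency of every possible k-mer over the ACGT alphabet."""
    alphabet_kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    counts = Counter(read[i:i + k] for i in range(len(read) - k + 1))
    total = sum(counts[m] for m in alphabet_kmers)
    return [counts[m] / total if total else 0.0 for m in alphabet_kmers]


def coverage_histogram(read, kmer_counts, k, bin_width, n_bins):
    """Histogram of dataset-wide counts of the k-mers found in one read.
    Counts beyond the last bin are clamped into it."""
    hist = [0] * n_bins
    for i in range(len(read) - k + 1):
        count = kmer_counts.get(read[i:i + k], 0)
        hist[min(count // bin_width, n_bins - 1)] += 1
    n = sum(hist)
    return [h / n if n else 0.0 for h in hist]


if __name__ == "__main__":
    reads = ["ACGT", "ACGA"]
    counts = count_kmers(reads, k=2)
    print(composition_vector(reads[0], k=1))
    print(coverage_histogram(reads[0], counts, k=2, bin_width=1, n_bins=3))
```

Reads from the same genome tend to share both feature profiles, so the resulting vectors could be passed to any standard clustering routine to form bins.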

Long-read amplicon denoising

Nucleic Acids Research ◽

10.1093/nar/gkz657 ◽

2019 ◽

Vol 47 (18) ◽

pp. e104-e104 ◽

Cited By ~ 12

Author(s):

Venkatesh Kumar ◽

Thomas Vollbrecht ◽

Mark Chernyshev ◽

Sanjay Mohan ◽

Brian Hanst ◽

...

Keyword(s):

Ground Truth ◽

Amplicon Sequencing ◽

Error Rates ◽

Sequencing Error ◽

Single Nucleotide ◽

Sequencing Technologies ◽

Long Reads ◽

Long Read ◽

Error Profiles ◽

Virus Community

Abstract Long-read next-generation amplicon sequencing shows promise for studying complete genes or genomes from complex and diverse populations. Current long-read sequencing technologies have challenging error profiles, hindering data processing and incorporation into downstream analyses. Here we consider the problem of how to reconstruct, free of sequencing error, the true sequence variants and their associated frequencies from PacBio reads. Called ‘amplicon denoising’, this problem has been extensively studied for short-read sequencing technologies, but current solutions do not always successfully generalize to long reads with high indel error rates. We introduce two methods: one that runs nearly instantly and is very accurate for medium-length reads and high template coverage, and another, slower method that is more robust when reads are very long or coverage is lower. On two Mock Virus Community datasets with ground truth, each sequenced on a different PacBio instrument, and on a number of simulated datasets, we compare our two approaches to each other and to existing algorithms. We outperform all tested methods in accuracy, with competitive run times even for our slower method, successfully discriminating templates that differ by just a single nucleotide. Julia implementations of Fast Amplicon Denoising (FAD) and Robust Amplicon Denoising (RAD), and a webserver interface, are freely available.

SPRING: a next-generation compressor for FASTQ data

Bioinformatics ◽

10.1093/bioinformatics/bty1015 ◽

2018 ◽

Vol 35 (15) ◽

pp. 2674-2676 ◽

Cited By ~ 18

Author(s):

Shubham Chandak ◽

Kedar Tatwawadi ◽

Idoia Ochoa ◽

Mikel Hernaez ◽

Tsachy Weissman

Keyword(s):

High Throughput Sequencing ◽

Random Access ◽

Lossless Compression ◽

General Purpose ◽

Supplementary Information ◽

High Coverage ◽

Sequencing Technologies ◽

Long Read ◽

Previous State ◽

Computational Resources

Abstract Motivation: High-throughput sequencing technologies produce huge amounts of data in the form of short genomic reads, associated quality values and read identifiers. Because of the significant structure present in these FASTQ datasets, general-purpose compressors are unable to exploit much of the inherent redundancy. Although there has been a lot of work on designing FASTQ compressors, most of them lack support for one or more crucial properties, such as variable-length reads, scalability to high-coverage datasets, pairing-preserving compression and lossless compression. Results: In this work, we propose SPRING, a reference-free compressor for FASTQ files. SPRING supports a wide variety of compression modes and features, including lossless compression, pairing-preserving compression, lossy compression of quality values, long-read compression and random access. SPRING achieves substantially better compression than existing tools; for example, SPRING compresses 195 GB of 25× whole-genome human FASTQ from Illumina’s NovaSeq sequencer to less than 7 GB, around 1.6× smaller than previous state-of-the-art FASTQ compressors. SPRING achieves this improvement while using comparable computational resources. Availability and implementation: SPRING can be downloaded from https://github.com/shubhamchandak94/SPRING. Supplementary information: Supplementary data are available at Bioinformatics online.

Toolbox for Mobile-Element Insertion Detection on Cancer Genomes

Cancer Informatics ◽

10.4137/cin.s24657 ◽

2015 ◽

Vol 14s1 ◽

pp. CIN.S24657

Author(s):

Wan-Ping Lee ◽

Jiantao Wu ◽

Gabor T. Marth

Keyword(s):

Human Genome ◽

Mobile Element ◽

Mobile Elements ◽

Cancer Genome ◽

The Cancer Genome Atlas ◽

Next Generation Sequencing Data ◽

Sequencing Data ◽

Element Insertion ◽

Human Genome Evolution ◽

Mobile Element Insertion

Mobile elements constitute more than 45% of the human genome as a result of repeated insertion events during human genome evolution. Although most mobile elements are fixed within the human population, some elements (including ALU, long interspersed element 1 (LINE-1, L1), and SVA) are still actively duplicating and may result in life-threatening human diseases such as cancer, motivating the need for accurate mobile-element insertion (MEI) detection tools. We developed a software package, TANGRAM, for MEI detection in next-generation sequencing data, currently serving as the primary MEI detection tool in the 1000 Genomes Project. TANGRAM takes advantage of valuable mapping information provided by our own MOSAIK mapper, and until recently required MOSAIK mappings as its input. In this study, we report a new feature that enables TANGRAM to be used on alignments generated by any mainstream short-read mapper, making it accessible to many genomics users. To demonstrate its utility for cancer genome analysis, we have applied TANGRAM to the TCGA (The Cancer Genome Atlas) mutation calling benchmark 4 dataset. TANGRAM is fast, accurate, easy to use, and open source at https://github.com/jiantao/Tangram.

Tangram: a comprehensive toolbox for mobile element insertion detection

BMC Genomics ◽

10.1186/1471-2164-15-795 ◽

2014 ◽

Vol 15 (1) ◽

pp. 795 ◽

Cited By ~ 37

Author(s):

Jiantao Wu ◽

Wan-Ping Lee ◽

Alistair Ward ◽

Jerilyn A Walker ◽

Miriam K Konkel ◽

...

Keyword(s):

Mobile Element ◽

Element Insertion ◽

Mobile Element Insertion

Mobile genetic element insertions drive antibiotic resistance across pathogens

10.1101/527788 ◽

2019 ◽

Cited By ~ 1

Author(s):

Matthew G. Durrant ◽

Michelle M. Li ◽

Ben Siranosian ◽

Ami S. Bhatt

Keyword(s):

Antibiotic Resistance ◽

De Novo ◽

Bacterial Species ◽

Mobile Element ◽

Mobile Genetic Elements ◽

Mobile Elements ◽

Genetic Elements ◽

Element Insertion ◽

A Genome ◽

Mobile Element Insertion

Abstract Mobile genetic elements contribute to bacterial adaptation and evolution; however, detecting these elements in a high-throughput and unbiased manner remains challenging. Here, we demonstrate a de novo approach to identify mobile elements from short-read sequencing data. The method identifies the precise site of mobile element insertion and infers the identity of the inserted sequence. This is an improvement over previous methods that either rely on curated databases of known mobile elements or rely on ‘split-read’ alignments that assume the inserted element exists within the reference genome. We apply our approach to 12,419 sequenced isolates of nine prevalent bacterial pathogens, and we identify hundreds of known and novel mobile genetic elements, including many candidate insertion sequences. We find that the mobile element repertoire and insertion rate vary considerably across species, and that many of the identified mobile elements are biased toward certain target sequences, several of them being highly specific. Mobile element insertion hotspots often cluster near genes involved in mechanisms of antibiotic resistance, and such insertions are associated with antibiotic resistance in laboratory experiments and clinical isolates. Finally, we demonstrate that mutagenesis caused by these mobile elements contributes to antibiotic resistance in a genome-wide association study of mobile element insertions in pathogenic Escherichia coli. In summary, by applying a de novo approach to precisely identify mobile genetic elements and their insertion sites, we thoroughly characterize the mobile element repertoire and insertion spectrum of nine pathogenic bacterial species and find that mobile element insertions play a significant role in the evolution of clinically relevant phenotypes, such as antibiotic resistance.

Minimizer-space de Bruijn graphs

10.1101/2021.06.09.447586 ◽

2021 ◽

Author(s):

Barış Ekim ◽

Bonnie Berger ◽

Rayan Chikhi

Keyword(s):

Human Genome ◽

Dna Sequences ◽

Graphical Representation ◽

Error Rates ◽

Sequencing Error ◽

Sequencing Data ◽

De Bruijn Graphs ◽

Human Genome Assembly ◽

Long Read ◽

Metagenome Assembly

DNA sequencing data continues to progress towards longer reads with increasingly lower sequencing error rates. We focus on the problem of assembling such reads into genomes, which poses challenges in terms of accuracy and computational resources when using cutting-edge assembly approaches, e.g. those based on overlapping reads using minimizer sketches. Here, we introduce the concept of minimizer-space sequencing data analysis, where the minimizers rather than DNA nucleotides are the atomic tokens of the alphabet. By projecting DNA sequences into ordered lists of minimizers, our key idea is to enumerate what we call k-min-mers, which are k-mers over a larger alphabet consisting of minimizer tokens. Our approach, mdBG or minimizer-dBG, achieves orders-of-magnitude improvement in both speed and memory usage over existing methods without much loss of accuracy. We demonstrate three use cases of mdBG: human genome assembly, metagenome assembly, and the representation of large pangenomes. For assembly, we implemented mdBG in software we call rust-mdbg, resulting in ultra-fast, low-memory and highly contiguous assembly of PacBio HiFi reads. A human genome is assembled in under 10 minutes using 8 cores and 10 GB RAM, and 60 Gbp of metagenome reads are assembled in 4 minutes using 1 GB RAM. For pangenome graphs, we newly allow a graphical representation of a collection of 661,405 bacterial genomes as an mdBG and successfully search it (in minimizer-space) for anti-microbial resistance (AMR) genes. We expect our advances to be essential to sequence analysis, given the rise of long-read sequencing in genomics, metagenomics and pangenomics.
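The projection of a DNA sequence into minimizer space and the enumeration of k-min-mers can be sketched in a few lines. This is an illustrative toy, not rust-mdbg: the hash function, the density-threshold selection rule and the names `select_minimizers`, `k_min_mers` and `mdbg_edges` are assumptions chosen for the example.

```python
import hashlib

HASH_BITS = 64


def hash_lmer(lmer):
    """Stable 64-bit hash of an l-mer (Python's built-in hash is salted per run)."""
    digest = hashlib.blake2b(lmer.encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def select_minimizers(seq, l, density):
    """Assumed selection rule: keep every l-mer whose hash falls below
    density * 2^64, so the choice depends only on the l-mer itself."""
    threshold = int(density * 2 ** HASH_BITS)
    return [seq[i:i + l] for i in range(len(seq) - l + 1)
            if hash_lmer(seq[i:i + l]) < threshold]


def k_min_mers(minimizers, k):
    """k-min-mers: windows of k consecutive minimizer tokens."""
    return [tuple(minimizers[i:i + k]) for i in range(len(minimizers) - k + 1)]


def mdbg_edges(kmms):
    """Consecutive k-min-mers overlap by k-1 tokens and form graph edges."""
    return list(zip(kmms, kmms[1:]))


if __name__ == "__main__":
    read = "ACGTACGGTCATGCATTGACCGTA"
    tokens = select_minimizers(read, l=4, density=0.5)
    nodes = k_min_mers(tokens, k=3)
    print(len(read), "bases ->", len(tokens), "tokens ->", len(nodes), "k-min-mers")
    for left, right in mdbg_edges(nodes):
        assert left[1:] == right[:-1]   # (k-1)-token overlap
```

Because each token stands for a whole l-mer and only a fraction of l-mers is selected, the graph is built over far fewer symbols than a nucleotide-level de Bruijn graph, which is the intuition behind the speed and memory savings the abstract reports.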
