MultiNanopolish: refined grouping method for reducing redundant calculations in Nanopolish

Author(s):  
Kang Hu
Neng Huang
You Zou
Xingyu Liao
Jianxin Wang

Abstract
Motivation: Compared with second-generation sequencing technologies, third-generation sequencing technologies allow us to obtain longer reads (average ∼10 kbp, maximum 900 kbp), but at a higher error rate (∼15%). Nanopolish is a variant and methylation detection tool based on a hidden Markov model, which uses Oxford Nanopore sequencing data for signal-level analysis. Nanopolish can greatly improve assembly accuracy, but it suffers from long running times, since most of its execution is a serial and computationally expensive process.
Results: In this paper, we present an effective polishing tool, Multithreading Nanopolish (MultiNanopolish), which decomposes the iterative calculation in Nanopolish into small independent calculation tasks, making it possible to run this process in parallel. Experimental results show that MultiNanopolish reduces running time by 50% with a read-uncorrected assembler (Miniasm) and by 20% with read-corrected assemblers (Canu and Flye) when using 40 threads, compared to the original Nanopolish.
Availability and implementation: MultiNanopolish is available on GitHub: https://github.com/BioinformaticsCSU/MultiNanopolish
Supplementary information: Supplementary data are available at Bioinformatics online.
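
The core idea is that each polishing window is independent of the others, so the serial loop can be replaced by a worker pool. Below is a minimal Python sketch of this kind of task decomposition; polish_segment is a hypothetical stand-in for the HMM-based consensus computation on one window, not MultiNanopolish's actual API:

    from concurrent.futures import ProcessPoolExecutor

    def polish_segment(segment):
        # Hypothetical stand-in for one independent consensus task; in
        # Nanopolish this would re-call a small window of the draft
        # assembly against the signal-level HMM.
        seg_id, draft_seq = segment
        polished = draft_seq.upper()  # placeholder for the real computation
        return seg_id, polished

    def polish_parallel(segments, threads=40):
        # Windows are independent, so they can be farmed out to a pool
        # and the results reassembled in their original order.
        with ProcessPoolExecutor(max_workers=threads) as pool:
            results = dict(pool.map(polish_segment, segments))
        return [results[seg_id] for seg_id, _ in segments]

    if __name__ == "__main__":
        windows = [(0, "acgtacgt"), (1, "ttgacc")]
        print(polish_parallel(windows, threads=4))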

2017
Author(s):  
Krešimir Križanović
Ivan Sović
Ivan Krpelnik
Mile Šikić

Abstract: Next-generation sequencing technologies have made RNA sequencing widely accessible and applicable in many areas of research. In recent years, third-generation sequencing technologies have matured and are slowly replacing NGS for DNA sequencing. This paper presents a novel tool for RNA mapping guided by gene annotations. The tool is an adapted version of a previously developed DNA mapper, GraphMap, tailored for third-generation sequencing data such as those produced by Pacific Biosciences or Oxford Nanopore Technologies devices. It uses gene annotations to generate a transcriptome, applies a DNA mapping algorithm to map reads to that transcriptome, and finally transforms the mappings back to genome coordinates. The modified version of GraphMap is compared on several synthetic datasets to state-of-the-art RNA-seq mappers capable of handling third-generation sequencing data. The results show that our tool outperforms the other tools in overall mapping quality.
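
The back-transformation is the annotation-guided step: each transcript coordinate must be lifted over to the genome through the exon structure. A minimal sketch of that lift-over for a plus-strand transcript follows; the exon representation is an assumption for illustration, not GraphMap's internal data structure:

    def transcript_to_genome(exons, t_pos):
        # Map a 0-based transcript position back to a genomic coordinate,
        # given the transcript's exons as (genome_start, genome_end)
        # half-open intervals listed in transcription order.
        offset = t_pos
        for g_start, g_end in exons:
            exon_len = g_end - g_start
            if offset < exon_len:
                return g_start + offset
            offset -= exon_len
        raise ValueError("position beyond transcript length")

    # A transcript with exons at 100-200 and 300-350: transcript
    # position 120 falls 20 bases into the second exon.
    print(transcript_to_genome([(100, 200), (300, 350)], 120))  # -> 320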


Author(s):  
Mengyang Xu
Lidong Guo
Xiao Du
Lei Li
Brock A Peters
...  

Abstract
Motivation: Achieving a near-complete understanding of how the genome of an individual affects the phenotypes of that individual requires deciphering the order of variations along homologous chromosomes in species with diploid genomes. However, true diploid assembly of long-range haplotypes remains challenging.
Results: To address this, we have developed Haplotype-resolved Assembly for Synthetic long reads using a Trio-binning strategy, or HAST, which uses parental information to classify reads as maternal or paternal. Once sorted, these reads are used to independently de novo assemble the parent-specific haplotypes. We applied HAST to co-barcoded second-generation sequencing data from an Asian individual, resulting in a haplotype assembly covering 94.7% of the reference genome with a scaffold N50 longer than 11 Mb. The high haplotyping precision (∼99.7%) and recall (∼95.9%) represent a substantial improvement over the tool commonly used to assemble co-barcoded reads (Supernova) and are comparable to a trio-binning-based assembly method for third-generation long reads (TrioCanu), but with substantially higher single-base accuracy (up to 99.99997%, Q65). This makes HAST a superior tool for accurate haplotyping and future haplotype-based studies.
Availability: The code of the analysis is available at https://github.com/BGI-Qingdao/HAST.
Supplementary information: Supplementary data are available at Bioinformatics online.
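
Trio binning classifies sequence by the parent-specific k-mers it carries: k-mers present in the mother's data but absent from the father's vote maternal, and vice versa. A simplified sketch of that classification rule follows; the k-mer sets, read-level granularity and tie handling are illustrative assumptions (HAST operates on co-barcoded read clouds):

    def kmers(seq, k):
        return {seq[i:i + k] for i in range(len(seq) - k + 1)}

    def bin_read(read, maternal_only, paternal_only, k=21):
        # Count hits against k-mer sets unique to each parent and assign
        # the read to the haplotype with more support.
        ks = kmers(read, k)
        m, p = len(ks & maternal_only), len(ks & paternal_only)
        if m > p:
            return "maternal"
        if p > m:
            return "paternal"
        return "unassigned"  # ambiguous reads can be used by both assemblies

    mat, pat = {"ACG", "CGT"}, {"TTT", "TTA"}
    print(bin_read("ACGT", mat, pat, k=3))  # -> "maternal"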


2018
Author(s):  
Jana Ebler
Marina Haukness
Trevor Pesout
Tobias Marschall
Benedict Paten

Motivation: Current genotyping approaches for single nucleotide variations (SNVs) rely on short, relatively accurate reads from second-generation sequencing devices. Presently, third-generation sequencing platforms able to generate much longer reads are becoming more widespread. These platforms come with the significant drawback of higher sequencing error rates, which makes them ill-suited to current genotyping algorithms. However, the longer reads make more of the genome unambiguously mappable and typically provide linkage information between neighboring variants.
Results: In this paper we introduce a novel approach for haplotype-aware genotyping from noisy long reads. We do this by considering bipartitions of the sequencing reads, corresponding to the two haplotypes. We formalize the computational problem in terms of a hidden Markov model and compute posterior genotype probabilities using the forward-backward algorithm. Genotype predictions can then be made by picking the most likely genotype at each site. Our experiments indicate that longer reads allow significantly more of the genome to potentially be accurately genotyped. Further, we are able to use both Oxford Nanopore and Pacific Biosciences sequencing data to independently validate millions of variants previously identified by short-read technologies in the reference NA12878 sample, including hundreds of thousands of variants that were not previously included in the high-confidence reference set.
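
The forward-backward algorithm at the heart of this approach computes, for every site, the posterior probability of each hidden state given all observations. Here is a generic, small-scale sketch; the two-state toy model is illustrative only, since the paper's actual state space couples read bipartitions with genotypes:

    import numpy as np

    def forward_backward(init, trans, emit, obs):
        # init: (S,) start probabilities, trans: (S, S) transitions,
        # emit: (S, O) emission probabilities, obs: observed symbol indices.
        n, S = len(obs), len(init)
        fwd = np.zeros((n, S))
        bwd = np.zeros((n, S))
        fwd[0] = init * emit[:, obs[0]]
        for t in range(1, n):                      # forward pass
            fwd[t] = (fwd[t - 1] @ trans) * emit[:, obs[t]]
        bwd[-1] = 1.0
        for t in range(n - 2, -1, -1):             # backward pass
            bwd[t] = trans @ (emit[:, obs[t + 1]] * bwd[t + 1])
        post = fwd * bwd
        return post / post.sum(axis=1, keepdims=True)

    post = forward_backward(np.array([0.5, 0.5]),
                            np.array([[0.9, 0.1], [0.1, 0.9]]),
                            np.array([[0.8, 0.2], [0.3, 0.7]]),
                            [0, 0, 1])
    print(post.argmax(axis=1))  # most likely state (genotype) per site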


2020
Vol 2 (2)
Author(s):  
Juliane C Dohm
Philipp Peters
Nancy Stralis-Pavese
Heinz Himmelbauer

Abstract: Third-generation sequencing technologies provided by Pacific Biosciences and Oxford Nanopore Technologies generate read lengths on the scale of kilobase pairs. However, these reads display high error rates, and correction steps are necessary to realize their great potential in genomics and transcriptomics. Here, we compare properties of PacBio and Nanopore data and assess correction methods by Canu, MARVEL and proovread in various combinations. We found total error rates of around 13% in the raw datasets. PacBio reads showed a high rate of insertions (around 8%), whereas Nanopore reads showed similar rates of substitutions, insertions and deletions of around 4% each. In data from both technologies the errors were uniformly distributed along reads, apart from noisy 5′ ends, and homopolymers were among the k-mers most over-represented relative to a reference. Consensus correction using read overlaps reduced error rates to about 1% when using Canu or MARVEL after patching. The lowest error rate in Nanopore data (0.45%) was achieved by applying proovread on MARVEL-patched data including Illumina short reads, and the lowest error rate in PacBio data (0.42%) resulted from Canu correction with minimap2 alignment after patching. Our study provides valuable insights and benchmarks regarding long-read data and correction methods.
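
Per-type error rates like these are typically derived from read-to-reference alignments by tallying alignment operations. A minimal sketch of that tally, assuming an extended-CIGAR-style list of (operation, length) pairs; the exact denominators used in the study may differ:

    def error_rates(ops):
        # ops: (op, length) pairs with op in {'=', 'X', 'I', 'D'}
        # (match, substitution, insertion, deletion).
        counts = {"=": 0, "X": 0, "I": 0, "D": 0}
        for op, length in ops:
            counts[op] += length
        aligned = sum(counts.values())
        return {op: counts[op] / aligned for op in ("X", "I", "D")}

    # A 100-column toy alignment with 4 mismatches, 8 inserted and
    # 3 deleted bases:
    print(error_rates([("=", 85), ("X", 4), ("I", 8), ("D", 3)]))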


2020
Vol 36 (17)
pp. 4568-4575
Author(s):  
Lolita Lecompte
Pierre Peterlongo
Dominique Lavenier
Claire Lemaitre

Abstract
Motivation: Studies on structural variants (SVs) are expanding rapidly. As a result, and thanks to third-generation sequencing technologies, the number of discovered SVs is increasing, especially in the human genome. At the same time, for several applications such as clinical diagnosis, it is important to genotype newly sequenced individuals on well-defined and characterized SVs. Whereas several SV genotypers have been developed for short-read data, there is a lack of dedicated tools to assess whether known SVs are present or not in a newly sequenced long-read sample, such as the data produced by Pacific Biosciences or Oxford Nanopore Technologies.
Results: We present a novel method to genotype known SVs from long-read sequencing data. The method is based on the generation of a set of representative allele sequences for the two alleles of each structural variant. Long reads are aligned to these allele sequences. Alignments are then analyzed and filtered to keep only informative ones, in order to quantify the support for each SV allele and estimate allele frequencies. We provide an implementation of the method, SVJedi, to genotype SVs with long reads. The tool has been applied to both simulated and real human datasets and achieves high genotyping accuracy. We show that SVJedi performs better than other existing long-read genotyping tools, and we also demonstrate that SV genotyping is considerably improved with SVJedi compared to other approaches, namely SV discovery and short-read SV genotyping.
Availability and implementation: https://github.com/llecompte/SVJedi.git
Supplementary information: Supplementary data are available at Bioinformatics online.
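
The final genotyping decision reduces to comparing counts of informative reads supporting each allele. A deliberately simplified, ratio-based sketch of such a decision rule follows; the thresholds are illustrative assumptions, not SVJedi's actual statistical model:

    def genotype_sv(ref_support, alt_support, min_reads=3, hom_frac=0.8):
        # ref_support/alt_support: informative reads aligning best to the
        # reference-allele and SV-allele representative sequences.
        total = ref_support + alt_support
        if total < min_reads:
            return "./."                     # too few informative reads
        alt_frac = alt_support / total
        if alt_frac >= hom_frac:
            return "1/1"                     # homozygous for the SV allele
        if alt_frac <= 1 - hom_frac:
            return "0/0"                     # homozygous reference
        return "0/1"                         # heterozygous

    print(genotype_sv(12, 11))  # -> "0/1"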


Author(s):  
Zekun Yin
Hao Zhang
Meiyang Liu
Wen Zhang
Honglei Song
...  

Abstract
Motivation: Modern sequencing technologies continue to revolutionize many areas of biology and medicine. Since the generated datasets are error-prone, downstream applications usually require quality control methods to pre-process FASTQ files. However, existing tools for this task are currently not able to fully exploit the capabilities of modern computing platforms, leading to slow runtimes.
Results: We present RabbitQC, an extremely fast integrated quality control tool for FASTQ files, which can take full advantage of modern hardware. It includes a variety of operations and supports different sequencing technologies (Illumina, Oxford Nanopore and PacBio). RabbitQC achieves speedups of one to two orders of magnitude compared to other state-of-the-art tools.
Availability and implementation: C++ sources and binaries are available at https://github.com/ZekunYin/RabbitQC.
Supplementary information: Supplementary data are available at Bioinformatics online.


2020
Vol 36 (12)
pp. 3669-3679
Author(s):  
Can Firtina
Jeremie S Kim
Mohammed Alser
Damla Senol Cali
A Ercument Cicek
...  

Abstract
Motivation: Third-generation sequencing technologies can sequence long reads that contain as many as 2 million base pairs. These long reads are used to construct an assembly (i.e. the subject's genome), which is further used in downstream genome analysis. Unfortunately, third-generation sequencing technologies have high sequencing error rates, and a large proportion of the base pairs in these long reads is incorrectly identified. These errors propagate to the assembly and affect the accuracy of genome analysis. Assembly polishing algorithms minimize such error propagation by polishing or fixing errors in the assembly using information from alignments between reads and the assembly (i.e. read-to-assembly alignment information). However, current assembly polishing algorithms can only polish an assembly using reads from either a certain sequencing technology or a small assembly. Such technology dependency and assembly-size dependency require researchers to (i) run multiple polishing algorithms in order to use all available read sets and (ii) split a large genome into small chunks in order to polish it.
Results: We introduce Apollo, a universal assembly polishing algorithm that scales well to polish an assembly of any size (i.e. both large and small genomes) using reads from all sequencing technologies (i.e. second- and third-generation). Our goal is to provide a single algorithm that uses read sets from all available sequencing technologies to improve the accuracy of assembly polishing and that can polish large genomes. Apollo (i) models an assembly as a profile hidden Markov model (pHMM), (ii) uses read-to-assembly alignments to train the pHMM with the forward-backward algorithm and (iii) decodes the trained model with the Viterbi algorithm to produce a polished assembly. Our experiments with real read sets demonstrate that Apollo is the only algorithm that (i) uses reads from any sequencing technology within a single run and (ii) scales well to polish large assemblies without splitting the assembly into multiple parts.
Availability and implementation: Source code is available at https://github.com/CMU-SAFARI/Apollo.
Supplementary information: Supplementary data are available at Bioinformatics online.
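
Step (iii), Viterbi decoding, recovers the single most likely state path through the trained model; in Apollo that path spells out the polished sequence. A generic log-space Viterbi sketch on a toy two-state HMM follows; it is illustrative only, since Apollo's pHMM has match, insertion and deletion states per assembly position:

    import numpy as np

    def viterbi(init, trans, emit, obs):
        # Most likely hidden-state path, computed in log space for
        # numerical stability.
        n, S = len(obs), len(init)
        log_t, log_e = np.log(trans), np.log(emit)
        score = np.log(init) + log_e[:, obs[0]]
        back = np.zeros((n, S), dtype=int)
        for t in range(1, n):
            cand = score[:, None] + log_t        # (previous, current)
            back[t] = cand.argmax(axis=0)
            score = cand.max(axis=0) + log_e[:, obs[t]]
        path = [int(score.argmax())]
        for t in range(n - 1, 0, -1):            # trace back
            path.append(int(back[t, path[-1]]))
        return path[::-1]

    print(viterbi(np.array([0.6, 0.4]),
                  np.array([[0.7, 0.3], [0.4, 0.6]]),
                  np.array([[0.9, 0.1], [0.2, 0.8]]),
                  [0, 1, 1]))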


Author(s):  
Giulio Caravagna

Abstract: Cancers progress through the accumulation of somatic mutations, which accrue during tumour evolution and allow some cells to proliferate in an uncontrolled fashion. This growth process is intimately related to latent evolutionary forces moulding the genetic and epigenetic composition of tumour subpopulations. Understanding cancer therefore requires understanding these selective pressures. The widespread adoption of next-generation sequencing technologies opens up the possibility of measuring molecular profiles of cancers at multiple resolutions, across one or multiple patients. In this review we discuss how cancer genome sequencing data from a single tumour can be used to understand these evolutionary forces, giving an overview of mathematical models and inferential methods adopted in the field of cancer evolution.


Author(s):  
Zeynep Baskurt
Scott Mastromatteo
Jiafen Gong
Richard F Wintle
Stephen W Scherer
...  

Abstract: Integration of next-generation sequencing (NGS) data across different research studies can improve the power of genetic association testing by increasing sample size and can obviate the need for sequencing controls. However, if differential genotype uncertainty across studies is not accounted for, combining datasets can produce spurious association results. We developed the Variant Integration Kit for NGS (VikNGS), a fast cross-platform software package, to enable aggregation of several datasets for rare- and common-variant genetic association analysis of quantitative and binary traits with covariate adjustment. VikNGS also includes a graphical user interface, power simulation functionality and data visualization tools.
Availability: The VikNGS package can be downloaded at http://www.tcag.ca/tools/index.html.
Supplementary information: Supplementary data are available at Bioinformatics online.


Cells
2020
Vol 9 (8)
pp. 1776
Author(s):  
Mourdas Mohamed
Nguyet Thi-Minh Dang
Yuki Ogyama
Nelly Burlet
Bruno Mugat
...  

Transposable elements (TEs) are major components of genomes. However, due to their repetitive nature, they are very difficult to study using data obtained with short-read sequencing technologies. Here, we describe an efficient pipeline to accurately recover TE insertion (TEI) sites and sequences from long reads obtained by Oxford Nanopore Technologies (ONT) sequencing. With this pipeline, we could precisely describe the landscapes of the most recent TEIs in wild-type strains of Drosophila melanogaster and Drosophila simulans. Their comparison suggests that this subset of TE sequences is more similar between these two species than previously thought. The chromosome assemblies obtained using this pipeline also allowed us to recover piRNA cluster sequences, which was impossible with short-read sequencing. Finally, we used our pipeline to analyze ONT sequencing data from a D. melanogaster unstable line in which LTR transposition had been derepressed for 73 successive generations. We could rely on single reads to identify new insertions with intact target site duplications. Moreover, the detailed analysis of TEIs in the wild-type strains and the unstable line did not support the trap model, which posits that piRNA clusters are hotspots of TE insertions.
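
A hallmark of a genuine new insertion is an intact target site duplication (TSD): a short motif repeated immediately on both sides of the inserted TE. A minimal sketch of checking for such a duplication on the flanks of a candidate insertion within a single long read; the length bounds are illustrative assumptions, not the pipeline's actual parameters:

    def find_tsd(left_flank, right_flank, min_len=4, max_len=20):
        # Report the longest motif that both ends the left flank and
        # starts the right flank, i.e. a candidate TSD.
        for n in range(max_len, min_len - 1, -1):
            if len(left_flank) >= n and left_flank[-n:] == right_flank[:n]:
                return left_flank[-n:]
        return None

    print(find_tsd("GGGACGTACGT", "ACGTACGTCCC"))  # -> "ACGTACGT"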

