RabbitQC: high-speed scalable quality control for sequencing data

Author(s):  
Zekun Yin ◽  
Hao Zhang ◽  
Meiyang Liu ◽  
Wen Zhang ◽  
Honglei Song ◽  
...  

Abstract. Motivation: Modern sequencing technologies continue to revolutionize many areas of biology and medicine. Since the generated datasets are error-prone, downstream applications usually require quality control methods to pre-process FASTQ files. However, existing tools for this task cannot fully exploit the capabilities of computing platforms, which leads to slow runtimes. Results: We present RabbitQC, an extremely fast integrated quality control tool for FASTQ files, which can take full advantage of modern hardware. It includes a variety of operations and supports different sequencing technologies (Illumina, Oxford Nanopore and PacBio). RabbitQC achieves speedups of one to two orders of magnitude compared to other state-of-the-art tools. Availability and implementation: C++ sources and binaries are available at https://github.com/ZekunYin/RabbitQC. Supplementary information: Supplementary data are available at Bioinformatics online.
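As a rough illustration of what a FASTQ quality control step involves, the sketch below reads four-line FASTQ records and keeps reads whose mean Phred score meets a threshold. It is not RabbitQC's implementation (which is written in C++). The function names, the threshold of 20, and the Phred+33 encoding are assumptions made for this example.

```python
import io
from typing import Iterator, List, TextIO, Tuple


def read_fastq(handle: TextIO) -> Iterator[Tuple[str, str, str]]:
    """Yield (name, sequence, quality) from a four-line-per-record FASTQ stream."""
    while True:
        header = handle.readline()
        if not header:
            return  # end of input
        seq = handle.readline().rstrip("\n")
        handle.readline()  # '+' separator line, ignored
        qual = handle.readline().rstrip("\n")
        yield header.rstrip("\n")[1:], seq, qual  # drop the leading '@'


def mean_phred(qual: str, offset: int = 33) -> float:
    """Mean Phred score of a quality string (Phred+33 encoding assumed)."""
    if not qual:
        return 0.0
    return sum(ord(c) - offset for c in qual) / len(qual)


def filter_reads(handle: TextIO, min_mean_q: float = 20.0) -> Tuple[List[str], int]:
    """Return the names of reads passing the threshold and the total read count."""
    kept: List[str] = []
    total = 0
    for name, _seq, qual in read_fastq(handle):
        total += 1
        if mean_phred(qual) >= min_mean_q:
            kept.append(name)
    return kept, total


# Two toy reads: 'I' encodes Phred 40 and '!' encodes Phred 0.
example = io.StringIO("@r1\nACGT\n+\nIIII\n@r2\nACGT\n+\n!!!!\n")
kept, total = filter_reads(example)
print(kept, total)
```

A production tool would also handle compressed input, paired-end files and per-base trimming, none of which is attempted here.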

2020 ◽  
Vol 36 (17) ◽  
pp. 4568-4575
Author(s):  
Lolita Lecompte ◽  
Pierre Peterlongo ◽  
Dominique Lavenier ◽  
Claire Lemaitre

Abstract. Motivation: Studies on structural variants (SVs) are expanding rapidly. As a result, and thanks to third generation sequencing technologies, the number of discovered SVs is increasing, especially in the human genome. At the same time, for several applications such as clinical diagnoses, it is important to genotype newly sequenced individuals on well-defined and characterized SVs. Whereas several SV genotypers have been developed for short read data, there is a lack of dedicated tools to assess whether known SVs are present in a new long read sequenced sample, such as those produced by Pacific Biosciences or Oxford Nanopore Technologies. Results: We present a novel method to genotype known SVs from long read sequencing data. The method is based on the generation of a set of representative allele sequences that represent the two alleles of each structural variant. Long reads are aligned to these allele sequences. Alignments are then analyzed and filtered to keep only informative ones, in order to quantify the presence of each SV allele and estimate the allele frequencies. We provide an implementation of the method, SVJedi, to genotype SVs with long reads. The tool has been applied to both simulated and real human datasets and achieves high genotyping accuracy. We show that SVJedi performs better than other existing long read genotyping tools, and we also demonstrate that SV genotyping is considerably improved with SVJedi compared to other approaches, namely SV discovery and short read SV genotyping approaches. Availability and implementation: https://github.com/llecompte/SVJedi.git Supplementary information: Supplementary data are available at Bioinformatics online.
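The last step described above, turning per-allele read support into a genotype, can be sketched as follows. The abstract does not state which statistical model SVJedi uses, so the binomial likelihood, the 5% error rate and the function names below are assumptions made purely for illustration.

```python
from math import comb
from typing import Dict


def genotype_likelihoods(ref_count: int, alt_count: int, error: float = 0.05) -> Dict[str, float]:
    """Binomial likelihood of the observed allele counts under each diploid genotype."""
    n = ref_count + alt_count
    # Assumed probability that an informative read supports the alternative allele.
    p_alt = {"0/0": error, "0/1": 0.5, "1/1": 1.0 - error}
    return {
        gt: comb(n, alt_count) * p ** alt_count * (1.0 - p) ** ref_count
        for gt, p in p_alt.items()
    }


def allele_frequency(ref_count: int, alt_count: int) -> float:
    """Fraction of informative reads that support the alternative allele."""
    n = ref_count + alt_count
    return alt_count / n if n else 0.0


def call_genotype(ref_count: int, alt_count: int, error: float = 0.05) -> str:
    """Return the most likely genotype, or './.' when no read is informative."""
    if ref_count + alt_count == 0:
        return "./."
    likelihoods = genotype_likelihoods(ref_count, alt_count, error)
    return max(likelihoods, key=likelihoods.get)


print(call_genotype(20, 0), call_genotype(10, 10), call_genotype(0, 20))
```

Here `ref_count` and `alt_count` stand for the number of informative alignments to the reference and alternative allele sequences of one SV.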


2019 ◽  
Vol 35 (21) ◽  
pp. 4445-4447 ◽  
Author(s):  
Roberto Semeraro ◽  
Alberto Magi

Abstract. Motivation: Recent technological improvements in Oxford Nanopore sequencing have pushed the throughput of these devices to 10–20 Gb, allowing the generation of millions of reads. For these reasons, the availability of fast software packages for evaluating experimental quality by generating highly informative and interactive summary plots is of fundamental importance. Results: We developed PyPore, a three-module Python toolbox designed to handle raw FAST5 files from quality checking to alignment to a reference genome and to explore their features through the generation of browsable HTML files. The first module provides an interface to explore and evaluate the information contained in FAST5 files and summarize it into informative quality measures. The second module converts raw data to FASTQ format, while the third module makes it easy to use three state-of-the-art aligners and collects mapping statistics. Availability and implementation: PyPore is open-source software written in Python 2.7; source code is freely available, for all OS platforms, on GitHub at https://github.com/rsemeraro/PyPore Supplementary information: Supplementary data are available at Bioinformatics online.


Author(s):  
Kang Hu ◽  
Neng Huang ◽  
You Zou ◽  
Xingyu Liao ◽  
Jianxin Wang

Abstract. Motivation: Compared with second-generation sequencing technologies, third-generation sequencing technologies allow us to obtain longer reads (average ∼10 kbp, maximum 900 kbp), but bring a higher error rate (∼15%). Nanopolish is a variant and methylation detection tool based on a hidden Markov model, which uses Oxford Nanopore sequencing data for signal-level analysis. Nanopolish can greatly improve the accuracy of assembly, but it is limited by long running times, since most of its execution is serial and computationally expensive. Results: In this paper, we present an effective polishing tool, Multithreading Nanopolish (MultiNanopolish), which decomposes the whole process of iterative calculation in Nanopolish into small independent calculation tasks, making it possible to run this process in parallel. Experimental results show that, when run with 40 threads, MultiNanopolish reduces running time by 50% with a read-uncorrected assembler (Miniasm) and by 20% with read-corrected assemblers (Canu and Flye) compared to the original Nanopolish. Availability and implementation: MultiNanopolish is available on GitHub: https://github.com/BioinformaticsCSU/MultiNanopolish Supplementary information: Supplementary data are available at Bioinformatics online.
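The general idea of decomposing one long serial computation into small independent tasks can be sketched as below. This is not MultiNanopolish code: the window size, the helper names and the per-window work (a simple G/C count standing in for the real polishing computation) are all hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import List, Tuple

Window = Tuple[int, int]


def split_into_windows(length: int, window: int) -> List[Window]:
    """Cut the range [0, length) into consecutive half-open windows."""
    return [(start, min(start + window, length)) for start in range(0, length, window)]


def process_window(task: Tuple[str, Window]) -> Tuple[int, int]:
    """Hypothetical stand-in for per-window work: count G/C bases in the window."""
    contig, (start, end) = task
    return start, sum(1 for base in contig[start:end] if base in "GC")


def run_parallel(contig: str, window: int = 4, threads: int = 4) -> List[Tuple[int, int]]:
    """Process every window as an independent task and return results in window order."""
    tasks = [(contig, w) for w in split_into_windows(len(contig), window)]
    with ThreadPoolExecutor(max_workers=threads) as pool:
        # Executor.map yields results in the order the tasks were submitted.
        return list(pool.map(process_window, tasks))


results = run_parallel("ACGTGGCCAT", window=4)
print(results)
```

Because the windows share no state, the results can be merged in submission order. In CPython, threads illustrate the task structure rather than a real speedup for CPU-bound work; a process pool or native threads would be needed for that.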


2019 ◽  
Vol 35 (21) ◽  
pp. 4255-4263 ◽  
Author(s):  
Mohammed Alser ◽  
Hasan Hassan ◽  
Akash Kumar ◽  
Onur Mutlu ◽  
Can Alkan

Abstract. Motivation: The ability to generate massive amounts of sequencing data continues to overwhelm the processing capability of existing algorithms and compute infrastructures. In this work, we explore the use of hardware/software co-design and hardware acceleration to significantly reduce the execution time of short sequence alignment, a crucial step in analyzing sequenced genomes. We introduce Shouji, a highly parallel and accurate pre-alignment filter that remarkably reduces the need for computationally-costly dynamic programming algorithms. The first key idea of our proposed pre-alignment filter is to provide high filtering accuracy by correctly detecting all common subsequences shared between two given sequences. The second key idea is to design a hardware accelerator that adopts modern field-programmable gate array (FPGA) architectures to further boost the performance of our algorithm. Results: Shouji significantly improves the accuracy of pre-alignment filtering by up to two orders of magnitude compared to the state-of-the-art pre-alignment filters, GateKeeper and SHD. Our FPGA-based accelerator is up to three orders of magnitude faster than the equivalent CPU implementation of Shouji. Using a single FPGA chip, we benchmark the benefits of integrating Shouji with five state-of-the-art sequence aligners, designed for different computing platforms. The addition of Shouji as a pre-alignment step reduces the execution time of the five state-of-the-art sequence aligners by up to 18.8×. Shouji can be adapted for any bioinformatics pipeline that performs sequence alignment for verification. Unlike most existing methods that aim to accelerate sequence alignment, Shouji does not sacrifice any of the aligner capabilities, as it does not modify or replace the alignment step. Availability and implementation: https://github.com/CMU-SAFARI/Shouji. Supplementary information: Supplementary data are available at Bioinformatics online.


Cells ◽  
2020 ◽  
Vol 9 (8) ◽  
pp. 1776
Author(s):  
Mourdas Mohamed ◽  
Nguyet Thi-Minh Dang ◽  
Yuki Ogyama ◽  
Nelly Burlet ◽  
Bruno Mugat ◽  
...  

Transposable elements (TEs) are the main components of genomes. However, due to their repetitive nature, they are very difficult to study using data obtained with short-read sequencing technologies. Here, we describe an efficient pipeline to accurately recover TE insertion (TEI) sites and sequences from long reads obtained by Oxford Nanopore Technology (ONT) sequencing. With this pipeline, we could precisely describe the landscapes of the most recent TEIs in wild-type strains of Drosophila melanogaster and Drosophila simulans. Their comparison suggests that this subset of TE sequences is more similar than previously thought in these two species. The chromosome assemblies obtained using this pipeline also allowed recovering piRNA cluster sequences, which was impossible using short-read sequencing. Finally, we used our pipeline to analyze ONT sequencing data from a D. melanogaster unstable line in which LTR transposition was derepressed for 73 successive generations. We could rely on single reads to identify new insertions with intact target site duplications. Moreover, the detailed analysis of TEIs in the wild-type strains and the unstable line did not support the trap model claiming that piRNA clusters are hotspots of TE insertions.


2020 ◽  
Vol 872 ◽  
pp. 114328
Author(s):  
Haydn J. Ward ◽  
Tobias A. Armstrong-Telfer ◽  
Stephen M. Kelly ◽  
Nathan S. Lawrence ◽  
Jay D. Wadhawan

2021 ◽  
Vol 12 ◽  
Author(s):  
Davide Bolognini ◽  
Alberto Magi

Structural variants (SVs) are genomic rearrangements that involve at least 50 nucleotides and are known to have a serious impact on human health. While prior short-read sequencing technologies have often proved inadequate for a comprehensive assessment of structural variation, more recent long reads from Oxford Nanopore Technologies have already been proven invaluable for the discovery of large SVs and hold the potential to facilitate the resolution of the full SV spectrum. With many long-read sequencing studies to follow, it is crucial to assess factors affecting current SV calling pipelines for nanopore sequencing data. In this brief research report, we evaluate and compare the performances of five long-read SV callers across four long-read aligners using both real and synthetic nanopore datasets. In particular, we focus on the effects of read alignment, sequencing coverage, and variant allele depth on the detection and genotyping of SVs of different types and size ranges and provide insights into precision and recall of SV callsets generated by integrating the various long-read aligners and SV callers. The computational pipeline we propose is publicly available at https://github.com/davidebolo1993/EViNCe and can be adjusted to further evaluate future nanopore sequencing datasets.


2017 ◽  
Author(s):  
Krešimir Križanović ◽  
Ivan Sović ◽  
Ivan Krpelnik ◽  
Mile Šikić

Abstract. Next generation sequencing (NGS) technologies have made RNA sequencing widely accessible and applicable in many areas of research. In recent years, third generation sequencing technologies have matured and are slowly replacing NGS for DNA sequencing. This paper presents a novel tool for RNA mapping guided by gene annotations. The tool is an adapted version of a previously developed DNA mapper, GraphMap, tailored for third generation sequencing data, such as those produced by Pacific Biosciences or Oxford Nanopore Technologies devices. It uses gene annotations to generate a transcriptome, uses a DNA mapping algorithm to map reads to the transcriptome, and finally transforms the mappings back to genome coordinates. The modified version of GraphMap is compared on several synthetic datasets to state-of-the-art RNA-seq mappers that can work with third generation sequencing data. The results show that our tool outperforms other tools in general mapping quality.
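The final step mentioned above, transforming a mapping on the transcriptome back to genome coordinates, can be sketched for a single position as follows. This is a simplified illustration, not the tool's code: it assumes a forward-strand transcript described by half-open, 0-based exon intervals listed in transcript order, and the function name is hypothetical.

```python
from typing import List, Tuple


def transcript_to_genome(pos: int, exons: List[Tuple[int, int]]) -> int:
    """Convert a 0-based transcript offset to a 0-based genome coordinate.

    `exons` holds half-open (start, end) genome intervals in transcript order
    on the forward strand (an assumption of this sketch).
    """
    if pos < 0:
        raise ValueError("position must be non-negative")
    remaining = pos
    for start, end in exons:
        length = end - start
        if remaining < length:
            return start + remaining  # the offset falls inside this exon
        remaining -= length  # skip this exon and the intron that follows it
    raise ValueError("position lies beyond the end of the transcript")


# A two-exon transcript: 50 bases at 100..150, then 60 bases at 200..260.
exons = [(100, 150), (200, 260)]
print(transcript_to_genome(49, exons), transcript_to_genome(50, exons))
```

A full conversion would apply this to both ends of every aligned block and split alignments that cross exon boundaries, and would reverse the offsets for transcripts on the minus strand.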


2021 ◽  
Author(s):  
Brandon K. B. Seah ◽  
Estienne C. Swart

Ciliates are single-celled eukaryotes that eliminate specific, interspersed DNA sequences (internally eliminated sequences, IESs) from their genomes during development. These are challenging to annotate and assemble because IES-containing sequences are much less abundant in the cell than those without, and IES sequences themselves often contain repetitive and low-complexity sequences. Long read sequencing technologies from Pacific Biosciences and Oxford Nanopore have the potential to reconstruct longer IESs than has been possible with short reads, and also the ability to detect correlations of neighboring element elimination. Here we present BleTIES, a software toolkit for detecting, assembling, and analyzing IESs using mapped long reads. Availability and implementation: BleTIES is implemented in Python 3. Source code is available at https://github.com/Swart-lab/bleties (MIT license), and also distributed via Bioconda. Contact: [email protected] Supplementary information: Benchmarking of BleTIES with published sequence data.


2020 ◽  
Vol 10 (4) ◽  
pp. 1193-1196
Author(s):  
Yoshinori Fukasawa ◽  
Luca Ermini ◽  
Hai Wang ◽  
Karen Carty ◽  
Min-Sin Cheung

We propose LongQC as an easy and automated quality control tool for genomic datasets generated by third generation sequencing (TGS) technologies such as Oxford Nanopore Technologies (ONT) and SMRT sequencing from Pacific Biosciences (PacBio). Key statistics were optimized for long read data, and LongQC covers all major TGS platforms. LongQC processes and visualizes those statistics automatically and quickly.

