Prowler: a novel trimming algorithm for Oxford Nanopore sequence data

Abstract Motivation Trimming and filtering tools are useful in DNA sequencing analysis because they increase the accuracy of sequence alignments and thus the reliability of results. Oxford Nanopore Technologies (ONT) trimming and filtering tools are currently rudimentary, generally only filtering reads based on whole read average quality. This results in discarding reads that contain regions of high-quality sequence. Here, we propose Prowler, a trimmer that uses a window-based approach inspired by algorithms used to trim short read data. Importantly, we retain the phase and read length information by optionally replacing trimmed sections with Ns. Results Prowler was applied to mammalian and bacterial datasets, to assess its effect on alignment and assembly, respectively. Compared to data filtered with Nanofilt, alignments of data trimmed with Prowler had lower error rates and more mapped reads. Assemblies of Prowler trimmed data had a lower error rate than those filtered with Nanofilt; however, this came at some cost to assembly contiguity. Availability and implementation Prowler is implemented in Python and is available at https://github.com/ProwlerForNanopore/ProwlerTrimmer. Supplementary information Supplementary data are available at Bioinformatics online.
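The window-based idea in this abstract can be illustrated with a minimal sketch. It is not Prowler's implementation: the function names, the non-overlapping windows, the window size and the quality threshold are all assumptions chosen for illustration. It only shows how masking low-quality windows with Ns keeps the read length and the spacing between retained segments.

```python
from statistics import mean


def window_mask(seq, quals, window=4, min_q=7.0):
    """Mask low-quality windows with N, keeping the read length unchanged.

    window and min_q are illustrative values, not Prowler defaults.
    """
    if len(seq) != len(quals):
        raise ValueError("sequence and quality lengths differ")
    out = []
    # Walk the read in non-overlapping windows; the last one may be shorter.
    for start in range(0, len(seq), window):
        chunk = seq[start:start + window]
        chunk_q = quals[start:start + window]
        if mean(chunk_q) >= min_q:
            out.append(chunk)             # window is good enough: keep bases
        else:
            out.append("N" * len(chunk))  # window is poor: mask, do not delete
    return "".join(out)


def clip_ends(masked):
    """Clip leading and trailing N runs; internal Ns stay to preserve spacing."""
    return masked.strip("N")


read = "ACGTACGTACGT"
phred = [12, 12, 12, 12, 3, 3, 3, 3, 12, 12, 12, 12]
print(window_mask(read, phred))  # ACGTNNNNACGT
```

Because the masked read has the same length as the input, the two retained segments stay in phase with each other, which is the property the abstract highlights.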


Prowler: A novel trimming algorithm for Oxford Nanopore sequence data

10.1101/2021.05.09.443332 ◽

2021 ◽

Author(s):

Simon Lee ◽

Loan T. Nguyen ◽

Ben J. Hayes ◽

Elizabeth M Ross

Keyword(s):

Sequence Data ◽

Error Rates ◽

Read Length ◽

Sequencing Analysis ◽

Sequence Alignments ◽

Lower Error ◽

Oxford Nanopore ◽

High Quality Sequence ◽

Dna Sequencing Analysis ◽

Window Approach

Motivation: Quality control (QC) tools are critical in DNA sequencing analysis because they increase the accuracy of sequence alignments and thus the reliability of results. Oxford Nanopore Technologies (ONT) QC is currently rudimentary, generally based on whole read average quality. This results in discarding reads that contain regions of high quality sequence. Here we propose Prowler, a multi-window approach inspired by algorithms used to QC short read data. Importantly, we retain the phase and read length information by optionally replacing trimmed sections with Ns. Results: Prowler was applied to mammalian and bacterial datasets, to assess effects on alignment and assembly respectively. Compared to Nanofilt, alignments of data QCed with Prowler had lower error rates and more mapped reads. Assemblies of Prowler QCed data had a lower error rate than Nanofilt QCed data; however, this came at some cost to assembly contiguity. Availability and implementation: Prowler is implemented in Python and is available at: https://github.com/ProwlerForNanopore/ProwlerTrimmer Contact: [email protected]
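To make the drawback of whole-read averaging concrete, here is a small sketch of such a filter. The function names and the threshold are hypothetical, and averaging through error probabilities is one common convention rather than a statement about how any particular tool computes it.

```python
import math


def mean_read_quality(quals):
    """Average Phred quality of a read, computed through error probabilities."""
    probs = [10 ** (-q / 10) for q in quals]
    return -10 * math.log10(sum(probs) / len(probs))


def passes_whole_read_filter(quals, min_mean_q=7.0):
    """Keep or reject the entire read based on a single average (threshold is illustrative)."""
    return mean_read_quality(quals) >= min_mean_q


# A read whose first half is Q20 and whose second half is Q2.
mixed = [20] * 50 + [2] * 50
print(round(mean_read_quality(mixed), 2), passes_whole_read_filter(mixed))
```

The mixed read is rejected as a whole even though half of it is high quality, which is the loss that a window-based approach is meant to avoid.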


VisFeature: a stand-alone program for visualizing and analyzing statistical features of biological sequences

Bioinformatics ◽

10.1093/bioinformatics/btz689 ◽

2019 ◽

Cited By ~ 3

Author(s):

Jun Wang ◽

Pu-Feng Du ◽

Xin-Yu Xue ◽

Guang-Ping Li ◽

Yuan-Ke Zhou ◽

...

Keyword(s):

Sequence Data ◽

Software Tool ◽

Data Retrieval ◽

Supplementary Information ◽

Statistical Features ◽

Biological Sequence ◽

Sequence Alignments ◽

Multiple Sequence ◽

Source Codes ◽

Multiple Sequence Alignments

Abstract Summary Many efforts have been made in developing bioinformatics algorithms to predict functional attributes of genes and proteins from their primary sequences. One challenge in this process is to intuitively analyze and to understand the statistical features that have been selected by heuristic or iterative methods. In this paper, we developed VisFeature, which aims to be a helpful software tool that allows the users to intuitively visualize and analyze statistical features of all types of biological sequence, including DNA, RNA and proteins. VisFeature also integrates sequence data retrieval, multiple sequence alignments and statistical feature generation functions. Availability and implementation VisFeature is a desktop application that is implemented using JavaScript/Electron and R. The source code of VisFeature is freely accessible from the GitHub repository (https://github.com/wangjun1996/VisFeature). The binary release, which includes an example dataset, can be freely downloaded from the same GitHub repository (https://github.com/wangjun1996/VisFeature/releases). Contact [email protected] or [email protected] Supplementary information Supplementary data are available at Bioinformatics online.


Prediction of mutation effects using a deep temporal convolutional network

Bioinformatics ◽

10.1093/bioinformatics/btz873 ◽

2019 ◽

Vol 36 (7) ◽

pp. 2047-2052 ◽

Cited By ~ 1

Author(s):

Ha Young Kim ◽

Dongsup Kim

Keyword(s):

Latent Variable ◽

Sequence Data ◽

Generative Model ◽

Supplementary Information ◽

Biological Research ◽

Sequence Alignments ◽

Variable Model ◽

Convolutional Network ◽

Direct Optimization ◽

Multiple Sequence

Abstract Motivation Accurate prediction of the effects of genetic variation is a major goal in biological research. Towards this goal, numerous machine learning models have been developed to learn information from evolutionary sequence data. The most effective method so far is a deep generative model based on the variational autoencoder (VAE) that models the distributions using a latent variable. In this study, we propose a deep autoregressive generative model named mutationTCN, which employs dilated causal convolutions and an attention mechanism for the modeling of inter-residue correlations in a biological sequence. Results We show that this model is competitive with the VAE model when tested against a set of 42 high-throughput mutation scan experiments, with the mean improvement in Spearman rank correlation ∼0.023. In particular, our model can more efficiently capture information from multiple sequence alignments with a lower effective number of sequences, such as in viral sequence families, compared with the latent variable model. Also, we extend this architecture to a semi-supervised learning framework, which shows high prediction accuracy. We show that our model enables a direct optimization of the data likelihood and allows for a simple and stable training process. Availability and implementation Source code is available at https://github.com/ha01994/mutationTCN. Supplementary information Supplementary data are available at Bioinformatics online.
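For readers unfamiliar with dilated causal convolutions, the following is a generic, dependency-free sketch of the operation on a one-dimensional sequence. It is not taken from mutationTCN; the function names and the zero left-padding are assumptions made for illustration.

```python
def causal_dilated_conv(x, weights, dilation=1):
    """1-D causal convolution: out[t] depends only on x[t], x[t - d], x[t - 2d], ...

    weights[0] multiplies the current position and weights[k] multiplies the
    value k * dilation steps in the past. Positions before the start of the
    sequence are treated as zero (left padding).
    """
    out = []
    for t in range(len(x)):
        total = 0.0
        for k, w in enumerate(weights):
            idx = t - k * dilation
            if idx >= 0:
                total += w * x[idx]
        out.append(total)
    return out


def receptive_field(kernel_size, dilations):
    """Number of input positions visible to the top of a stack of such layers."""
    return 1 + (kernel_size - 1) * sum(dilations)


print(causal_dilated_conv([1, 2, 3, 4, 5], [1, 1], dilation=2))  # [1.0, 2.0, 4.0, 6.0, 8.0]
print(receptive_field(2, [1, 2, 4, 8]))                          # 16
```

Stacking layers with growing dilation widens the receptive field quickly while each output still ignores positions to its right, which is what makes the operation suitable for an autoregressive model over residues.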


BleTIES: Annotation of natural genome editing in ciliates using long read sequencing

10.1101/2021.05.18.444610 ◽

2021 ◽

Author(s):

Brandon K. B. Seah ◽

Estienne C. Swart

Keyword(s):

Dna Sequences ◽

Sequence Data ◽

Low Complexity ◽

Supplementary Information ◽

Neighboring Element ◽

Sequencing Technologies ◽

Long Reads ◽

Oxford Nanopore ◽

Long Read ◽

Element Elimination

Ciliates are single-celled eukaryotes that eliminate specific, interspersed DNA sequences (internally eliminated sequences, IESs) from their genomes during development. These are challenging to annotate and assemble because IES-containing sequences are much less abundant in the cell than those without, and IES sequences themselves often contain repetitive and low-complexity sequences. Long read sequencing technologies from Pacific Biosciences and Oxford Nanopore have the potential to reconstruct longer IESs than has been possible with short reads, and also the ability to detect correlations of neighboring element elimination. Here we present BleTIES, a software toolkit for detecting, assembling, and analyzing IESs using mapped long reads. Availability and implementation: BleTIES is implemented in Python 3. Source code is available at https://github.com/Swart-lab/bleties (MIT license), and also distributed via Bioconda. Contact: [email protected] Supplementary information: Benchmarking of BleTIES with published sequence data.


Penicillium simile sp. nov. revealed by morphological and phylogenetic analysis

INTERNATIONAL JOURNAL OF SYSTEMATIC AND EVOLUTIONARY MICROBIOLOGY ◽

10.1099/ijs.0.031682-0 ◽

2012 ◽

Vol 62 (2) ◽

pp. 451-458 ◽

Cited By ~ 8

Author(s):

Domenico Davolos ◽

Biancamaria Pietrangeli ◽

Anna Maria Persiani ◽

Oriana Maggi

Keyword(s):

Sequence Data ◽

Taxonomic Status ◽

Distinct Species ◽

Rrna Genes ◽

Chain Gene ◽

Sequencing Analysis ◽

Internal Transcribed Spacer Region ◽

Protein Coding ◽

Nuclear Loci ◽

Dna Sequencing Analysis

The morphology of three phenetically identical Penicillium isolates, collected from the bioaerosol in a restoration laboratory in Italy, displayed macro- and microscopic characteristics that were similar though not completely ascribable to Penicillium raistrickii. For this reason, a phylogenetic approach based on DNA sequencing analysis was performed to establish both the taxonomic status and the evolutionary relationships of these three peculiar isolates in relation to previously described species of the genus Penicillium. We used four nuclear loci (both rRNA and protein coding genes) that have previously proved useful for the molecular investigation of taxa belonging to the genus Penicillium at various evolutionary levels. The internal transcribed spacer region (ITS1–5.8S–ITS2), domains D1 and D2 of the 28S rDNA, a region of the tubulin beta chain gene (benA) and part of the calmodulin gene (cmd) were amplified by PCR and sequenced. Analysis of the rRNA genes and of the benA and cmd sequence data indicates the presence of three isogenic isolates belonging to a genetically distinct species of the genus Penicillium, here described and named Penicillium simile sp. nov. (ATCC MYA-4591T = CBS 129191T). This novel species is phylogenetically different from P. raistrickii and other related species of the genus Penicillium (e.g. Penicillium scabrosum), from which it can be distinguished on the basis of morphological trait analysis.


A workflow for accurate metabarcoding using nanopore MinION sequencing

10.1101/2020.05.21.108852 ◽

2020 ◽

Cited By ~ 2

Author(s):

Bilgenur Baloğlu ◽

Zhewei Chen ◽

Vasco Elbrecht ◽

Thomas Braukmann ◽

Shanna MacDonald ◽

...

Keyword(s):

High Throughput Sequencing ◽

Sequence Data ◽

Rolling Circle Amplification ◽

Error Rates ◽

Read Length ◽

Taxonomic Assignment ◽

Major Drawback ◽

Rolling Circle ◽

Sequencing Platform ◽

Sequencing Platforms

Abstract Metabarcoding has become a common approach to the rapid identification of the species composition in a mixed sample. The majority of studies use established short-read high-throughput sequencing platforms. The Oxford Nanopore MinION™, a portable sequencing platform, represents a low-cost alternative allowing researchers to generate sequence data in the field. However, a major drawback is the high raw read error rate that can range from 10% to 22%. To test if the MinION™ represents a viable alternative to other sequencing platforms we used rolling circle amplification (RCA) to generate full-length consensus DNA barcodes (658 bp of cytochrome oxidase I - COI) for a bulk mock sample of 50 aquatic invertebrate species. By applying two different laboratory protocols, we generated two MinION™ runs that were used to build consensus sequences. We also developed a novel Python pipeline, ASHURE, for processing, consensus building, clustering, and taxonomic assignment of the resulting reads. We were able to show that it is possible to reduce error rates to a median accuracy of up to 99.3% for long RCA fragments (>45 barcodes). Our pipeline successfully identified all 50 species in the mock community and exhibited comparable sensitivity and accuracy to MiSeq. The use of RCA was integral for increasing consensus accuracy, but it was also the most time-consuming step during the laboratory workflow and most RCA reads were skewed towards a shorter read length range with a median RCA fragment length of up to 1262 bp. Our study demonstrates that Nanopore sequencing can be used for metabarcoding but we recommend the exploration of other isothermal amplification procedures to improve consensus length.
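Why repeated copies of the same barcode raise accuracy can be shown with a simplified majority-vote consensus. This sketch is not the ASHURE pipeline: it assumes the copies have already been split out of the RCA read and aligned to equal length, and the function name is hypothetical.

```python
from collections import Counter


def majority_consensus(copies):
    """Per-column majority vote over equal-length, pre-aligned copies.

    Assumes alignment has already been done; ties go to the base seen
    first in that column.
    """
    if not copies:
        raise ValueError("need at least one copy")
    length = len(copies[0])
    if any(len(c) != length for c in copies):
        raise ValueError("copies must be aligned to the same length")
    consensus = []
    for column in zip(*copies):
        base, _count = Counter(column).most_common(1)[0]
        consensus.append(base)
    return "".join(consensus)


# Three noisy copies of the same fragment, each with one error at a different site.
print(majority_consensus(["ACGTAC", "ACCTAC", "ACGTTC"]))  # ACGTAC
```

As long as errors fall at different positions in different copies, each one is outvoted, so more copies per read generally give a more accurate consensus.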


LongTron: Automated Analysis of Long Read Spliced Alignment Accuracy

10.1101/2020.11.10.376871 ◽

2020 ◽

Author(s):

Christopher Wilks ◽

Michael C. Schatz

Keyword(s):

Random Forest ◽

Cancer Cell Line ◽

Automated Analysis ◽

Error Rates ◽

Supplementary Information ◽

Splice Sites ◽

Link Type ◽

Spliced Alignment ◽

Oxford Nanopore ◽

Long Read

Abstract Motivation Long read sequencing has increased the accuracy and completeness of assemblies of various organisms’ genomes in recent months. Similarly, spliced alignments of long read RNA sequencing hold the promise of delivering much longer transcripts of existing and novel isoforms in known genes without the need for error-prone transcript assemblies from short reads. However, low coverage and high-error rates potentially hamper the widespread adoption of long-read spliced alignments in annotation updates and isoform-level expression quantifications. Results Addressing these issues, we first develop a simulation of error modes for both Oxford Nanopore and PacBio CCS spliced-alignments. Based on this we train a Random Forest classifier to assign new long-read alignments to one of two error categories, a novel category, or label them as non-error. We use this classifier to label reads from the spliced-alignments of the popular aligner minimap2, run on three long read sequencing datasets, including NA12878 from Oxford Nanopore and PacBio CCS, as well as a PacBio SKBR3 cancer cell line. Finally, we compare the intron chains of the three long read alignments against individual splice sites, short read assemblies, and the output from the FLAIR pipeline on the same samples. Our results demonstrate a substantial lack of precision in determining exact splice sites for long reads during alignment on both platforms while showing some benefit from postprocessing. This work motivates the need for both better aligners and additional post-alignment processing to adjust incorrectly called putative splice-sites and clarify novel transcripts support. Availability and implementation Source code for the random forest implemented in python is available at https://github.com/schatzlab/LongTron under the MIT license. The modified version of GffCompare used to construct Table 3 and related is here: https://github.com/ChristopherWilks/gffcompare/releases/tag/0.11.2LT Supplementary Information Supplementary notes and figures are available online.
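The notion of an intron chain can be sketched from a spliced alignment's CIGAR string, where the N operation marks a skipped reference region. The code below is a generic illustration and not part of LongTron; the function names and the tolerance parameter are hypothetical, and coordinates are taken as 0-based, half-open.

```python
import re

CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def intron_chain(ref_start, cigar):
    """Return the N-skips of an alignment as (start, end) reference intervals."""
    introns = []
    pos = ref_start
    for length, op in CIGAR_OP.findall(cigar):
        length = int(length)
        if op == "N":
            introns.append((pos, pos + length))
        if op in "MDN=X":
            pos += length  # only these operations consume reference bases
    return introns


def chains_match(observed, annotated, tolerance=0):
    """True if both chains have the same introns, within `tolerance` bases per splice site."""
    if len(observed) != len(annotated):
        return False
    return all(
        abs(o_start - a_start) <= tolerance and abs(o_end - a_end) <= tolerance
        for (o_start, o_end), (a_start, a_end) in zip(observed, annotated)
    )


chain = intron_chain(100, "50M200N30M2I20M100N40M")
print(chain)  # [(150, 350), (400, 500)]
print(chains_match(chain, [(150, 350), (398, 500)]))               # False
print(chains_match(chain, [(150, 350), (398, 500)], tolerance=2))  # True
```

A chain that only matches once a tolerance is allowed is the kind of near-miss splice site the abstract describes as lacking precision.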


annonex2embl: automatic preparation of annotated DNA sequences for bulk submissions to ENA

Bioinformatics ◽

10.1093/bioinformatics/btaa209 ◽

2020 ◽

Vol 36 (12) ◽

pp. 3841-3848

Author(s):

Michael Gruenstaeudl

Keyword(s):

Dna Sequences ◽

Large Scale ◽

Sequence Data ◽

Complete Sequence ◽

Supplementary Information ◽

Biological Research ◽

Sequence Alignments ◽

Easy Integration ◽

Central Pillar ◽

Python Package

Abstract Motivation The submission of annotated sequence data to public sequence databases constitutes a central pillar in biological research. The surge of novel DNA sequences awaiting database submission due to the application of next-generation sequencing has increased the need for software tools that facilitate bulk submissions. This need has yet to be met with the concurrent development of tools to automate the preparatory work preceding such submissions. Results The author introduces annonex2embl, a Python package that automates the preparation of complete sequence flatfiles for large-scale sequence submissions to the European Nucleotide Archive. The tool enables the conversion of DNA sequence alignments that are co-supplied with sequence annotations and metadata to submission-ready flatfiles. Among other features, the software automatically accounts for length differences among the input sequences while maintaining correct annotations, automatically interlaces metadata to each record and displays a design suitable for easy integration into bioinformatic workflows. As proof of its utility, annonex2embl is employed in preparing a dataset of more than 1500 fungal DNA sequences for database submission. Availability and implementation annonex2embl is freely available via the Python package index at http://pypi.python.org/pypi/annonex2embl. Supplementary information Supplementary data are available at Bioinformatics online.
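The step of accounting for length differences while keeping annotations correct can be pictured as removing alignment gaps from each sequence and shifting feature coordinates accordingly. The sketch below is an assumption about what such a step involves, not annonex2embl's code or API; the function name and the feature tuple layout are hypothetical.

```python
def degap_with_features(aligned_seq, features, gap="-"):
    """Remove gap characters and remap feature coordinates.

    features: list of (name, start, end) in 0-based, half-open alignment columns.
    Returns (ungapped_sequence, remapped_features). Features that cover only
    gaps in this sequence are dropped.
    """
    # prefix[i] = number of non-gap characters in aligned_seq[:i]
    prefix = [0]
    for ch in aligned_seq:
        prefix.append(prefix[-1] + (ch != gap))
    ungapped = aligned_seq.replace(gap, "")
    remapped = []
    for name, start, end in features:
        new_start, new_end = prefix[start], prefix[end]
        if new_end > new_start:
            remapped.append((name, new_start, new_end))
    return ungapped, remapped


seq, feats = degap_with_features("AC--GTA-CC", [("geneA", 0, 6), ("spacer", 6, 10)])
print(seq)    # ACGTACC
print(feats)  # [('geneA', 0, 4), ('spacer', 4, 7)]
```

Each record in an alignment then gets its own sequence length and its own feature coordinates, even though all records shared the same alignment-wide annotation.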


Automated Purification of Dye Terminator Sequencing Reactions: An Approach to High-Throughput Capillary Electrophoresis Sequencing of Large Templates

JALA Journal of the Association for Laboratory Automation ◽

10.1016/s1535-5535-03-00006-6 ◽

2003 ◽

Vol 8 (5) ◽

pp. 19-23

Author(s):

Amy Gernon ◽

Ermias Woldu ◽

Michele Godlevski ◽

Willie Wilson ◽

Rodney C. Gilmore ◽

...

Keyword(s):

Sequence Data ◽

Rolling Circle Amplification ◽

Read Length ◽

Bacterial Artificial Chromosomes ◽

Rolling Circle ◽

Large Molecules ◽

Artificial Chromosomes ◽

High Quality Sequence ◽

Template Dna ◽

Sequence Quality

Demands for higher quantity and quality of sequence data during genome sequencing projects have led to a need for completely automated reagent systems designed to isolate, process, and analyze DNA samples. While much attention has been given to methodologies aimed at increasing the throughput of sample preparation and reaction setup, purification of the products of sequencing reactions has received less scrutiny despite the profound influence that purification has on sequence quality. Commonly used and commercially available sequencing reaction cleanup methods are not optimal for purifying sequencing reactions generated from larger templates, including bacterial artificial chromosomes (BACs) and those generated by rolling circle amplification. Theoretically, these methods would not remove the original template since they only exclude small molecules and retain large molecules in the sample. If the large template remains in the purified sample, it could understandably interfere with electrokinetic injection and capillary performance. We demonstrate that the use of MagneSil® paramagnetic particles (PMPs) to purify ABI PRISM® BigDye® sequencing reactions increases the quality and read length of sequences from large templates. The high-quality sequence data obtained by our procedure is independent of the size of template DNA used and can be completely automated on a variety of automated platforms.


Constructing a Reference Genome in a Single Lab: The Possibility to Use Oxford Nanopore Technology

10.20944/preprints201906.0117.v1 ◽

2019 ◽

Author(s):

Yun Gyeong Lee ◽

Sang Chul Choi ◽

Yuna Kang ◽

Kyeong Min Kim ◽

Chon-Sik Kang ◽

...

Keyword(s):

Plant Species ◽

Genome Sequencing ◽

Reference Genome ◽

Plant Genome ◽

Read Length ◽

Sequence Information ◽

Sequencing Analysis ◽

Oxford Nanopore ◽

A Genome ◽

Long Read

Whole genome sequencing (WGS) has become a crucial tool to understand genome structure and genetic variation. The MinION sequencing of Oxford Nanopore Technologies (ONT) is an excellent approach for performing WGS and has advantages in comparison with other Next-Generation Sequencing (NGS): it is relatively inexpensive, portable, has simple library preparation, can be monitored in real-time, and has no theoretical limits on read length. Sorghum bicolor (L.) Moench is diploid (2n = 2x = 20) with a genome size of about 730 Mb, and its genome sequence information is released in the Phytozome database. Therefore, sorghum can be used as a good reference. However, plant species have complex and large genomes compared to animals or microorganisms. As a result, complete genome sequencing is difficult for plant species. MinION sequencing that produces long-reads can be an excellent tool to overcome the weak assembly of short-reads generated from NGS by minimizing the generation of gaps or covering the repetitive sequence that appears on the plant genome. Here, we conducted the genome sequencing for S. bicolor cv. BTx623 using the MinION platform and obtained 895,678 reads and 17.9 gigabytes (Gb) (ca. 25X coverage of reference) from long-read sequence data. Through de novo assembly using two different tools and mapping of the assembled contigs against the sorghum reference genome, a total of 6,124 contigs (covering 45.9%) were generated from Canu, and a total of 2,661 contigs (covering 50%) were generated from Minimap and Miniasm with a Racon pipeline. Our results provide a pipeline of long-read sequencing analysis for plant species using the MinION platform and a clue to determine the total sequencing scale for optimal coverage based on various genome sizes.
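The coverage figure quoted in this abstract follows from simple arithmetic, sketched below. The calculation assumes the 17.9 Gb yield counts sequenced bases and uses the stated genome size of about 730 Mb; the function names and the 30X target are illustrative only.

```python
def coverage(total_bases, genome_size):
    """Average depth of coverage: sequenced bases divided by genome size."""
    return total_bases / genome_size


def bases_needed(target_coverage, genome_size):
    """Sequencing yield required to reach a target depth for a given genome size."""
    return target_coverage * genome_size


yield_bases = 17.9e9   # assumed to be bases, as reported in the abstract
genome_size = 730e6    # approximate sorghum genome size from the abstract
n_reads = 895_678

print(round(coverage(yield_bases, genome_size), 1))  # 24.5, i.e. roughly 25X
print(round(yield_bases / n_reads))                  # mean read length in bases
print(bases_needed(30, genome_size))                 # yield for a hypothetical 30X target
```

The same two functions can be reused to estimate the sequencing scale for other genome sizes, which is the planning use the abstract points to.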
