PacBio library preparation using blunt-end adapter ligation produces significant artefactual fusion DNA sequences

10.1101/245241 ◽

2018 ◽

Author(s):

Paul Griffith ◽

Castle Raley ◽

David Sun ◽

Yongmei Zhao ◽

Zhonghe Sun ◽

...

Keyword(s):

Single Molecule ◽

Dna Sequences ◽

Turnaround Time ◽

Library Preparation ◽

Adapter Ligation ◽

Sequence Production ◽

Long Reads ◽

Complex Structural Variation ◽

Long Read ◽

Complex Structural

Abstract: Pacific Biosciences’ (PacBio) RS II sequencer, utilizing Single-Molecule, Real-Time (SMRT) technology, has revolutionized next-generation sequencing by providing an accurate long-read platform. PacBio single-molecule long reads have been used to delineate complex spliceoforms, detect mutations in highly homologous sequences, identify mRNA chimeras and chromosomal translocations, accurately phase haplotypes over multiple kilobase distances and aid in assembly of genomes with complex structural variation. The PacBio protocol for preparation of sequencing templates employs blunt-end hairpin adapter ligation, which enables a short turnaround time for sequence production. However, we have found that a significant portion of sequencing yield contains chimeric reads resulting from blunt-end ligation of multiple template molecules to each other prior to adapter ligation. These artefactual fusion DNA sequences pose a major challenge to analysis and can lead to false-positive detection of fusion events. We assessed the frequency of artefactual fusion when using blunt-end adapter ligation and compared it to an alternative method using A/T overhang adapter ligation. The A/T overhang adapter ligation method showed a vast improvement in limiting artefactual fusion events and is now our recommended procedure for adapter ligation during PacBio library preparation.
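
To illustrate why such chimeric reads are hard to handle downstream, here is a minimal sketch of how an analysis might flag reads whose alignment segments land on different chromosomes or far apart on the same one. It is not the authors' method: the `Segment` type, the `is_candidate_chimera` function and the `max_gap` threshold are hypothetical names and values chosen for illustration. A flag of this kind cannot tell an artefactual ligation product from a genuine fusion, which is the problem the abstract describes.

```python
from typing import NamedTuple, Sequence


class Segment(NamedTuple):
    """One aligned piece of a read (hypothetical, simplified record)."""
    chrom: str
    ref_start: int
    ref_end: int


def is_candidate_chimera(segments: Sequence[Segment], max_gap: int = 10_000) -> bool:
    """Return True if consecutive segments of one read map to different
    chromosomes, or to the same chromosome more than max_gap bases apart.

    max_gap is an arbitrary illustrative threshold, not a recommended value.
    """
    for prev, cur in zip(segments, segments[1:]):
        if prev.chrom != cur.chrom:
            return True
        # Distance between the two reference intervals (negative if they overlap).
        gap = max(cur.ref_start - prev.ref_end, prev.ref_start - cur.ref_end)
        if gap > max_gap:
            return True
    return False


contiguous = [Segment("chr1", 100, 5_000), Segment("chr1", 5_200, 9_000)]
split_read = [Segment("chr1", 100, 5_000), Segment("chr7", 40_000, 44_000)]

print(is_candidate_chimera(contiguous))  # adjacent pieces on one chromosome
print(is_candidate_chimera(split_read))  # pieces on two chromosomes
```

Because both a true translocation and a library-preparation artefact produce the same split-alignment signature, reducing artefacts at the ligation step, as the abstract recommends, is more reliable than filtering afterwards.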

Accurate detection of complex structural variations using single molecule sequencing

10.1101/169557 ◽

2017 ◽

Cited By ~ 31

Author(s):

Fritz J. Sedlazeck ◽

Philipp Rescheneder ◽

Moritz Smolka ◽

Han Fang ◽

Maria Nattestad ◽

...

Keyword(s):

Single Molecule ◽

Error Rates ◽

Structural Variations ◽

Link Type ◽

Long Reads ◽

Oxford Nanopore ◽

Long Read ◽

Human Disorders ◽

Complex Structural ◽

The Cost

Abstract: Structural variations (SVs) are the largest source of genetic variation, but remain poorly understood because of limited genomics technology. Single molecule long read sequencing from Pacific Biosciences and Oxford Nanopore has the potential to dramatically advance the field, although their high error rates challenge existing methods. Addressing this need, we introduce open-source methods for long read alignment (NGMLR, https://github.com/philres/ngmlr) and SV identification (Sniffles, https://github.com/fritzsedlazeck/Sniffles) that enable unprecedented SV sensitivity and precision, including within repeat-rich regions and of complex nested events that can have significant impact on human disorders. Examining several datasets, including healthy and cancerous human genomes, we discover thousands of novel variants using long reads and categorize systematic errors in short-read approaches. NGMLR and Sniffles are further able to automatically filter false events and operate on low amounts of coverage to address the cost factor that has hindered the application of long reads in clinical and research settings.

Hapo-G, haplotype-aware polishing of genome assemblies with accurate reads

NAR Genomics and Bioinformatics ◽

10.1093/nargab/lqab034 ◽

2021 ◽

Vol 3 (2) ◽

Author(s):

Jean-Marc Aury ◽

Benjamin Istace

Keyword(s):

Single Molecule ◽

Direct Consequence ◽

High Quality ◽

Sequencing Errors ◽

Coding Regions ◽

Sequencing Technologies ◽

Long Reads ◽

Oxford Nanopore ◽

Long Read ◽

Genome Assemblies

Abstract: Single-molecule sequencing technologies have recently been commercialized by Pacific Biosciences and Oxford Nanopore with the promise of sequencing long DNA fragments (on the order of kilobases to megabases) and then, using efficient algorithms, providing high-quality assemblies in terms of contiguity and completeness of repetitive regions. However, the error rate of long-read technologies is higher than that of short-read technologies. This has a direct consequence on the base quality of genome assemblies, particularly in coding regions where sequencing errors can disrupt the coding frame of genes. In the case of diploid genomes, the consensus of a given gene can be a mixture of the two haplotypes and can lead to premature stop codons. Several methods have been developed to polish genome assemblies using short reads; generally, they inspect the nucleotides one by one and provide a correction for each nucleotide of the input assembly. As a result, these algorithms are not able to properly process diploid genomes and they typically switch from one haplotype to another. Herein we propose Hapo-G (Haplotype-Aware Polishing Of Genomes), a new algorithm capable of incorporating phasing information from high-quality reads (short or long reads) to polish genome assemblies, in particular assemblies of diploid and heterozygous genomes.
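
The haplotype-switching problem described above can be shown with a toy example. The sketch below is not Hapo-G's algorithm: the read data and the functions `per_site_majority` and `phase_aware` are invented for illustration. It only contrasts an independent per-site majority vote with a vote restricted to reads that share the allele chosen at the first site.

```python
from collections import Counter

# Toy reads over two heterozygous sites of a diploid locus (invented data).
# Haplotype 1 is ("A", "C"); haplotype 2 is ("G", "T"); None = site not covered.
READS = [
    ("A", "C"), ("A", "C"), ("A", None),
    ("G", "T"), ("G", "T"), (None, "T"), (None, "T"),
]


def most_common_allele(alleles):
    """Return the most frequent non-missing allele."""
    counts = Counter(a for a in alleles if a is not None)
    return counts.most_common(1)[0][0]


def per_site_majority(reads):
    """Polish each site independently, ignoring which read an allele came from."""
    n_sites = len(reads[0])
    return tuple(most_common_allele(r[i] for r in reads) for i in range(n_sites))


def phase_aware(reads):
    """Pick the allele at the first site, then vote at later sites using only
    reads that carry that allele, so the consensus follows one haplotype."""
    first = most_common_allele(r[0] for r in reads)
    linked = [r for r in reads if r[0] == first]
    rest = tuple(most_common_allele(r[i] for r in linked)
                 for i in range(1, len(reads[0])))
    return (first,) + rest


print(per_site_majority(READS))  # mixes alleles from both haplotypes
print(phase_aware(READS))        # stays on a single haplotype
```

With this data the independent vote yields ("A", "T"), a combination present on neither haplotype, while the linked vote yields ("A", "C"). Uneven coverage between sites is enough to cause the switch, which is why read-level phasing information matters.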

Mapping and phasing of structural variation in patient genomes using nanopore sequencing

10.1101/129379 ◽

2017 ◽

Cited By ~ 4

Author(s):

Mircea Cretu Stancu ◽

Markus J. van Roosmalen ◽

Ivo Renkens ◽

Marleen Nieboer ◽

Sjors Middelkamp ◽

...

Keyword(s):

Single Molecule ◽

De Novo ◽

Structural Variants ◽

Human Genetic Disease ◽

Structural Genomic ◽

Short Read ◽

Sequencing Technologies ◽

Genome Wide ◽

Long Read ◽

Complex Structural

Abstract: Structural genomic variants form a common type of genetic alteration underlying human genetic disease and phenotypic variation. Despite major improvements in genome sequencing technology and data analysis, the detection of structural variants still poses challenges, particularly when variants are of high complexity. Emerging long-read single-molecule sequencing technologies provide new opportunities for detection of structural variants. Here, we demonstrate sequencing of the genomes of two patients with congenital abnormalities using the ONT MinION at 11x and 16x mean coverage, respectively. We developed a bioinformatic pipeline, NanoSV, to efficiently map genomic structural variants (SVs) from the long-read data. We demonstrate that the nanopore data are superior to corresponding short-read data with regard to detection of de novo rearrangements originating from complex chromothripsis events in the patients. Additionally, genome-wide surveillance of SVs revealed 3,253 (33%) novel variants that were missed in short-read data of the same sample, the majority of which are duplications < 200 bp in size. Long sequencing reads enabled efficient phasing of genetic variations, allowing the construction of genome-wide maps of phased SVs and SNVs. We employed read-based phasing to show that all de novo chromothripsis breakpoints occurred on paternal chromosomes and we resolved the long-range structure of the chromothripsis. This work demonstrates the value of long-read sequencing for screening whole genomes of patients for complex structural variants.

Single-molecule long-read sequencing reveals a conserved selection mechanism determining intact long RNA and miRNA profiles in sperm

10.1101/2020.05.28.122382 ◽

2020 ◽

Author(s):

Yu H. Sun ◽

Anqi Wang ◽

Chi Song ◽

Rajesh K. Srivastava ◽

Kin Fai Au ◽

...

Keyword(s):

Single Molecule ◽

Ribosomal Proteins ◽

Selection Process ◽

Future Research ◽

Selection Mechanism ◽

Bioinformatics Pipeline ◽

Long Reads ◽

Evolutionarily Conserved ◽

Long Read ◽

Early Trauma

Abstract: Sperm contributes diverse RNAs to the zygote. While sperm small RNAs have been shown to be shaped by paternal environments and impact offspring phenotypes, we know little about long RNAs in sperm, including mRNAs and long non-coding RNAs. Here, by integrating PacBio single-molecule long reads with Illumina short reads, we found 2,778 sperm intact long transcript (SpILT) species in mouse. The SpILT profile is evolutionarily conserved between rodents and primates. mRNAs encoding ribosomal proteins are enriched in SpILTs, and in mice they are sensitive to early trauma. Mouse and human SpILT profiles are determined by a post-transcriptional selection process during spermiogenesis, and are co-retained in sperm with base pair-complementary miRNAs. In sum, we have developed a bioinformatics pipeline to define intact transcripts, added SpILTs into the “sperm RNA code” for use in future research and potential diagnosis, and uncovered selection mechanism(s) controlling sperm RNA profiles.

Contig annotation tool CAT robustly classifies assembled metagenomic contigs and long sequences

10.1101/072868 ◽

2016 ◽

Cited By ~ 13

Author(s):

Diego D. Cambuy ◽

Felipe H. Coutinho ◽

Bas E. Dutilh

Keyword(s):

Single Molecule ◽

Dna Sequences ◽

Taxonomic Classification ◽

Annotation Tool ◽

Single Molecule Sequencing ◽

Short Read ◽

Long Read ◽

Micro Organisms ◽

Taxonomic Annotation

Abstract: In modern-day metagenomics, there is an increasing need for robust taxonomic annotation of long DNA sequences from unknown micro-organisms. Long metagenomic sequences may be derived from assembly of short-read metagenomes, or from long-read single molecule sequencing. Here we introduce CAT, a pipeline for robust taxonomic classification of long DNA sequences. We show that CAT correctly classifies contigs at different taxonomic levels, even in simulated metagenomic datasets that are very distantly related to the sequences in the database. CAT is implemented in Python and the required scripts can be freely downloaded from GitHub.

SMRT sequencing reveals differential patterns of methylation in two O111:H- Shiga toxigenic Escherichia coli isolates from a historic hemolytic uremic syndrome outbreak in Australia

10.1101/173336 ◽

2017 ◽

Author(s):

Brian M. Forde ◽

Lauren J. McAllister ◽

James C. Paton ◽

Adrienne W. Paton ◽

Scott A. Beatson

Keyword(s):

Single Molecule ◽

Genomic Analysis ◽

Smrt Sequencing ◽

Promoter Regions ◽

Food Borne ◽

Long Reads ◽

Long Read ◽

Shiga Toxin 2 ◽

Uremic Syndrome ◽

Potential Promoter

Abstract: Shiga toxigenic Escherichia coli (STEC) are important food-borne pathogens and a major cause of haemorrhagic colitis and haemolytic-uremic syndrome (HUS) worldwide. In 1995 a severe HUS outbreak occurred in Adelaide. A recent genomic analysis of STEC O111:H- strains 95JB1 and 95NR1 from this outbreak found that the more virulent isolate, 95NR1, harboured two additional copies of the Shiga toxin 2 (Stx2) genes, although the structure of the Stx2-converting prophages could not be fully resolved due to the fragmented assembly. In this study we have used Pacific Biosciences (PacBio) single molecule real-time (SMRT) long read sequencing to characterise the complete epigenome (genome and methylome) of 95JB1 and 95NR1. Using long reads we completely resolved the structure of two tandemly inserted stx2-converting phages in 95NR1. Our analysis of the methylome of 95NR1 and 95JB1 identified hemi-methylation of a novel motif (5’-CTGCm6AG-3’) at more than 4000 sites in the 95NR1 genome. These sites were entirely unmethylated in 95JB1, including at least 180 potential promoter regions that could explain regulatory differences between the strains. We identified a Type IIG methyltransferase encoded in both genomes in association with three additional genes in an operon-like arrangement. IS1203-mediated disruption of this operon in 95JB1 is the likely cause of the observed differential patterns of methylation between 95NR1 and 95JB1. This study demonstrates the enormous potential of PacBio SMRT sequencing to resolve complex prophage regions and reveal the genetic and epigenetic heterogeneity within a clonal population of bacteria.

BleTIES: Annotation of natural genome editing in ciliates using long read sequencing

10.1101/2021.05.18.444610 ◽

2021 ◽

Author(s):

Brandon K. B. Seah ◽

Estienne C. Swart

Keyword(s):

Dna Sequences ◽

Sequence Data ◽

Low Complexity ◽

Supplementary Information ◽

Neighboring Element ◽

Sequencing Technologies ◽

Long Reads ◽

Oxford Nanopore ◽

Long Read ◽

Element Elimination

Ciliates are single-celled eukaryotes that eliminate specific, interspersed DNA sequences (internally eliminated sequences, IESs) from their genomes during development. These are challenging to annotate and assemble because IES-containing sequences are much less abundant in the cell than those without, and IES sequences themselves often contain repetitive and low-complexity sequences. Long read sequencing technologies from Pacific Biosciences and Oxford Nanopore have the potential to reconstruct longer IESs than has been possible with short reads, and also the ability to detect correlations of neighboring element elimination. Here we present BleTIES, a software toolkit for detecting, assembling, and analyzing IESs using mapped long reads. Availability and implementation: BleTIES is implemented in Python 3. Source code is available at https://github.com/Swart-lab/bleties (MIT license), and also distributed via Bioconda. Contact: [email protected] Supplementary information: Benchmarking of BleTIES with published sequence data.

ISOdb: A Comprehensive Database of Full-Length Isoforms Generated by Iso-Seq

International Journal of Genomics ◽

10.1155/2018/9207637 ◽

2018 ◽

Vol 2018 ◽

pp. 1-6 ◽

Cited By ~ 1

Author(s):

Shang-Qian Xie ◽

Yue Han ◽

Xiao-Zhou Chen ◽

Tai-Yu Cao ◽

Kai-Kai Ji ◽

...

Keyword(s):

Single Molecule ◽

Full Length ◽

Public Access ◽

Transcript Isoforms ◽

Sequencing Technologies ◽

Long Reads ◽

Depth Analysis ◽

Gene Level ◽

Long Read ◽

Full Length Transcript

The accurate landscape of transcript isoforms plays an important role in the understanding of gene function and gene regulation. However, building complete transcripts is very challenging for short reads generated using next-generation sequencing. Fortunately, isoform sequencing (Iso-Seq) using single-molecule sequencing technologies, such as PacBio SMRT, provides long reads spanning entire transcript isoforms which do not require assembly. Therefore, we have developed ISOdb, a comprehensive resource database for hosting and carrying out an in-depth analysis of Iso-Seq datasets and visualising the full-length transcript isoforms. The current version of ISOdb has collected 93 publicly available Iso-Seq samples from eight species and presents the samples in two levels: (1) sample level, including metainformation, long read distribution, isoform numbers, and alternative splicing (AS) events of each sample; (2) gene level, including the total isoforms, novel isoform number, novel AS number, and isoform visualisation of each gene. In addition, ISOdb provides a user interface in the website for uploading sample information to facilitate the collection and analysis of researchers’ datasets. Currently, ISOdb is the first repository that offers comprehensive resources and convenient public access for hosting, analysing, and visualising Iso-Seq data, which is freely available.

Quantifying the Benefit Offered by Transcript Assembly on Single-Molecule Long Reads

10.1101/632703 ◽

2019 ◽

Cited By ~ 1

Author(s):

Laura H. Tung ◽

Mingfu Shao ◽

Carl Kingsford

Keyword(s):

Single Molecule ◽

Error Rates ◽

Human Transcriptome ◽

Sequencing Technologies ◽

Third Generation Sequencing ◽

Long Reads ◽

Long Read ◽

Transcript Assembly ◽

Novel Isoforms ◽

Generation Sequencing

Abstract: Third-generation sequencing technologies benefit transcriptome analysis by generating longer sequencing reads. However, not all single-molecule long reads represent full transcripts due to incomplete cDNA synthesis and the sequencing length limit of the platform. This drives a need for long read transcript assembly. We quantify the benefit that can be achieved by using a transcript assembler on long reads. Adding long-read-specific algorithms, we evolved Scallop to make Scallop-LR, a long-read transcript assembler, to handle the computational challenges arising from long read lengths and high error rates. Analyzing 26 SRA PacBio datasets using Scallop-LR, Iso-Seq Analysis, and StringTie, we quantified the amount by which assembly improved Iso-Seq results. Through combined evaluation methods, we found that Scallop-LR identifies 2100–4000 more (for 18 human datasets) or 1100–2200 more (for eight mouse datasets) known transcripts than Iso-Seq Analysis, which does not do assembly. Further, Scallop-LR finds 2.4–4.4 times more potentially novel isoforms than Iso-Seq Analysis for the human and mouse datasets. StringTie also identifies more transcripts than Iso-Seq Analysis. Adding long-read-specific optimizations in Scallop-LR increases the numbers of predicted known transcripts and potentially novel isoforms for the human transcriptome compared to several recent short-read assemblers (e.g. StringTie). Our findings indicate that transcript assembly by Scallop-LR can reveal a more complete human transcriptome.

Long-read Sequencing Uncovers a Complex Transcriptome Topology in Varicella Zoster Virus

10.1101/399048 ◽

2018 ◽

Author(s):

István Prazsák ◽

Norbert Moldován ◽

Dóra Tombácz ◽

Klára Megyeri ◽

Attila Szűcs ◽

...

Keyword(s):

Varicella Zoster Virus ◽

Genomic Region ◽

Transcript Isoforms ◽

Sequencing Platform ◽

Varicella Zoster ◽

Rna Molecules ◽

Oxford Nanopore ◽

Complex Structural Variation ◽

Long Read ◽

Complex Structural

Abstract: Background: Varicella zoster virus (VZV) is a human pathogenic alphaherpesvirus harboring a relatively large DNA molecule. The VZV transcriptome has already been analyzed by microarray and short-read sequencing analyses. However, both approaches have substantial limitations when used for structural characterization of transcript isoforms, even if supplemented with primer extension or other techniques. Among other limitations, they are inefficient in distinguishing between embedded RNA molecules, transcript isoforms, including splice and length variants, as well as between alternative polycistronic transcripts. It has been demonstrated in several studies that long-read sequencing is able to circumvent these problems. Results: In this work, we report the analysis of the VZV lytic transcriptome using the Oxford Nanopore Technologies sequencing platform. These investigations have led to the identification of 114 novel transcripts, including mRNAs, non-coding RNAs, polycistronic RNAs and complex transcripts, as well as 10 novel spliced transcripts and 27 novel transcription start site isoforms and transcription end site isoforms. A novel class of transcripts, the nroRNAs, is described in this study. These transcripts are encoded by the genomic region located in close vicinity to the viral replication origin. We also show that the VZV latency transcript (VLT) exhibits a more complex structural variation than formerly believed. Additionally, we have detected RNA editing in a novel non-coding RNA molecule. Conclusions: Our investigations disclosed a composite transcriptomic architecture of VZV, including the discovery of novel RNA molecules and transcript isoforms, as well as a complex meshwork of transcriptional read-throughs and overlaps. The results represent a substantial advance in the annotation of the VZV transcriptome and in understanding the molecular biology of the herpesviruses in general.