lra: A long read aligner for sequences and contigs

It is computationally challenging to detect variation by aligning single-molecule sequencing (SMS) reads, or contigs from SMS assemblies. One approach to efficiently align SMS reads is sparse dynamic programming (SDP), where optimal chains of exact matches are found between the sequence and the genome. While straightforward implementations of SDP penalize gaps with a cost that is a linear function of gap length, biological variation is more accurately represented when gap cost is a concave function of gap length. We have developed a method, lra, that uses SDP with a concave-cost gap penalty, and used lra to align long-read sequences from PacBio and Oxford Nanopore (ONT) instruments as well as de novo assembly contigs. This alignment approach increases sensitivity and specificity for SV discovery, particularly for variants above 1 kb and when discovering variation from ONT reads, while having runtimes that are comparable (1.05-3.76×) to current methods. When applied to calling variation from de novo assembly contigs, there is a 3.2% increase in Truvari F1 score compared to minimap2+htsbox. lra is available in bioconda (https://anaconda.org/bioconda/lra) and GitHub (https://github.com/ChaissonLab/LRA).
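
The chaining idea in this abstract can be illustrated with a small sketch. This is not lra's implementation: it is a naive quadratic dynamic program rather than an efficient sparse one, and the anchor layout, the logarithmic form of the penalty, and the constants `a` and `b` are all assumptions chosen for illustration.

```python
import math
from typing import List, Tuple

# An anchor is an exact match: (query_start, ref_start, length).
Anchor = Tuple[int, int, int]


def gap_cost(gap: int, a: float = 2.0, b: float = 1.0) -> float:
    """Concave gap penalty (hypothetical constants): grows with log of gap size."""
    return 0.0 if gap == 0 else a * math.log(gap + 1) + b


def chain(anchors: List[Anchor]) -> Tuple[float, List[Anchor]]:
    """Return the best-scoring colinear chain of anchors and its score.

    Naive O(n^2) dynamic program; a sparse implementation would avoid
    comparing every pair of anchors.
    """
    if not anchors:
        return 0.0, []
    anchors = sorted(anchors)
    n = len(anchors)
    score = [float(a[2]) for a in anchors]  # a chain of one anchor scores its length
    prev = [-1] * n
    for i, (qi, ri, li) in enumerate(anchors):
        for j in range(i):
            qj, rj, lj = anchors[j]
            # Anchor j must end before anchor i starts on both sequences.
            if qj + lj <= qi and rj + lj <= ri:
                # Difference between diagonals approximates the indel size.
                gap = abs((qi - qj) - (ri - rj))
                candidate = score[j] + li - gap_cost(gap)
                if candidate > score[i]:
                    score[i] = candidate
                    prev[i] = j
    best = max(range(n), key=score.__getitem__)
    path = []
    k = best
    while k != -1:
        path.append(anchors[k])
        k = prev[k]
    return score[best], path[::-1]
```

In this sketch a 1000-base gap costs only slightly more than a 500-base gap, so a chain can span a large insertion or deletion. Under a linear penalty the same gap would cost in proportion to its length and could split the chain.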

lra: the Long Read Aligner for Sequences and Contigs

10.1101/2020.11.15.383273 ◽

2020 ◽

Author(s):

Jingwen Ren ◽

Mark JP Chaisson

Keyword(s):

Single Molecule ◽

De Novo ◽

Data Types ◽

Single Molecule Sequencing ◽

Detection Algorithms ◽

Link Type ◽

Long Reads ◽

Oxford Nanopore ◽

Long Read ◽

Linear Cost

Abstract Motivation: It is computationally challenging to detect variation by aligning long reads from single-molecule sequencing (SMS) instruments, or megabase-scale contigs from SMS assemblies. One approach to efficiently align long sequences is sparse dynamic programming (SDP), where exact matches are found between the sequence and the genome, and optimal chains of matches are found representing a rough alignment. Sequence variation is more accurately modeled when alignments are scored with a gap penalty that is a convex function of the gap length. Because previous implementations of SDP used a linear-cost gap function that does not accurately model variation, and implementations of alignment that have a convex gap penalty are either inefficient or use heuristics, we developed a method, lra, that uses SDP with a convex-cost gap penalty. We use lra to align long-read sequences from PacBio and Oxford Nanopore (ONT) instruments as well as de novo assembly contigs. Results: Across all data types, the runtime of lra is between 52-168% of that of the state-of-the-art aligner minimap2 when generating SAM alignments, and 9-15% of that of an alternative method, ngmlr. This alignment approach may be used to provide additional evidence of SV calls in PacBio datasets, and gives an increase in sensitivity and specificity on ONT data with current SV detection algorithms. The number of calls discovered using pbsv with lra alignments is within 98.3-98.6% of calls made from minimap2 alignments on the same data, and gives a nominal 0.2-0.4% increase in F1 score by Truvari analysis. On ONT data with SVs called using Sniffles, the number of calls made from lra alignments is 3% greater than minimap2-based calls, and 30% greater than ngmlr-based calls, with a 4.6-5.5% increase in Truvari F1 score. When applied to calling variation from de novo assembly contigs, there is a 5.8% increase in SV calls compared to minimap2+paftools, with a 4.3% increase in Truvari F1 score. Availability and implementation: Available in bioconda: https://anaconda.org/bioconda/lra and GitHub: https://github.com/ChaissonLab/LRA. Contact: [email protected], [email protected]
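
The F1 scores quoted above combine precision and recall of SV calls against a truth set. As a minimal sketch, assuming the benchmark yields counts of true positives, false positives, and false negatives (the function name and the example counts are hypothetical, not taken from the paper):

```python
def f1_score(tp: int, fp: int, fn: int) -> float:
    """F1 = harmonic mean of precision and recall, from call counts."""
    if tp == 0:
        return 0.0  # no correct calls: precision and recall are both zero
    precision = tp / (tp + fp)  # fraction of reported calls that are correct
    recall = tp / (tp + fn)     # fraction of true variants that were found
    return 2 * precision * recall / (precision + recall)
```

For example, `f1_score(tp=90, fp=10, fn=30)` gives precision 0.9 and recall 0.75, so a change in either the number of calls or their accuracy moves the F1 score.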

Oxford Nanopore Sequencing, Hybrid Error Correction, and de novo Assembly of a Eukaryotic Genome

10.1101/013490 ◽

2015 ◽

Cited By ~ 23

Author(s):

Sara Goodwin ◽

James Gurtowski ◽

Scott Ethe-Sayers ◽

Panchajanya Deshpande ◽

Michael Schatz ◽

...

Keyword(s):

Error Correction ◽

De Novo Assembly ◽

De Novo ◽

Correction Algorithm ◽

Membrane Pore ◽

Complete Representation ◽

Oxford Nanopore ◽

Long Read ◽

Error Correction Algorithm ◽

Sequencing Instrument

Monitoring the progress of DNA molecules through a membrane pore has been postulated as a method for sequencing DNA for several decades. Recently, a nanopore-based sequencing instrument, the Oxford Nanopore MinION, has become available that we used for sequencing the S. cerevisiae genome. To make use of these data, we developed a novel open-source hybrid error correction algorithm Nanocorr (https://github.com/jgurtowski/nanocorr) specifically for Oxford Nanopore reads, as existing packages were incapable of assembling the long read lengths (5-50kbp) at such a high error rate (between ~5 and 40% error). With this new method we were able to perform a hybrid error correction of the nanopore reads using complementary MiSeq data and produce a de novo assembly that is highly contiguous and accurate: the contig N50 length is more than ten times greater than an Illumina-only assembly (678kb versus 59.9kbp), and has greater than 99.88% consensus identity when compared to the reference. Furthermore, the assembly with the long nanopore reads presents a much more complete representation of the features of the genome and correctly assembles gene cassettes, rRNAs, transposable elements, and other genomic features that were almost entirely absent in the Illumina-only assembly.

De novo assembly of the cattle reference genome with single-molecule sequencing

GigaScience ◽

10.1093/gigascience/giaa021 ◽

2020 ◽

Vol 9 (3) ◽

Cited By ~ 35

Author(s):

Benjamin D Rosen ◽

Derek M Bickhart ◽

Robert D Schnabel ◽

Sergey Koren ◽

Christine G Elsik ◽

...

Keyword(s):

Single Molecule ◽

De Novo Assembly ◽

Reference Genome ◽

De Novo ◽

Bos Taurus ◽

Future Research ◽

Protein Coding ◽

Single Molecule Sequencing ◽

Assembly Accuracy ◽

Genomic Tools

Abstract Background Major advances in selection progress for cattle have been made following the introduction of genomic tools over the past 10–12 years. These tools depend upon the Bos taurus reference genome (UMD3.1.1), which was created using now-outdated technologies and is hindered by a variety of deficiencies and inaccuracies. Results We present the new reference genome for cattle, ARS-UCD1.2, based on the same animal as the original to facilitate transfer and interpretation of results obtained from the earlier version, but applying a combination of modern technologies in a de novo assembly to increase continuity, accuracy, and completeness. The assembly includes 2.7 Gb and is >250× more continuous than the original assembly, with contig N50 >25 Mb and L50 of 32. We also greatly expanded supporting RNA-based data for annotation that identifies 30,396 total genes (21,039 protein coding). The new reference assembly is accessible in annotated form for public use. Conclusions We demonstrate that improved continuity of assembled sequence warrants the adoption of ARS-UCD1.2 as the new cattle reference genome and that increased assembly accuracy will benefit future research on this species.
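
The contiguity figures above (contig N50 and L50) follow from a simple calculation over contig lengths. A minimal sketch, with a hypothetical function name and made-up lengths in the usage note:

```python
from typing import List, Tuple


def n50_l50(lengths: List[int]) -> Tuple[int, int]:
    """Return (N50, L50) for a list of contig lengths.

    N50 is the length of the contig at which the running total, taken from
    longest to shortest, first reaches half of the assembly size. L50 is the
    number of contigs needed to reach that point.
    """
    if not lengths:
        raise ValueError("need at least one contig length")
    total = sum(lengths)
    running = 0
    for count, length in enumerate(sorted(lengths, reverse=True), start=1):
        running += length
        if 2 * running >= total:
            return length, count
    raise AssertionError("unreachable for non-empty input")
```

With contigs of lengths 80, 70, 50, 40, 30, and 20, the two longest contigs already cover half of the 290 total, so the N50 is 70 and the L50 is 2.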

A whole genome atlas of 81 Psilocybe genomes as a resource for psilocybin production.

F1000Research ◽

10.12688/f1000research.55301.2 ◽

2021 ◽

Vol 10 ◽

pp. 961

Author(s):

Kevin McKernan ◽

Liam Kane ◽

Yvonne Helbert ◽

Lei Zhang ◽

Nathan Houde ◽

...

Keyword(s):

Gene Cluster ◽

Single Molecule ◽

Reference Genome ◽

De Novo ◽

Genomic Diversity ◽

Sequence Coverage ◽

Single Molecule Sequencing ◽

Contiguous Gene ◽

Long Read ◽

Interesting Variation

The Psilocybe genus is well known for the synthesis of valuable psychoactive compounds such as Psilocybin, Psilocin, Baeocystin and Aeruginascin. The ubiquity of Psilocybin synthesis in Psilocybe has been attributed to a horizontal gene transfer mechanism of a ~20Kb gene cluster. A recently published highly contiguous reference genome derived from long read single molecule sequencing has underscored interesting variation in this Psilocybin synthesis gene cluster. This reference genome has also enabled the shotgun sequencing of spores from many Psilocybe strains to better catalog the genomic diversity in the Psilocybin synthesis pathway. Here we present the de novo assembly of 81 Psilocybe genomes compared to the P.envy reference genome. Surprisingly, the genomes of Psilocybe galindoi, Psilocybe tampanensis and Psilocybe azurescens lack sequence coverage over the previously described Psilocybin synthesis pathway but do demonstrate amino acid sequence homology to a less contiguous gene cluster and may illuminate the previously proposed evolution of psilocybin synthesis.

AsmMix: A pipeline for high quality diploid de novo assembly

10.1101/2021.01.15.426893 ◽

2021 ◽

Author(s):

Pei Wu ◽

Chao Liu ◽

Ou Wang ◽

Xia Zhao ◽

Fang Chen ◽

...

Keyword(s):

Single Molecule ◽

De Novo ◽

Variant Calling ◽

The Other ◽

Second Step ◽

Small Scale ◽

Mixing Process ◽

High Quality ◽

Single Molecule Sequencing ◽

Long Read

Abstract In this paper, we report a pipeline, AsmMix, which is capable of producing both contiguous and high-quality diploid genomes. The pipeline consists of two steps. In the first step, two sets of assemblies are generated: one is based on co-barcoded reads, which are highly accurate and haplotype-resolved but contain many gaps; the other is based on single-molecule sequencing reads, which is contiguous but error-prone. In the second step, those two sets of assemblies are compared and integrated into a haplotype-resolved assembly with fewer errors. We test our pipeline using a dataset of human genome NA24385, perform variant calling from those assemblies and then compare against the GIAB Benchmark. We show that the AsmMix pipeline could produce highly contiguous, accurate, and haplotype-resolved assemblies. In particular, the assembly mixing process could effectively reduce small-scale errors in the long read assembly.

Genomic Surveillance for Antimicrobial Resistance in Mannheimia haemolytica Using Nanopore Single Molecule Sequencing Technology

10.1101/395087 ◽

2018 ◽

Author(s):

Alexander Lim ◽

Bryan Naidenov ◽

Haley Bates ◽

Karyn Willyerd ◽

Timothy Snider ◽

...

Keyword(s):

Antibiotic Resistance ◽

Antimicrobial Resistance ◽

Single Molecule ◽

Resistant Strain ◽

De Novo ◽

Gene Annotation ◽

Cost Effective ◽

Multidrug Resistant ◽

Oxford Nanopore ◽

Long Read

Abstract Disruptive innovations in long-range, cost-effective direct template nucleic acid sequencing are transforming clinical and diagnostic medicine. A multidrug resistant strain and a pan-susceptible strain of Mannheimia haemolytica, isolated from pneumonic bovine lung samples, were respectively sequenced at 146x and 111x coverage with Oxford Nanopore Technologies MinION. De novo assembly produced a complete genome for the non-resistant strain and a nearly complete assembly for the drug resistant strain. Functional annotation using RAST (Rapid Annotations using Subsystems Technology), CARD (Comprehensive Antibiotic Resistance Database) and ResFinder databases identified genes conferring resistance to different classes of antibiotics including beta lactams, tetracyclines, lincosamides, phenicols, aminoglycosides, sulfonamides and macrolides. Antibiotic resistance phenotypes of the M. haemolytica strains were confirmed with minimum inhibitory concentration (MIC) assays. The sequencing capacity of highly portable MinION devices was verified by sub-sampling sequencing reads; potential for antimicrobial resistance determined by identification of resistance genes in the draft assemblies with as few as 5,437 MinION reads corresponded to all classes of MIC assays. The resulting quality assemblies and AMR gene annotation highlight the efficiency of ultra long-read, whole-genome sequencing (WGS) as a valuable tool in diagnostic veterinary medicine.

Genetic Findings of Sanger and Nanopore Single-Molecule Sequencing in Patients with X-Linked Hearing Loss and Incomplete Partition Type III

10.21203/rs.3.rs-501574/v1 ◽

2021 ◽

Author(s):

Ying Chen ◽

Jiajun Qiu ◽

Yingwei Wu ◽

Huan Jia ◽

Yi Jiang ◽

...

Keyword(s):

Single Molecule ◽

Hearing Aids ◽

Sanger Sequencing ◽

Clinical Characteristics ◽

De Novo ◽

Genetic Mutations ◽

Single Molecule Sequencing ◽

Long Read ◽

Incomplete Partition ◽

The Mean

Abstract Background: POU3F4 is the causative gene for X-linked deafness-2 (DFNX2), characterized by incomplete partition type III (IP-III) malformation of the inner ear. The aim of this study was to investigate the clinical characteristics and molecular findings by Sanger or Nanopore single-molecule sequencing in IP-III patients. Methods: Diagnosis of IP-III was mainly based on clinical characteristics including radiological and audiological findings. Sanger sequencing of POU3F4 was carried out for these IP-III patients. For those patients with negative results for POU3F4 Sanger sequencing, Nanopore long-read single-molecule sequencing was used to identify the possible pathogenic variants. Hearing intervention outcomes of hearing aids fitting and cochlear implantation were also analyzed. Grouped by different locations of POU3F4 variants, aided PTA was further compared between patients in whom the variants were located in the exon region or in the upstream region. Results: In total, 18 male patients from 14 unrelated families were diagnosed with IP-III. Ten variants were identified in POU3F4 by Sanger sequencing and 9 of these were novel (p.Val321Gly, p.Gln181*, p.Cys233*, p.Val215Gly, p.Arg282Gln, p.Trp57*, p.Gln316*, c.903_912 delins TGCCA and p.Arg205del). Four different deletions (DELs) that varied from 80 to 486 kb were identified 876-1503 kb upstream of POU3F4 by Nanopore long-read single-molecule sequencing. Of them, de novo genetic mutations occurred in 21.4% (3/14) of patients with POU3F4 mutations. Of these 18 patients, 7 had bilateral hearing aids (HAs) and 10 patients received unilateral cochlear implantation (CI). The mean aided pure tone average (PTA) for HAs and CI users were 41.1±5.18 and 40.3±7.59 dB HL respectively. The mean PTAs for patients whose variants were located in the exon and upstream regions were 39.6±6.31 vs 43.0±7.10 dB HL, which presented no significant difference (p=0.342). Conclusions: Among IP-III patients, 28.6% (4/14) had no definite mutation in the exon region of POU3F4; however, possible pathogenic deletions were identified in the upstream region of this gene. De novo genetic mutations occurred in 21.4% (3/14) of patients with POU3F4 mutation. Hearing intervention outcomes of IP-III patients presented no difference regardless of the variant locations in exon or upstream regions.

Assembling Large Genomes with Single-Molecule Sequencing and Locality Sensitive Hashing

10.1101/008003 ◽

2014 ◽

Cited By ~ 13

Author(s):

Konstantin Berlin ◽

Sergey Koren ◽

Chen-Shan Chin ◽

James Drake ◽

Jane M Landolin ◽

...

Keyword(s):

Single Molecule ◽

De Novo ◽

Locality Sensitive Hashing ◽

Model Organisms ◽

Smrt Sequencing ◽

High Coverage ◽

Celera Assembler ◽

Single Molecule Sequencing ◽

Long Reads ◽

Long Read

We report reference-grade de novo assemblies of four model organisms and the human genome from single-molecule, real-time (SMRT) sequencing. Long-read SMRT sequencing is routinely used to finish microbial genomes, but the available assembly methods have not scaled well to larger genomes. Here we introduce the MinHash Alignment Process (MHAP) for efficient overlapping of noisy, long reads using probabilistic, locality-sensitive hashing. Together with Celera Assembler, MHAP was used to reconstruct the genomes of Escherichia coli, Saccharomyces cerevisiae, Arabidopsis thaliana, Drosophila melanogaster, and human from high-coverage SMRT sequencing. The resulting assemblies include fully resolved chromosome arms and close persistent gaps in these important reference genomes, including heterochromatic and telomeric transition sequences. For D. melanogaster, MHAP achieved a 600-fold speedup relative to prior methods and a cloud computing cost of a few hundred dollars. These results demonstrate that single-molecule sequencing alone can produce near-complete eukaryotic genomes at modest cost.
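
The locality-sensitive hashing idea named in this abstract can be sketched as a MinHash estimate of k-mer set similarity between two reads. This is an illustration only, not the MHAP implementation: the hash function (salted BLAKE2b from the Python standard library), the k-mer size, and the sketch size are assumptions.

```python
import hashlib
from typing import List, Set


def kmers(seq: str, k: int) -> Set[str]:
    """All distinct substrings of length k."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def _hash(kmer: str, seed: int) -> int:
    """Deterministic 64-bit hash; each seed acts as a separate hash function."""
    digest = hashlib.blake2b(
        kmer.encode(), digest_size=8, salt=seed.to_bytes(8, "little")
    ).digest()
    return int.from_bytes(digest, "little")


def sketch(seq: str, k: int = 16, num_hashes: int = 128) -> List[int]:
    """MinHash sketch: the minimum hash value of any k-mer, per hash function."""
    ks = kmers(seq, k)
    if not ks:
        raise ValueError("sequence is shorter than k")
    return [min(_hash(km, seed) for km in ks) for seed in range(num_hashes)]


def estimate_jaccard(a: List[int], b: List[int]) -> float:
    """Fraction of matching minima approximates the Jaccard similarity."""
    if len(a) != len(b):
        raise ValueError("sketches must have the same size")
    return sum(x == y for x, y in zip(a, b)) / len(a)
```

Comparing two fixed-size sketches costs the same regardless of read length, which is why this kind of estimate is attractive for finding candidate overlaps among many noisy long reads before running a full alignment.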

De Novo Assembly of the Streptomyces sp. Strain Mg1 Genome Using PacBio Single-Molecule Sequencing

Genome Announcements ◽

10.1128/genomea.00535-13 ◽

2013 ◽

Vol 1 (4) ◽

Cited By ~ 17

Author(s):

B. C. Hoefler ◽

K. Konganti ◽

P. D. Straight

Keyword(s):

Single Molecule ◽

De Novo Assembly ◽

De Novo ◽

Single Molecule Sequencing ◽

Streptomyces Sp

Detection and assembly of novel sequence insertions using Linked-Read technology

10.1101/551028 ◽

2019 ◽

Cited By ~ 3

Author(s):

Dmitry Meleshko ◽

Patrick Marks ◽

Stephen Williams ◽

Iman Hajirasouliha

Keyword(s):

Dna Sequences ◽

De Novo Assembly ◽

De Novo ◽

Supplementary Information ◽

Computational Techniques ◽

Whole Genome ◽

Structural Variations ◽

Short Read ◽

Link Type ◽

Long Read

Abstract Motivation: Emerging Linked-Read (aka read-cloud) technologies such as the 10x Genomics Chromium system have great potential for accurate detection and phasing of large-scale human genome structural variations (SVs). By leveraging the long-range information encoded in Linked-Read sequencing, computational techniques are able to detect and characterize complex structural variations that were previously undetectable by short-read methods. However, there is no available Linked-Read method for detection and assembly of novel sequence insertions, DNA sequences present in a given sequenced sample but missing in the reference genome, without requiring whole genome de novo assembly. In this paper, we propose a novel integrated alignment-based and local-assembly-based algorithm, Novel-X, that effectively uses the barcode information encoded in Linked-Read sequencing datasets to improve detection of such events without the need for whole genome de novo assembly. We evaluated our method on two haploid human genomes, CHM1 and CHM13, sequenced on the 10x Genomics Chromium system. These genomes have also recently been characterized with high coverage PacBio long reads. We also tested our method on NA12878, the well-known HapMap CEPH diploid genome, and the child genome in a Yoruba trio (NA19240), which was recently studied on multiple sequencing platforms. Detecting insertion events is very challenging using short reads, and the only viable available solution is long-read sequencing (e.g. PacBio or ONT). Our experiments, however, show that Novel-X finds many insertions that cannot be found by state-of-the-art tools using short-read sequencing data but are present in PacBio data. Since Linked-Read sequencing is significantly cheaper than long-read sequencing, our method using Linked-Reads enables routine large-scale screenings of sequenced genomes for novel sequence insertions. Availability: Software is freely available at https://github.com/1dayac/[email protected] Supplementary information: Supplementary data are available at https://github.com/1dayac/novel_insertions_supplementary
