Overlapping long sequence reads: Current innovations and challenges in developing sensitive, specific and scalable algorithms

AbstractIdentifying overlaps between error-prone long reads, specifically those from Oxford Nanopore Technologies (ONT) and Pacific Biosciences (PB), is essential for certain downstream applications, including error correction and de novo assembly. Though akin to the read-to-reference alignment problem, read-to-read overlap detection is a distinct problem that can benefit from specialized algorithms that perform efficiently and robustly on high error rate long reads. Here, we review the current state-of-the-art read-to-read overlap tools for error-prone long reads, including BLASR, DALIGNER, MHAP, GraphMap, and Minimap. These specialized bioinformatics tools differ not just in their algorithmic designs and methodology, but also in their robustness of performance on a variety of datasets, time and memory efficiency, and scalability. We highlight the algorithmic features of these tools, as well as their potential issues and biases when utilizing any particular method. We benchmarked these tools, tracking their resource needs and computational performance, and assessed the specificity and precision of each. The concepts surveyed may apply to future sequencing technologies, as scalability is becoming more relevant with increased sequencing [email protected]; [email protected] informationSupplementary data are available at Bioinformatics online.

Download Full-text

BleTIES: Annotation of natural genome editing in ciliates using long read sequencing

10.1101/2021.05.18.444610 ◽

2021 ◽

Author(s):

Brandon K. B. Seah ◽

Estienne C. Swart

Keyword(s):

Dna Sequences ◽

Sequence Data ◽

Low Complexity ◽

Supplementary Information ◽

Neighboring Element ◽

Sequencing Technologies ◽

Long Reads ◽

Oxford Nanopore ◽

Long Read ◽

Element Elimination

Ciliates are single-celled eukaryotes that eliminate specific, interspersed DNA sequences (internally eliminated sequences, IESs) from their genomes during development. These are challenging to annotate and assemble because IES-containing sequences are much less abundant in the cell than those without, and IES sequences themselves often contain repetitive and low-complexity sequences. Long read sequencing technologies from Pacific Biosciences and Oxford Nanopore have the potential to reconstruct longer IESs than has been possible with short reads, and also the ability to detect correlations of neighboring element elimination. Here we present BleTIES, a software toolkit for detecting, assembling, and analyzing IESs using mapped long reads. Availability and implementation: BleTIES is implemented in Python 3. Source code is available at https://github.com/Swart-lab/bleties (MIT license), and also distributed via Bioconda. Contact: [email protected] Supplementary information: Benchmarking of BleTIES with published sequence data.

Download Full-text

Robust Benchmark Structural Variant Calls of An Asian Using the State-of-Art Long Fragment Sequencing Technologies

10.1101/2020.08.10.245308 ◽

2020 ◽

Author(s):

Xiao Du ◽

Lili Li ◽

Fan Liang ◽

Sanyang Liu ◽

Wenxin Zhang ◽

...

Keyword(s):

B Lymphocyte ◽

De Novo ◽

Structural Variants ◽

High Confidence ◽

False Negatives ◽

Sequencing Technologies ◽

Long Reads ◽

Oxford Nanopore ◽

Long Read ◽

Circular Consensus Sequencing

AbstractThe importance of structural variants (SVs) on phenotypes and human diseases is now recognized. Although a variety of SV detection platforms and strategies that vary in sensitivity and specificity have been developed, few benchmarking procedures are available to confidently assess their performances in biological and clinical research. To facilitate the validation and application of those approaches, our work established an Asian reference material comprising identified benchmark regions and high-confidence SV calls. We established a high-confidence SV callset with 8,938 SVs in an EBV immortalized B lymphocyte line, by integrating four alignment-based SV callers [from 109× PacBio continuous long read (CLR), 22× PacBio circular consensus sequencing (CCS) reads, 104× Oxford Nanopore long reads, and 114× optical mapping platform (Bionano)] and one de novo assembly-based SV caller using CCS reads. A total of 544 randomly selected SVs were validated by PCR and Sanger sequencing, proofing the robustness of our SV calls. Combining trio-binning based haplotype assemblies, we established an SV benchmark for identification of false negatives and false positives by constructing the continuous high confident regions (CHCRs), which cover 1.46Gb and 6,882 SVs supported by at least one diploid haplotype assembly. Establishing high-confidence SV calls for a benchmark sample that has been characterized by multiple technologies provides a valuable resource for investigating SVs in human biology, disease, and clinical diagnosis.

Download Full-text

SVJedi: genotyping structural variations with long reads

Bioinformatics ◽

10.1093/bioinformatics/btaa527 ◽

2020 ◽

Vol 36 (17) ◽

pp. 4568-4575

Author(s):

Lolita Lecompte ◽

Pierre Peterlongo ◽

Dominique Lavenier ◽

Claire Lemaitre

Keyword(s):

Supplementary Information ◽

Sequencing Data ◽

Structural Variations ◽

Short Read ◽

Sequencing Technologies ◽

Long Reads ◽

Oxford Nanopore ◽

Clinical Diagnoses ◽

Long Read ◽

The One

Abstract Motivation Studies on structural variants (SVs) are expanding rapidly. As a result, and thanks to third generation sequencing technologies, the number of discovered SVs is increasing, especially in the human genome. At the same time, for several applications such as clinical diagnoses, it is important to genotype newly sequenced individuals on well-defined and characterized SVs. Whereas several SV genotypers have been developed for short read data, there is a lack of such dedicated tool to assess whether known SVs are present or not in a new long read sequenced sample, such as the one produced by Pacific Biosciences or Oxford Nanopore Technologies. Results We present a novel method to genotype known SVs from long read sequencing data. The method is based on the generation of a set of representative allele sequences that represent the two alleles of each structural variant. Long reads are aligned to these allele sequences. Alignments are then analyzed and filtered out to keep only informative ones, to quantify and estimate the presence of each SV allele and the allele frequencies. We provide an implementation of the method, SVJedi, to genotype SVs with long reads. The tool has been applied to both simulated and real human datasets and achieves high genotyping accuracy. We show that SVJedi obtains better performances than other existing long read genotyping tools and we also demonstrate that SV genotyping is considerably improved with SVJedi compared to other approaches, namely SV discovery and short read SV genotyping approaches. Availability and implementation https://github.com/llecompte/SVJedi.git Supplementary information Supplementary data are available at Bioinformatics online.

Download Full-text

BleTIES: Annotation of natural genome editing in ciliates using long read sequencing

Bioinformatics ◽

10.1093/bioinformatics/btab613 ◽

2021 ◽

Author(s):

Brandon K B Seah ◽

Estienne C Swart

Keyword(s):

Dna Sequences ◽

Sequence Data ◽

Low Complexity ◽

Supplementary Information ◽

Software Toolkit ◽

Assembly Strategy ◽

Sequencing Technologies ◽

Long Reads ◽

Oxford Nanopore ◽

Long Read

Abstract Summary Ciliates are single-celled eukaryotes that eliminate specific, interspersed DNA sequences (internally eliminated sequences, IESs) from their genomes during development. These are challenging to annotate and assemble because IES-containing sequences are typically much less abundant in the cell than those without, and IES sequences themselves often contain repetitive and low-complexity sequences. Long read sequencing technologies from Pacific Biosciences and Oxford Nanopore have the potential to reconstruct longer IESs than has been possible with short reads, but require a different assembly strategy. Here we present BleTIES, a software toolkit for detecting, assembling, and analyzing IESs using mapped long reads. Availability and implementation BleTIES is implemented in Python 3. Source code is available at https://github.com/Swart-lab/bleties (MIT license), and also distributed via Bioconda. Supplementary information Benchmarking of BleTIES with published sequence data.

Download Full-text

Hapo-G, haplotype-aware polishing of genome assemblies with accurate reads

NAR Genomics and Bioinformatics ◽

10.1093/nargab/lqab034 ◽

2021 ◽

Vol 3 (2) ◽

Author(s):

Jean-Marc Aury ◽

Benjamin Istace

Keyword(s):

Single Molecule ◽

Direct Consequence ◽

High Quality ◽

Sequencing Errors ◽

Coding Regions ◽

Sequencing Technologies ◽

Long Reads ◽

Oxford Nanopore ◽

Long Read ◽

Genome Assemblies

Abstract Single-molecule sequencing technologies have recently been commercialized by Pacific Biosciences and Oxford Nanopore with the promise of sequencing long DNA fragments (kilobases to megabases order) and then, using efficient algorithms, provide high quality assemblies in terms of contiguity and completeness of repetitive regions. However, the error rate of long-read technologies is higher than that of short-read technologies. This has a direct consequence on the base quality of genome assemblies, particularly in coding regions where sequencing errors can disrupt the coding frame of genes. In the case of diploid genomes, the consensus of a given gene can be a mixture between the two haplotypes and can lead to premature stop codons. Several methods have been developed to polish genome assemblies using short reads and generally, they inspect the nucleotide one by one, and provide a correction for each nucleotide of the input assembly. As a result, these algorithms are not able to properly process diploid genomes and they typically switch from one haplotype to another. Herein we proposed Hapo-G (Haplotype-Aware Polishing Of Genomes), a new algorithm capable of incorporating phasing information from high-quality reads (short or long-reads) to polish genome assemblies and in particular assemblies of diploid and heterozygous genomes.

Download Full-text

Apollo: a sequencing-technology-independent, scalable and accurate assembly polishing algorithm

Bioinformatics ◽

10.1093/bioinformatics/btaa179 ◽

2020 ◽

Vol 36 (12) ◽

pp. 3669-3679 ◽

Cited By ~ 3

Author(s):

Can Firtina ◽

Jeremie S Kim ◽

Mohammed Alser ◽

Damla Senol Cali ◽

A Ercument Cicek ◽

...

Keyword(s):

Genome Analysis ◽

Supplementary Information ◽

Third Generation ◽

Sequencing Technology ◽

Base Pairs ◽

Sequencing Technologies ◽

Third Generation Sequencing ◽

Long Reads ◽

Generation Sequencing ◽

Large Genomes

Abstract Motivation Third-generation sequencing technologies can sequence long reads that contain as many as 2 million base pairs. These long reads are used to construct an assembly (i.e. the subject’s genome), which is further used in downstream genome analysis. Unfortunately, third-generation sequencing technologies have high sequencing error rates and a large proportion of base pairs in these long reads is incorrectly identified. These errors propagate to the assembly and affect the accuracy of genome analysis. Assembly polishing algorithms minimize such error propagation by polishing or fixing errors in the assembly by using information from alignments between reads and the assembly (i.e. read-to-assembly alignment information). However, current assembly polishing algorithms can only polish an assembly using reads from either a certain sequencing technology or a small assembly. Such technology-dependency and assembly-size dependency require researchers to (i) run multiple polishing algorithms and (ii) use small chunks of a large genome to use all available readsets and polish large genomes, respectively. Results We introduce Apollo, a universal assembly polishing algorithm that scales well to polish an assembly of any size (i.e. both large and small genomes) using reads from all sequencing technologies (i.e. second- and third-generation). Our goal is to provide a single algorithm that uses read sets from all available sequencing technologies to improve the accuracy of assembly polishing and that can polish large genomes. Apollo (i) models an assembly as a profile hidden Markov model (pHMM), (ii) uses read-to-assembly alignment to train the pHMM with the Forward–Backward algorithm and (iii) decodes the trained model with the Viterbi algorithm to produce a polished assembly. Our experiments with real readsets demonstrate that Apollo is the only algorithm that (i) uses reads from any sequencing technology within a single run and (ii) scales well to polish large assemblies without splitting the assembly into multiple parts. Availability and implementation Source code is available at https://github.com/CMU-SAFARI/Apollo. Supplementary information Supplementary data are available at Bioinformatics online.

Download Full-text

MUM&Co: accurate detection of all SV types through whole-genome alignment

Bioinformatics ◽

10.1093/bioinformatics/btaa115 ◽

2020 ◽

Vol 36 (10) ◽

pp. 3242-3243 ◽

Cited By ~ 2

Author(s):

Samuel O’Donnell ◽

Gilles Fischer

Keyword(s):

De Novo ◽

Supplementary Information ◽

Genome Alignment ◽

Whole Genome ◽

Structural Variations ◽

Sequencing Technologies ◽

Third Generation Sequencing ◽

Human Genomes ◽

Whole Genome Alignment ◽

Primary Output

Abstract Summary MUM&Co is a single bash script to detect structural variations (SVs) utilizing whole-genome alignment (WGA). Using MUMmer’s nucmer alignment, MUM&Co can detect insertions, deletions, tandem duplications, inversions and translocations greater than 50 bp. Its versatility depends upon the WGA and therefore benefits from contiguous de-novo assemblies generated by third generation sequencing technologies. Benchmarked against five WGA SV-calling tools, MUM&Co outperforms all tools on simulated SVs in yeast, plant and human genomes and performs similarly in two real human datasets. Additionally, MUM&Co is particularly unique in its ability to find inversions in both simulated and real datasets. Lastly, MUM&Co’s primary output is an intuitive tabulated file containing a list of SVs with only necessary genomic details. Availability and implementation https://github.com/SAMtoBAM/MUMandCo. Supplementary information Supplementary data are available at Bioinformatics online.

Download Full-text

De novo Assembly of the Brugia malayi Genome Using Long Reads from a Single MinION Flowcell

Scientific Reports ◽

10.1038/s41598-019-55908-y ◽

2019 ◽

Vol 9 (1) ◽

Cited By ~ 3

Author(s):

Joseph R. Fauver ◽

John Martin ◽

Gary J. Weil ◽

Makedonka Mitreva ◽

Peter U. Fischer

Keyword(s):

Single Molecule ◽

New Technologies ◽

Reference Genome ◽

De Novo ◽

Complete Mitochondrial Genome ◽

Nuclear Genome ◽

Brugia Malayi ◽

Field Isolates ◽

Sequencing Technologies ◽

Long Reads

AbstractFilarial nematode infections cause a substantial global disease burden. Genomic studies of filarial worms can improve our understanding of their biology and epidemiology. However, genomic information from field isolates is limited and available reference genomes are often discontinuous. Single molecule sequencing technologies can reduce the cost of genome sequencing and long reads produced from these devices can improve the contiguity and completeness of genome assemblies. In addition, these new technologies can make generation and analysis of large numbers of field isolates feasible. In this study, we assessed the performance of the Oxford Nanopore Technologies MinION for sequencing and assembling the genome of Brugia malayi, a human parasite widely used in filariasis research. Using data from a single MinION flowcell, a 90.3 Mb nuclear genome was assembled into 202 contigs with an N50 of 2.4 Mb. This assembly covered 96.9% of the well-defined B. malayi reference genome with 99.2% identity. The complete mitochondrial genome was obtained with individual reads and the nearly complete genome of the endosymbiotic bacteria Wolbachia was assembled alongside the nuclear genome. Long-read data from the MinION produced an assembly that approached the quality of a well-established reference genome using comparably fewer resources.

Download Full-text

PhaseME: Automatic rapid assessment of phasing quality and phasing improvement

GigaScience ◽

10.1093/gigascience/giaa078 ◽

2020 ◽

Vol 9 (7) ◽

Cited By ~ 1

Author(s):

Sina Majidian ◽

Fritz J Sedlazeck

Keyword(s):

Rapid Assessment ◽

Linkage Data ◽

Universal Method ◽

High Quality ◽

Sequencing Technologies ◽

Long Reads ◽

Oxford Nanopore ◽

Linkage Information ◽

Quality Assessments

Abstract Background The detection of which mutations are occurring on the same DNA molecule is essential to predict their consequences. This can be achieved by phasing the genomic variations. Nevertheless, state-of-the-art haplotype phasing is currently a black box in which the accuracy and quality of the reconstructed haplotypes are hard to assess. Findings Here we present PhaseME, a versatile method to provide insights into and improvement of sample phasing results based on linkage data. We showcase the performance and the importance of PhaseME by comparing phasing information obtained from Pacific Biosciences including both continuous long reads and high-quality consensus reads, Oxford Nanopore Technologies, 10x Genomics, and Illumina sequencing technologies. We found that 10x Genomics and Oxford Nanopore phasing can be significantly improved while retaining a high N50 and completeness of phase blocks. PhaseME generates reports and summary plots to provide insights into phasing performance and correctness. We observed unique phasing issues for each of the sequencing technologies, highlighting the necessity of quality assessments. PhaseME is able to decrease the Hamming error rate significantly by 22.4% on average across all 5 technologies. Additionally, a significant improvement is obtained in the reduction of long switch errors. Especially for high-quality consensus reads, the improvement is 54.6% in return for only a 5% decrease in phase block N50 length. Conclusions PhaseME is a universal method to assess the phasing quality and accuracy and improves the quality of phasing using linkage information. The package is freely available at https://github.com/smajidian/phaseme.

Download Full-text

Minirmd: accurate and fast duplicate removal tool for short reads via multiple minimizers

Bioinformatics ◽

10.1093/bioinformatics/btaa915 ◽

2020 ◽

Author(s):

Yuansheng Liu ◽

Xiaocai Zhang ◽

Quan Zou ◽

Xiangxiang Zeng

Keyword(s):

High Throughput ◽

High Throughput Sequencing ◽

De Novo ◽

Supplementary Information ◽

Supplementary Data ◽

Complementary Strand ◽

Short Reads ◽

Sequencing Technologies ◽

Computational Resources

Abstract Summary Removing duplicate and near-duplicate reads, generated by high-throughput sequencing technologies, is able to reduce computational resources in downstream applications. Here we develop minirmd, a de novo tool to remove duplicate reads via multiple rounds of clustering using different length of minimizer. Experiments demonstrate that minirmd removes more near-duplicate reads than existing clustering approaches and is faster than existing multi-core tools. To the best of our knowledge, minirmd is the first tool to remove near-duplicates on reverse-complementary strand. Availability and implementation https://github.com/yuansliu/minirmd. Supplementary information Supplementary data are available at Bioinformatics online.

Download Full-text