proovframe: frameshift-correction for long-read (meta)genomics

Long-read sequencing technologies hold great promise for the genomic analysis of complex samples such as microbial communities. Yet, despite improving accuracy, basic gene prediction on long-read data is still often impaired by frameshifts resulting from small indels. Consensus polishing using either complementary short reads or, to a lesser extent, the long reads themselves can mitigate this effect but requires universally high sequencing depth, which is difficult to achieve in complex samples where the majority of community members are rare. Here we present proovframe, a software tool implementing an alternative approach to overcome frameshift errors in long-read assemblies and raw long reads. We use protein-to-nucleotide alignments against reference databases to pinpoint indels in contigs or reads and correct them by deleting or inserting 1-2 bases, thereby conservatively restoring reading-frame fidelity in aligned regions. Using simulated and real-world benchmark data, we show that proovframe performs comparably to short-read-based polishing on assembled data, works well with remote protein homologs, and can even be applied to raw reads directly. Together, our results demonstrate that protein-guided frameshift correction significantly improves the analyzability of long-read data, both in combination with and as an alternative to common polishing strategies. Proovframe is available from https://github.com/thackl/proovframe.
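The correction step described in this abstract (deleting or inserting 1-2 bases at indel sites pinpointed by a protein-to-nucleotide alignment) can be illustrated with a minimal sketch. This is not proovframe's implementation: the function name `correct_frameshifts`, the `(position, delta)` input format, and the use of `N` as the inserted placeholder base are all assumptions made for illustration.

```python
def correct_frameshifts(seq, shifts, filler="N"):
    """Restore the reading frame of a nucleotide sequence.

    seq    : nucleotide string (contig or read).
    shifts : iterable of (position, delta) pairs, 0-based (hypothetical format).
             delta > 0 : `delta` surplus bases start at `position`; delete them.
             delta < 0 : |delta| bases are missing before `position`; insert filler.
    filler : placeholder base used for insertions (an assumption of this sketch).
    """
    corrected = seq
    # Work from the right end so earlier edits do not shift later positions.
    for pos, delta in sorted(shifts, reverse=True):
        if abs(delta) not in (1, 2):
            raise ValueError("only 1-2 base corrections are supported")
        if delta > 0:
            corrected = corrected[:pos] + corrected[pos + delta:]
        else:
            corrected = corrected[:pos] + filler * (-delta) + corrected[pos:]
    return corrected


# Intended frame: ATG AAA CCC GGG TAA
# The noisy read has one surplus A at index 3 and one missing C before index 9.
noisy = "ATGAAAACCGGGTAA"
fixed = correct_frameshifts(noisy, [(3, 1), (9, -1)])
print(fixed)
```

In this toy case the corrected sequence is again a multiple of three bases long, so downstream gene prediction would see an intact open reading frame, with the unknown base masked rather than guessed.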

SVIM: Structural Variant Identification using Mapped Long Reads

10.1101/494096 ◽ 2018 ◽ Cited By ~ 2

Author(s): David Heller ◽ Martin Vingron

Keyword(s): Single Molecule ◽ Simulated Data ◽ Structural Variants ◽ Human Phenotype ◽ Structural Variant ◽ Small Indels ◽ Sequencing Technologies ◽ Long Reads ◽ Oxford Nanopore ◽ Long Read

Motivation: Structural variants are defined as genomic variants larger than 50 bp. They have been shown to affect more bases in any given genome than SNPs or small indels. Additionally, they have great impact on human phenotype and diversity and have been linked to numerous diseases. Due to their size and association with repeats, they are difficult to detect by shotgun sequencing, especially when based on short reads. Long-read, single-molecule sequencing technologies like those offered by Pacific Biosciences or Oxford Nanopore Technologies produce reads with a length of several thousand base pairs. Despite the higher error rate and sequencing cost, long-read sequencing offers many advantages for the detection of structural variants. Yet, available software tools still do not fully exploit the possibilities. Results: We present SVIM, a tool for the sensitive detection and precise characterization of structural variants from long-read data. SVIM consists of three components for the collection, clustering and combination of structural variant signatures from read alignments. It discriminates five different variant classes, including similar types such as tandem and interspersed duplications and novel element insertions. SVIM is unique in its capability of extracting both the genomic origin and destination of duplications. It compares favorably with existing tools in evaluations on simulated data and real datasets from PacBio and Nanopore sequencing machines. Availability and implementation: The source code and executables of SVIM are available on GitHub: github.com/eldariont/svim. SVIM has been implemented in Python 3 and published on bioconda and the Python Package Index.
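To make the idea of collecting structural variant signatures from read alignments concrete, the following sketch scans a SAM CIGAR string for long insertions and deletions. It is an illustration only and not SVIM's code: the function name `sv_signatures`, the tuple layout, and treating the size threshold as an inclusive, adjustable parameter are assumptions.

```python
import re

# One CIGAR operation: a length followed by an operation letter (SAM format).
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")
# Operations that advance the position on the reference sequence.
REF_CONSUMING = set("MDN=X")


def sv_signatures(cigar, ref_start, min_size=50):
    """Return (type, reference_position, length) for long indels in one alignment.

    cigar     : CIGAR string of a single read alignment.
    ref_start : 0-based reference coordinate where the alignment begins.
    min_size  : smallest indel reported as a signature (assumed threshold).
    """
    signatures = []
    ref_pos = ref_start
    for length, op in CIGAR_OP.findall(cigar):
        length = int(length)
        if op in "ID" and length >= min_size:
            kind = "INS" if op == "I" else "DEL"
            signatures.append((kind, ref_pos, length))
        if op in REF_CONSUMING:
            ref_pos += length
    return signatures


# A 60 bp deletion and an 80 bp insertion qualify; the 5 bp insertion does not.
found = sv_signatures("100M60D20M5I30M80I10M", ref_start=1000)
print(found)
```

Signatures gathered this way from many reads would then need to be clustered by position and type before a variant is called; that step is outside the scope of this sketch.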

Hapo-G, haplotype-aware polishing of genome assemblies with accurate reads

NAR Genomics and Bioinformatics ◽ 10.1093/nargab/lqab034 ◽ 2021 ◽ Vol 3 (2)

Author(s): Jean-Marc Aury ◽ Benjamin Istace

Keyword(s): Single Molecule ◽ Direct Consequence ◽ High Quality ◽ Sequencing Errors ◽ Coding Regions ◽ Sequencing Technologies ◽ Long Reads ◽ Oxford Nanopore ◽ Long Read ◽ Genome Assemblies

Single-molecule sequencing technologies have recently been commercialized by Pacific Biosciences and Oxford Nanopore with the promise of sequencing long DNA fragments (on the order of kilobases to megabases) and then, using efficient algorithms, providing high-quality assemblies in terms of contiguity and completeness of repetitive regions. However, the error rate of long-read technologies is higher than that of short-read technologies. This has a direct consequence on the base quality of genome assemblies, particularly in coding regions where sequencing errors can disrupt the coding frame of genes. In the case of diploid genomes, the consensus of a given gene can be a mixture of the two haplotypes, which can lead to premature stop codons. Several methods have been developed to polish genome assemblies using short reads; generally, they inspect the nucleotides one by one and provide a correction for each nucleotide of the input assembly. As a result, these algorithms are not able to properly process diploid genomes and typically switch from one haplotype to another. Here we propose Hapo-G (Haplotype-Aware Polishing Of Genomes), a new algorithm capable of incorporating phasing information from high-quality reads (short or long reads) to polish genome assemblies, in particular assemblies of diploid and heterozygous genomes.

Expanding an expanded genome: long-read sequencing of Trypanosoma cruzi

10.1101/279174 ◽ 2018 ◽ Cited By ~ 3

Author(s): Luisa Berná ◽ Matías Rodríguez ◽ María Laura Chiribao ◽ Adriana Parodi-Talice ◽ Sebastián Pita ◽ ...

Keyword(s): Repetitive Sequences ◽ Gc Content ◽ Gene Copy ◽ Accurate Estimation ◽ Sequencing Technologies ◽ Copy Numbers ◽ Long Reads ◽ Long Read ◽ Large Clusters ◽ Gene Copy Numbers

Although the genome of Trypanosoma cruzi, the causative agent of Chagas disease, was first made available in 2005, with additional strains reported later, the intrinsic genome complexity of this parasite (abundance of repetitive sequences and genes organized in tandem) has traditionally hindered high-quality genome assembly and annotation. This also limits diverse types of analyses that require a high degree of precision. Long reads generated by third-generation sequencing technologies are particularly suitable to address the challenges associated with T. cruzi's genome, since they permit directly determining the full sequence of large clusters of repetitive sequences without collapsing them. This, in turn, not only allows accurate estimation of gene copy numbers but also circumvents assembly fragmentation. Here, we present the analysis of the genome sequences of two T. cruzi clones: the hybrid TCC (DTU TcVI) and the non-hybrid Dm28c (DTU TcI), determined by PacBio SMRT technology. The improved assemblies obtained here permitted us to accurately estimate gene copy numbers and the abundance and distribution of repetitive sequences (including satellites and retroelements). We found that the genome of T. cruzi is composed of a "core compartment" and a "disruptive compartment", which exhibit opposite gene and GC content composition. New tandem and dispersed repetitive sequences were identified, including some located inside coding sequences. Additionally, homologous chromosomes were separately assembled, allowing us to retrieve haplotypes as separate contigs instead of a unique mosaic sequence. Finally, manual annotation of the surface multigene families MUC and trans-sialidases now allows a better overview of these complex groups of genes.

Complete Genome Resequencing of Thermus thermophilus Strain TMY by Hybrid Assembly of Long- and Short-Read Sequencing Technologies

Microbiology Resource Announcements ◽ 10.1128/mra.00979-21 ◽ 2021 ◽ Vol 10 (46)

Author(s): Kentaro Miyazaki ◽ Natsuko Tokito

Keyword(s): Complete Genome ◽ Thermus Thermophilus ◽ Genomic Analysis ◽ Comparative Genomic ◽ Hybrid Assembly ◽ Genome Resequencing ◽ Short Read ◽ Content Type ◽ Sequencing Technologies ◽ Long Read

Complete genome resequencing was conducted for Thermus thermophilus strain TMY by hybrid assembly of Oxford Nanopore Technologies long-read and MGI short-read data. Errors in the previously reported genome sequence determined by PacBio technology alone were corrected, allowing for high-quality comparative genomic analysis of closely related T. thermophilus genomes.

Highly accurate long-read HiFi sequencing data for five complex genomes

Scientific Data ◽ 10.1038/s41597-020-00743-4 ◽ 2020 ◽ Vol 7 (1)

Author(s): Ting Hon ◽ Kristin Mars ◽ Greg Young ◽ Yu-Chih Tsai ◽ Joseph W. Karalius ◽ ...

Keyword(s): Sequence Data ◽ Genome Structure ◽ Data Sets ◽ Sequencing Data ◽ Complex Samples ◽ Bioinformatic Tools ◽ Long Reads ◽ Sequencing Method ◽ Sample Data ◽ Long Read

The PacBio® HiFi sequencing method yields highly accurate long-read sequencing datasets with read lengths averaging 10–25 kb and accuracies greater than 99.5%. These accurate long reads can be used to improve results for complex applications such as single nucleotide and structural variant detection, genome assembly, assembly of difficult polyploid or highly repetitive genomes, and assembly of metagenomes. Currently, there is a need for sample datasets both to evaluate the benefits of these long accurate reads and to develop bioinformatic tools, including genome assemblers, variant callers, and haplotyping algorithms. We present deep-coverage HiFi datasets for five complex samples, including the two inbred model genomes Mus musculus and Zea mays, as well as two complex genomes, octoploid Fragaria × ananassa and the diploid anuran Rana muscosa. Additionally, we release sequence data from a mock metagenome community. The datasets reported here can be used without restriction to develop new algorithms and explore complex genome structure and evolution. Data were generated on the PacBio Sequel II System.

Evolutionary superscaffolding and chromosome anchoring to improve Anopheles genome assemblies

BMC Biology ◽ 10.1186/s12915-019-0728-3 ◽ 2020 ◽ Vol 18 (1) ◽ Cited By ~ 9

Author(s): Robert M. Waterhouse ◽ Sergey Aganezov ◽ Yoann Anselmetti ◽ Jiyoung Lee ◽ Livio Ruzzante ◽ ...

Keyword(s): Anopheles Funestus ◽ Genomic Analysis ◽ Comparative Genomic ◽ Financial Barriers ◽ Complementary Method ◽ Gene Synteny ◽ Sequencing Technologies ◽ Complementary Approach ◽ Long Read ◽ Gene Order Conservation

Background: New sequencing technologies have lowered financial barriers to whole genome sequencing, but resulting assemblies are often fragmented and far from ‘finished’. Updating multi-scaffold drafts to chromosome-level status can be achieved through experimental mapping or re-sequencing efforts. Avoiding the costs associated with such approaches, comparative genomic analysis of gene order conservation (synteny) to predict scaffold neighbours (adjacencies) offers a potentially useful complementary method for improving draft assemblies. Results: We evaluated and employed 3 gene synteny-based methods applied to 21 Anopheles mosquito assemblies to produce consensus sets of scaffold adjacencies. For subsets of the assemblies, we integrated these with additional supporting data to confirm and complement the synteny-based adjacencies: 6 with physical mapping data that anchor scaffolds to chromosome locations, 13 with paired-end RNA sequencing (RNAseq) data, and 3 with new assemblies based on re-scaffolding or long-read data. Our combined analyses produced 20 new superscaffolded assemblies with improved contiguities: 7 for which assignments of non-anchored scaffolds to chromosome arms span more than 75% of the assemblies, and a further 7 with chromosome anchoring including an 88% anchored Anopheles arabiensis assembly and, respectively, 73% and 84% anchored assemblies with comprehensively updated cytogenetic photomaps for Anopheles funestus and Anopheles stephensi. Conclusions: Experimental data from probe mapping, RNAseq, or long-read technologies, where available, all contribute to successful upgrading of draft assemblies. Our evaluations show that gene synteny-based computational methods represent a valuable alternative or complementary approach. Our improved Anopheles reference assemblies highlight the utility of applying comparative genomics approaches to improve community genomic resources.

LRSDAY: Long-read Sequencing Data Analysis for Yeasts

10.1101/184572 ◽ 2017

Author(s): Jia-Xing Yue ◽ Gianni Liti

Keyword(s): Genome Assembly ◽ Model Organism ◽ Sequencing Data ◽ Protein Coding ◽ Sequencing Technologies ◽ Long Reads ◽ Long Read ◽ Downstream Analysis ◽ Eukaryotic Organisms ◽ Genomic Regions

Long-read sequencing technologies have become increasingly popular in genome projects due to their strengths in resolving complex genomic regions. As a leading model organism with small genome size and great biotechnological importance, the budding yeast Saccharomyces cerevisiae has many isolates currently being sequenced with long reads. However, analyzing long-read sequencing data to produce high-quality genome assembly and annotation remains challenging. Here we present LRSDAY, the first one-stop solution to streamline this process. LRSDAY can produce chromosome-level end-to-end genome assembly and comprehensive annotations for various genomic features (including centromeres, protein-coding genes, tRNAs, transposable elements and telomere-associated elements) that are ready for downstream analysis. Although tailored for S. cerevisiae, we designed LRSDAY to be highly modular and customizable, making it adaptable for virtually any eukaryotic organism. Applying LRSDAY to an S. cerevisiae strain takes ∼43 hrs to generate a complete and well-annotated genome from ∼100X Pacific Biosciences (PacBio) reads using four threads.

Evaluation of Germline Structural Variant Calling Methods for Nanopore Sequencing Data

Frontiers in Genetics ◽ 10.3389/fgene.2021.761791 ◽ 2021 ◽ Vol 12

Author(s): Davide Bolognini ◽ Alberto Magi

Keyword(s): Variant Calling ◽ Research Report ◽ Nanopore Sequencing ◽ Sequencing Data ◽ Factors Affecting ◽ Sequencing Technologies ◽ Long Reads ◽ Oxford Nanopore ◽ Sequencing Studies ◽ Long Read

Structural variants (SVs) are genomic rearrangements that involve at least 50 nucleotides and are known to have a serious impact on human health. While prior short-read sequencing technologies have often proved inadequate for a comprehensive assessment of structural variation, more recent long reads from Oxford Nanopore Technologies have already been proven invaluable for the discovery of large SVs and hold the potential to facilitate the resolution of the full SV spectrum. With many long-read sequencing studies to follow, it is crucial to assess factors affecting current SV calling pipelines for nanopore sequencing data. In this brief research report, we evaluate and compare the performances of five long-read SV callers across four long-read aligners using both real and synthetic nanopore datasets. In particular, we focus on the effects of read alignment, sequencing coverage, and variant allele depth on the detection and genotyping of SVs of different types and size ranges and provide insights into precision and recall of SV callsets generated by integrating the various long-read aligners and SV callers. The computational pipeline we propose is publicly available at https://github.com/davidebolo1993/EViNCe and can be adjusted to further evaluate future nanopore sequencing datasets.
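As a rough illustration of how precision and recall of an SV callset can be computed against a truth set, consider the sketch below. It is not taken from the EViNCe pipeline: the function name `precision_recall`, the `(chromosome, position, type)` tuples, and the 500 bp matching tolerance are assumptions chosen for the example.

```python
def precision_recall(calls, truth, max_dist=500):
    """Compare an SV callset with a truth set.

    calls, truth : lists of (chromosome, position, sv_type) tuples.
    max_dist     : largest breakpoint distance, in bp, still counted as a match
                   (assumed value; real benchmarks choose their own criteria).
    Each truth event can be matched by at most one call.
    """
    matched = set()
    true_positives = 0
    for chrom, pos, sv_type in calls:
        for i, (t_chrom, t_pos, t_type) in enumerate(truth):
            if i in matched:
                continue
            if chrom == t_chrom and sv_type == t_type and abs(pos - t_pos) <= max_dist:
                matched.add(i)
                true_positives += 1
                break
    precision = true_positives / len(calls) if calls else 0.0
    recall = true_positives / len(truth) if truth else 0.0
    return precision, recall


truth_set = [("chr1", 10_000, "DEL"), ("chr1", 50_000, "INS"),
             ("chr2", 7_500, "DEL"), ("chr3", 1_200, "INV")]
call_set = [("chr1", 10_120, "DEL"), ("chr1", 50_030, "INS"),
            ("chr2", 90_000, "DEL")]
precision, recall = precision_recall(call_set, truth_set)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Here two of the three calls match a truth event, and two of the four truth events are recovered. A fuller comparison would also take SV length and genotype into account.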

LoRTE: Detecting transposon-induced genomic variants using low coverage PacBio long read sequences

10.1101/073551 ◽ 2016

Author(s): Eric Disdero ◽ Jonathan Filée

Keyword(s): Transposable Elements ◽ Reference Genome ◽ Genomic Analysis ◽ Bioinformatic Tools ◽ Sequencing Technologies ◽ Population Genomic ◽ Long Read ◽ Different Strains ◽ Low Coverage ◽ Ncbi Blast

Motivation: Population genomic analysis of transposable elements has greatly benefited from recent advances in sequencing technologies. However, the propensity of transposable elements to nest in highly repeated regions of genomes limits the efficiency of bioinformatic tools when short-read sequencing technology is used. Results: LoRTE is the first tool able to use PacBio long-read sequences to identify transposon deletions and insertions between a reference genome and genomes of different strains or populations. Tested against Drosophila melanogaster PacBio datasets, LoRTE appears to be a reliable and broadly applicable tool to study the dynamic and evolutionary impact of transposable elements using low-coverage, long-read sequences. Availability and implementation: LoRTE is available at http://www.egce.cnrs-gif.fr/?p=6422. It is written in Python 2.7 and only requires the NCBI BLAST+ package. LoRTE can be used on a standard computer with limited RAM resources and reasonable running time, even with large datasets.

QAlign: Aligning nanopore reads accurately using current-level modeling

10.1101/862813 ◽ 2019

Author(s): Dhaivat Joshi ◽ Shunfu Mao ◽ Sreeram Kannan ◽ Suhas Diggavi

Keyword(s): Reference Genome ◽ Genomic Analysis ◽ Vital Role ◽ High Error Rate ◽ Sequencing Technology ◽ Long Reads ◽ A Genome ◽ Long Read ◽ Nanopore Sequencer ◽ Sequencing Process

Motivation: Efficient and accurate alignment of DNA/RNA sequence reads to each other or to a reference genome/transcriptome is an important problem in genomic analysis. Nanopore sequencing has emerged as a major sequencing technology, and many long-read aligners have been designed for aligning nanopore reads. However, the high error rate makes accurate and efficient alignment difficult. Properly utilizing the noise and error characteristics inherent in the sequencing process can play a vital role in constructing a robust aligner. In this paper, we design QAlign, a pre-processor that can be used with any long-read aligner for aligning long reads to a genome/transcriptome or to other long reads. The key idea in QAlign is to convert the nucleotide reads into discretized current levels that capture the error modes of the nanopore sequencer before running them through a sequence aligner. Results: We show that QAlign is able to improve alignment rates from around 80% up to 90% with nanopore reads when aligning to the genome. We also show that QAlign improves the average overlap quality by 9.2%, 2.5% and 10.8% in three real datasets for read-to-read alignment. Read-to-transcriptome alignment rates are improved from 51.6% to 75.4% and 82.6% to 90% in two real datasets. Availability: https://github.com/joshidhaivat/QAlign.git
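The key idea of converting nucleotide reads into discretized current levels can be sketched as follows. This is a toy illustration and not QAlign's implementation: the per-base current values, the 2-mer model built from them, the thresholds, and the function name `quantize_read` are all made up for the example. A real pipeline would use a measured k-mer current model for the pore in question.

```python
from bisect import bisect_right
from itertools import product


def quantize_read(read, kmer_levels, thresholds, k):
    """Map each k-mer of a read to a discrete current-level symbol.

    read        : nucleotide string.
    kmer_levels : dict from k-mer to expected current level (a model lookup).
    thresholds  : sorted cut points; n thresholds give n + 1 symbols (0..n).
    k           : k-mer length used by the model.
    """
    symbols = []
    for i in range(len(read) - k + 1):
        level = kmer_levels[read[i:i + k]]
        # bisect_right counts how many thresholds lie at or below the level.
        symbols.append(bisect_right(thresholds, level))
    return symbols


# Toy model (made-up numbers): a k-mer's level is the mean of per-base values.
BASE_LEVEL = {"A": 80.0, "C": 95.0, "G": 105.0, "T": 120.0}
K = 2
toy_model = {
    "".join(kmer): sum(BASE_LEVEL[b] for b in kmer) / K
    for kmer in product("ACGT", repeat=K)
}

quantized = quantize_read("ACGT", toy_model, thresholds=[90.0, 110.0], k=K)
print(quantized)
```

Both reads and reference would be quantized the same way, so an existing aligner can compare them in the reduced alphabet, where k-mers with similar current levels collapse to the same symbol.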
