NanoGalaxy: Nanopore long-read sequencing data analysis in Galaxy

Willem de Koning; Milad Miladi; Saskia Hiltemann; Astrid Heikema; John P Hays; Stephan Flemming; Marius van den Beek; Dana A Mustafa; Rolf Backofen; Björn Grüning; Andrew P Stubbs

doi:10.1093/gigascience/giaa105

NanoGalaxy: Nanopore long-read sequencing data analysis in Galaxy

GigaScience ◽

10.1093/gigascience/giaa105 ◽

2020 ◽

Vol 9 (10) ◽

Cited By ~ 1

Author(s):

Willem de Koning ◽

Milad Miladi ◽

Saskia Hiltemann ◽

Astrid Heikema ◽

John P Hays ◽

...

Keyword(s):

Genome Assembly ◽

Bioinformatics Analysis ◽

De Novo ◽

Sequence Data ◽

Ease Of Use ◽

Easy Access ◽

Complex Data ◽

Sequencing Data ◽

Long Read ◽

Sequencing Platforms

Abstract

Background: Long-read sequencing can be applied to generate very long contigs and even completely assembled genomes at relatively low cost and with minimal sample preparation. As a result, long-read sequencing platforms are becoming more popular. In this respect, the Oxford Nanopore Technologies–based long-read sequencing “nanopore” platform is becoming a widely used tool with a broad range of applications and end-users. However, the need to explore and manipulate the complex data generated by long-read sequencing platforms necessitates accompanying specialized bioinformatics platforms and tools to process the long-read data correctly. Importantly, such tools should additionally help democratize bioinformatics analysis by enabling easy access and ease-of-use solutions for researchers.

Results: The Galaxy platform provides a user-friendly interface to computational command line–based tools, handles the software dependencies, and provides refined workflows. The users do not have to possess programming experience or extended computer skills. The interface enables researchers to perform powerful bioinformatics analysis, including the assembly and analysis of short- or long-read sequence data. The newly developed “NanoGalaxy” is a Galaxy-based toolkit for analysing long-read sequencing data, which is suitable for diverse applications, including de novo genome assembly from genomic, metagenomic, and plasmid sequence reads.

Conclusions: A range of best-practice tools and workflows for long-read sequence genome assembly has been integrated into a NanoGalaxy platform to facilitate easy access and use of bioinformatics tools for researchers. NanoGalaxy is freely available at the European Galaxy server https://nanopore.usegalaxy.eu with supporting self-learning training material available at https://training.galaxyproject.org.


Fully phased human genome assembly without parental data using single-cell strand sequencing and long reads

Nature Biotechnology ◽

10.1038/s41587-020-0719-5 ◽

2020 ◽

Author(s):

David Porubsky ◽

Peter Ebert ◽

Peter A. Audano ◽

Mitchell R. Vollger ◽

...

Keyword(s):

Single Cell ◽

Genome Assembly ◽

De Novo ◽

Error Rates ◽

Sequencing Data ◽

Single Nucleotide Variants ◽

De Novo Genome Assembly ◽

Parental Data ◽

Human Genome Assembly ◽

Long Read

Abstract: Human genomes are typically assembled as consensus sequences that lack information on parental haplotypes. Here we describe a reference-free workflow for diploid de novo genome assembly that combines the chromosome-wide phasing and scaffolding capabilities of single-cell strand sequencing [1,2] with continuous long-read or high-fidelity [3] sequencing data. Employing this strategy, we produced a completely phased de novo genome assembly for each haplotype of an individual of Puerto Rican descent (HG00733) in the absence of parental data. The assemblies are accurate (quality value > 40) and highly contiguous (contig N50 > 23 Mbp) with low switch error rates (0.17%), providing fully phased single-nucleotide variants, indels and structural variants. A comparison of Oxford Nanopore Technologies and Pacific Biosciences phased assemblies identified 154 regions that are preferential sites of contig breaks, irrespective of sequencing technology or phasing algorithms.


Gene Annotation and Transcriptome Delineation on a de novo Genome Assembly for the Reference Leishmania major Friedlin Strain

10.20944/preprints202107.0562.v1 ◽

2021 ◽

Author(s):

Esther Camacho ◽

Sandra González-de la Fuente ◽

Jose C. Solana ◽

Alberto Rastrojo ◽

Fernando Carrasco-Ramiro ◽

...

Keyword(s):

Genome Sequence ◽

Genome Assembly ◽

Molecular Mechanisms ◽

High Throughput Sequencing ◽

Leishmania Major ◽

De Novo ◽

Gene Annotation ◽

Leishmania Species ◽

Long Read ◽

Sequencing Platforms

Leishmania major is the main causative agent of cutaneous leishmaniasis in humans. The Friedlin strain of this species (LmjF) was chosen when a multi-laboratory consortium undertook the objective of deciphering the first genome sequence for a parasite of the genus Leishmania. The objective was successfully attained in 2005, and this represented a milestone for Leishmania molecular biology studies around the world. Although the LmjF genome sequence was obtained following a shotgun strategy and using classical Sanger sequencing, the results were excellent and this genome assembly served as the reference for subsequent genome assemblies in other Leishmania species. Here, we present a new assembly for the genome of this strain (named LMJFC for clarity), generated by the combination of two high-throughput sequencing platforms: Illumina short-read sequencing and PacBio Single Molecule Real-Time (SMRT) sequencing, which provides long-read sequences. Apart from resolving uncertain nucleotide positions, several genomic regions have been reorganized and a more precise composition of tandemly repeated gene loci was attained. Additionally, the genome annotation has been improved by adding 542 genes and defining more accurate coding sequences for around two hundred genes, based on the transcriptome delimitation also carried out in this work. As a result, we are providing gene models (including untranslated regions and introns) for 11,238 genes. Genomic information ultimately determines the biology of every organism; therefore, our understanding of molecular mechanisms will depend on the availability of precise genome sequences and accurate gene annotations. In this regard, this work provides an improved genome sequence and updated transcriptome annotations for the reference L. major Friedlin strain.


A reference genome sequence resource for the sugar beet root rot pathogen Aphanomyces cochlioides

10.1101/2021.08.11.456025 ◽

2021 ◽

Author(s):

Jacob Botkin ◽

Ashok K Chanda ◽

Frank N Martin ◽

Cory D Hirsch

Keyword(s):

Sugar Beet ◽

Genome Assembly ◽

Root Rot ◽

De Novo ◽

Sequence Data ◽

Aphanomyces Cochlioides ◽

Beet Root ◽

Long Read ◽

Illumina Sequence ◽

Sugar Beet Root

Aphanomyces cochlioides, the causal agent of damping-off and root rot of sugar beet (Beta vulgaris L.), is a soil-dwelling oomycete responsible for yield losses in all major sugar beet growing regions. Currently, genomic resources for A. cochlioides are limited. Here we report a de novo genome assembly using a combination of long-read MinION (Oxford Nanopore Technologies) and short-read Illumina sequence data for A. cochlioides isolate 103-1, from Breckenridge, MN. The assembled genome was 76.3 Mb, with a contig N50 of 2.6 Mb. The reference assembly was annotated and was composed of 32.1% repetitive elements and 20,274 gene models. This high-quality genome assembly of A. cochlioides will be a valuable resource for understanding genetic variation, virulence factors, and comparative genomics of this important sugar beet pathogen.
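Several abstracts in this listing report a contig N50 as their contiguity measure. As a minimal sketch of how that statistic is commonly computed (the function name `n50` and the toy contig lengths are illustrative assumptions, not taken from any of the listed papers): sort the contigs from longest to shortest and report the length of the contig at which the running total first reaches half of the assembly size.

```python
def n50(lengths):
    """Return the N50 of a collection of contig lengths.

    N50 is the length L such that contigs of length >= L together
    cover at least half of the total assembly length.
    """
    if not lengths:
        raise ValueError("need at least one contig length")
    ordered = sorted(lengths, reverse=True)  # longest contig first
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length


# Hypothetical contig lengths (arbitrary units), for illustration only.
toy_contigs = [80, 70, 50, 40, 30, 20, 10]
print(n50(toy_contigs))
```

For the toy input the total length is 300, so the running sum reaches 150 at the second-longest contig and the function returns 70. Assembly evaluation tools may differ in edge-case handling, so this should be read as an illustration of the definition rather than a replacement for them.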


GALA: gap-free chromosome-scale assembly with long reads

10.1101/2020.05.15.097428 ◽

2020 ◽

Author(s):

Mohamed Awad ◽

Xiangchao Gan

Keyword(s):

Genome Assembly ◽

Reference Genome ◽

De Novo ◽

Genetic Maps ◽

Sequencing Data ◽

C Elegans ◽

Assembly Method ◽

Long Reads ◽

Long Read ◽

Assembly Technology

Abstract: High-quality genome assembly has wide applications in genetics and medical studies. However, it is still very challenging to achieve gap-free chromosome-scale assemblies using current workflows for long-read platforms. Here we propose GALA (Gap-free long-read assembler), a chromosome-by-chromosome assembly method implemented through a multi-layer computer graph that identifies mis-assemblies within preliminary assemblies or chimeric raw reads and partitions the data into chromosome-scale linkage groups. The subsequent independent assembly of each linkage group generates a gap-free assembly free from the mis-assembly errors which usually hamper existing workflows. This flexible framework also allows us to integrate data from various technologies, such as Hi-C, genetic maps, a reference genome and even motif analyses, to generate gap-free chromosome-scale assemblies. We de novo assembled the C. elegans and A. thaliana genomes using combined PacBio and Nanopore sequencing data from publicly available datasets. We also demonstrated the new method’s applicability with a gap-free assembly of a human genome with the help of a reference genome. In addition, GALA showed promising performance for PacBio high-fidelity long reads. Thus, our method enables straightforward assembly of genomes with multiple data sources and overcomes barriers that at present restrict the application of de novo genome assembly technology.


Nanopore sequencing enables near-complete de novo assembly of Saccharomyces cerevisiae reference strain CEN.PK113-7D

10.1101/175984 ◽

2017 ◽

Author(s):

Alex N. Salazar ◽

Arthur R. Gorter de Vries ◽

Marcel van den Broek ◽

Melanie Wijsman ◽

Pilar de la Torre Cortés ◽

...

Keyword(s):

Saccharomyces Cerevisiae ◽

Genome Assembly ◽

Genome Stability ◽

De Novo ◽

Growth Conditions ◽

Industrial Applications ◽

Sequencing Data ◽

Sequencing Platform ◽

Long Read ◽

2 Micron Plasmid

Abstract: The haploid Saccharomyces cerevisiae strain CEN.PK113-7D is a popular model system for metabolic engineering and systems biology research. Current genome assemblies are based on short-read sequencing data scaffolded based on homology to strain S288C. However, these assemblies contain large sequence gaps, particularly in subtelomeric regions, and the assumption of perfect homology to S288C for scaffolding introduces bias. In this study, we obtained a near-complete genome assembly of CEN.PK113-7D using only Oxford Nanopore Technologies’ MinION sequencing platform. Fifteen of the 16 chromosomes, the mitochondrial genome, and the 2-micron plasmid are assembled in single contigs and all but one chromosome starts or ends in a telomere cap. This improved genome assembly contains 770 Kbp of added sequence containing 248 gene annotations in comparison to the previous assembly of CEN.PK113-7D. Many of these genes encode functions determining fitness in specific growth conditions and are therefore highly relevant for various industrial applications. Furthermore, we discovered a translocation between chromosomes III and VIII which caused misidentification of a MAL locus in the previous CEN.PK113-7D assembly. This study demonstrates the power of long-read sequencing by providing a high-quality reference assembly and annotation of CEN.PK113-7D and places a caveat on assumed genome stability of microorganisms.


Using MinION nanopore sequencing to generate a de novo eukaryotic draft genome: preliminary physiological and genomic description of the extremophilic red alga Galdieria sulphuraria strain SAG 107.79

10.1101/076208 ◽

2016 ◽

Cited By ~ 5

Author(s):

Amanda M. Davis ◽

Manuela Iovinella ◽

Sally James ◽

Thomas Robshaw ◽

Jennifer R. Dodson ◽

...

Keyword(s):

Dna Sequences ◽

Genome Assembly ◽

De Novo ◽

Sequence Data ◽

Draft Genome ◽

Carbon Sources ◽

Red Alga ◽

Full Genome Sequencing ◽

Galdieria Sulphuraria ◽

Long Read

Abstract: We report here the de novo assembly of a eukaryotic genome using only MinION nanopore DNA sequence data by examining a novel Galdieria sulphuraria genome: strain SAG 107.79. This extremophilic red alga was targeted for full genome sequencing as we found that it could grow on a wide variety of carbon sources and could take up several precious and rare-earth metals, which makes it an interesting biological target for disparate industrial biotechnological uses. Phylogenetic analysis clearly places this as a species of G. sulphuraria. Here we additionally show that the genome assembly generated via nanopore long-read data was of high quality with regard to the low total number of contiguous DNA sequences and the long length of assemblies. Collectively, the MinION platform looks to rival other competing approaches for de novo genome acquisition with available informatics tools for assembly. The genome assembly is publicly released as NCBI BioProject PRJNA330791. Further work is needed to reduce small insertion-deletion errors, relative to short-read assemblies.


Long read-based de novo assembly of low complex metagenome samples results in finished genomes and reveals insights into strain diversity and an active phage system

10.1101/476747 ◽

2018 ◽

Cited By ~ 2

Author(s):

Vincent Somerville ◽

Stefanie Lutz ◽

Michael Schmid ◽

Daniel Frei ◽

Aline Moser ◽

...

Keyword(s):

Microbial Communities ◽

Genome Assembly ◽

De Novo ◽

Shotgun Sequencing ◽

Third Generation ◽

Sequencing Data ◽

Bacterial Genomes ◽

Functional Profiling ◽

Long Read ◽

Low Complex

Abstract

Background: Complete and contiguous genome assemblies greatly improve the quality of subsequent systems-wide functional profiling studies and the ability to gain novel biological insights. While a de novo genome assembly of an isolated bacterial strain is in most cases straightforward, more informative data about co-existing bacteria as well as synergistic and antagonistic effects can be obtained from a direct analysis of microbial communities. However, the complexity of metagenomic samples represents a major challenge. While third generation sequencing technologies have been suggested to enable finished metagenome-assembled-genomes, to our knowledge, the complete genome assembly of all dominant strains in a microbiome sample has not been shown so far. Natural whey starter cultures (NWCs) are used in the production of cheese and represent low complex microbiomes. Previous studies of Swiss Gruyère and selected Italian hard cheeses, mostly based on amplicon-based metagenomics, concurred that three species generally predominate: Streptococcus thermophilus, Lactobacillus helveticus and Lactobacillus delbrueckii.

Results: Two NWCs from Swiss Gruyère producers were subjected to whole metagenome shotgun sequencing using Pacific Biosciences Sequel, Oxford Nanopore Technologies MinION and Illumina MiSeq platforms. We achieved the complete assembly of all dominant bacterial genomes from these low complex NWCs, which was corroborated by a 16S rRNA based amplicon survey. Moreover, two distinct L. helveticus strains were successfully co-assembled from the same sample. Besides bacterial genomes, we could also assemble several bacterial plasmids as well as phages and a corresponding prophage. Biologically relevant insights could be uncovered by linking the plasmids and phages to their respective host genomes using DNA methylation motifs on the plasmids and by matching prokaryotic CRISPR spacers with the corresponding protospacers on the phages. These results could only be achieved by employing third generation, long-read sequencing data able to span intragenomic as well as intergenomic repeats.

Conclusions: Here, we demonstrate the feasibility of complete de novo genome assembly of all dominant strains from low complex NWCs based on whole metagenome shotgun sequencing data. This allowed us to gain novel biological insights and is a fundamental basis for subsequent systems-wide omic analyses, functional profiling and phenotype to genotype analysis of specific microbial communities.


A long reads-based de-novo assembly of the genome of the Arlee homozygous line reveals chromosomal rearrangements in rainbow trout

G3 Genes|Genome|Genetics ◽

10.1093/g3journal/jkab052 ◽

2021 ◽

Author(s):

Guangtu Gao ◽

Susana Magadan ◽

Geoffrey C Waldbieser ◽

Ramey C Youngblood ◽

Paul A Wheeler ◽

...

Keyword(s):

Rainbow Trout ◽

Chromosome Number ◽

Genome Assembly ◽

De Novo Assembly ◽

De Novo ◽

Sequence Data ◽

Structural Variations ◽

High Coverage ◽

Haploid Chromosome Number ◽

Long Reads

Abstract: Currently, there is still a need to improve the contiguity of the rainbow trout reference genome and to use multiple genetic backgrounds that will represent the genetic diversity of this species. The Arlee doubled haploid line originated from a domesticated hatchery strain that was originally collected from the northern California coast. The Canu pipeline was used to generate the Arlee line genome de-novo assembly from high coverage PacBio long-read sequence data. The assembly was further improved with Bionano optical maps and Hi-C proximity ligation sequence data to generate 32 major scaffolds corresponding to the karyotype of the Arlee line (2N = 64). The assembly is composed of 938 scaffolds with N50 of 39.16 Mb and a total length of 2.33 Gb, of which ∼95% was in 32 chromosome sequences with only 438 gaps between contigs and scaffolds. In rainbow trout the haploid chromosome number can vary from 29 to 32. In the Arlee karyotype the haploid chromosome number is 32 because chromosomes Omy04, 14 and 25 are divided into six acrocentric chromosomes. Additional structural variations that were identified in the Arlee genome included the major inversions on chromosomes Omy05 and Omy20 and 15 additional smaller inversions that will require further validation. This is also the first rainbow trout genome assembly that includes a scaffold with the sex-determination gene (sdY) in the chromosome Y sequence. The utility of this genome assembly is demonstrated through the improved annotation of the duplicated genome loci that harbor the IGH genes on chromosomes Omy12 and Omy13.


Comprehensive evaluation of non-hybrid genome assembly tools for third-generation PacBio long-read sequence data

Briefings in Bioinformatics ◽

10.1093/bib/bbx147 ◽

2017 ◽

Vol 20 (3) ◽

pp. 866-876 ◽

Cited By ~ 30

Author(s):

Vasanthan Jayakumar ◽

Yasubumi Sakakibara

Keyword(s):

Genome Assembly ◽

Comprehensive Evaluation ◽

Sequence Data ◽

Third Generation ◽

Hybrid Genome ◽

Long Read


Chromosome-level assembly of Drosophila bifasciata reveals important karyotypic transition of the X chromosome

10.1101/847558 ◽

2019 ◽

Author(s):

Ryan Bracewell ◽

Anita Tran ◽

Kamalakar Chatla ◽

Doris Bachtrog

Keyword(s):

X Chromosome ◽

Genome Assembly ◽

De Novo ◽

Pericentromeric Region ◽

Species Group ◽

Chromosome 15 ◽

Protein Coding ◽

Protein Coding Genes ◽

Long Read ◽

Chromosome Level

Abstract: The Drosophila obscura species group is one of the most studied clades of Drosophila and harbors multiple distinct karyotypes. Here we present a de novo genome assembly and annotation of D. bifasciata, a species which represents an important subgroup for which no high-quality chromosome-level genome assembly currently exists. We combined long-read sequencing (Nanopore) and Hi-C scaffolding to achieve a highly contiguous genome assembly approximately 193 Mb in size, with repetitive elements constituting 30.1% of the total length. Drosophila bifasciata harbors four large metacentric chromosomes and the small dot, and our assembly contains each chromosome in a single scaffold, including the highly repetitive pericentromeres, which were largely composed of Jockey and Gypsy transposable elements. We annotated a total of 12,821 protein-coding genes, and comparisons of synteny with D. athabasca orthologs show that the large metacentric pericentromeric regions of multiple chromosomes are conserved between these species. Importantly, Muller A (X chromosome) was found to be metacentric in D. bifasciata and the pericentromeric region appears homologous to the pericentromeric region of the fused Muller A-AD (XL and XR) of pseudoobscura/affinis subgroup species. Our finding suggests that a metacentric ancestral X fused to a telocentric Muller D, creating the large neo-X (Muller A-AD) chromosome ∼15 MYA. We also confirm the fusion of Muller C and D in D. bifasciata and show that it likely involved a centromere-centromere fusion.
