LazyB: fast and cheap genome assembly

Abstract Background Advances in genome sequencing over the last years have led to a fundamental paradigm shift in the field. With steadily decreasing sequencing costs, genome projects are no longer limited by the cost of raw sequencing data, but rather by computational problems associated with genome assembly. There is an urgent demand for more efficient and more accurate methods, in particular with regard to the highly complex and often very large genomes of animals and plants. Most recently, “hybrid” methods that integrate short and long read data have been devised to address this need. Results LazyB is such a hybrid genome assembler. It has been designed specifically with an emphasis on utilizing low-coverage short and long reads. LazyB starts from a bipartite overlap graph between long reads and restrictively filtered short-read unitigs. This graph is translated into a long-read overlap graph G. Instead of the more conventional approach of removing tips, bubbles, and other local features, LazyB extracts, step by step, subgraphs whose global properties approach a disjoint union of paths. First, a consistently oriented subgraph is extracted, which in a second step is reduced to a directed acyclic graph. In the next step, properties of proper interval graphs are used to extract contigs as maximum weight paths. These paths are translated into genomic sequences only in the final step. A prototype implementation of LazyB, entirely written in python, not only yields significantly more accurate assemblies of the yeast and fruit fly genomes compared to state-of-the-art pipelines but also requires much less computational effort. Conclusions LazyB is a new low-cost genome assembler that copes well with large genomes and low coverage. It is based on a novel approach for reducing the overlap graph to a collection of paths, thus opening new avenues for future improvements. Availability The prototype is available at https://github.com/TGatter/LazyB.
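
The contig-extraction step described in the abstract rests on finding maximum weight paths in a directed acyclic graph. The sketch below shows only the generic dynamic-programming technique over a topological order; it is not taken from the LazyB code base. The function name `max_weight_path`, the edge-triple input format, and the assumption of non-negative edge weights are all illustrative choices, and the interval-graph heuristics mentioned in the abstract are not modelled.

```python
from collections import defaultdict


def max_weight_path(edges):
    """Return (weight, path) of a heaviest path in a DAG.

    `edges` is an iterable of (u, v, w) triples with non-negative weights w.
    Hypothetical helper for illustration; not part of LazyB.
    """
    succ = defaultdict(list)
    indeg = defaultdict(int)
    nodes = set()
    for u, v, w in edges:
        succ[u].append((v, w))
        indeg[v] += 1
        nodes.update((u, v))
    if not nodes:
        return 0, []

    # Kahn's algorithm: repeatedly emit nodes whose predecessors are all done.
    order = []
    ready = [n for n in nodes if indeg[n] == 0]
    while ready:
        u = ready.pop()
        order.append(u)
        for v, _ in succ[u]:
            indeg[v] -= 1
            if indeg[v] == 0:
                ready.append(v)
    if len(order) != len(nodes):
        raise ValueError("input graph contains a cycle")

    # best[n] is the weight of the heaviest path ending in n.
    best = {n: 0 for n in nodes}
    prev = {n: None for n in nodes}
    for u in order:
        for v, w in succ[u]:
            if best[u] + w > best[v]:
                best[v] = best[u] + w
                prev[v] = u

    # Trace back from the node with the largest accumulated weight.
    end = max(nodes, key=lambda n: best[n])
    weight = best[end]
    path = []
    while end is not None:
        path.append(end)
        end = prev[end]
    return weight, path[::-1]


# Toy overlap DAG between four long reads (weights are made up).
toy_edges = [("r1", "r2", 5), ("r2", "r3", 4), ("r1", "r3", 6), ("r3", "r4", 2)]
toy_weight, toy_path = max_weight_path(toy_edges)
```

To obtain several contigs one could, for example, remove the nodes of the reported path and repeat; whether the assembler proceeds this way is not stated in the abstract.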

LRSDAY: Long-read Sequencing Data Analysis for Yeasts

10.1101/184572 ◽

2017 ◽

Author(s):

Jia-Xing Yue ◽

Gianni Liti

Keyword(s):

Genome Assembly ◽

Model Organism ◽

Sequencing Data ◽

Protein Coding ◽

Sequencing Technologies ◽

Long Reads ◽

Long Read ◽

Downstream Analysis ◽

Eukaryotic Organisms ◽

Genomic Regions

Abstract Long-read sequencing technologies have become increasingly popular in genome projects due to their strengths in resolving complex genomic regions. As a leading model organism with small genome size and great biotechnological importance, the budding yeast, Saccharomyces cerevisiae, has many isolates currently being sequenced with long reads. However, analyzing long-read sequencing data to produce high-quality genome assembly and annotation remains challenging. Here we present LRSDAY, the first one-stop solution to streamline this process. LRSDAY can produce chromosome-level end-to-end genome assembly and comprehensive annotations for various genomic features (including centromeres, protein-coding genes, tRNAs, transposable elements and telomere-associated elements) that are ready for downstream analysis. Although tailored for S. cerevisiae, we designed LRSDAY to be highly modular and customizable, making it adaptable for virtually any eukaryotic organism. Applying LRSDAY to an S. cerevisiae strain takes ∼43 hrs to generate a complete and well-annotated genome from ∼100X Pacific Biosciences (PacBio) reads using four threads.

ESCA pipeline: Easy-to-use SARS-CoV-2 genome Assembler

10.1101/2021.05.21.445156 ◽

2021 ◽

Author(s):

Martina Rueca ◽

Emanuela Giombini ◽

Francesco Messina ◽

Barbara Bartolini ◽

Antonino Di Caro ◽

...

Keyword(s):

Amino Acid ◽

Genome Assembly ◽

Global Level ◽

Sequencing Data ◽

High Quality ◽

Rapid Succession ◽

Novel Variants ◽

Low Coverage ◽

High Quality Genome ◽

Genome Assembler

Early sequencing and quick analysis of the SARS-CoV-2 genome are contributing to understanding the dynamics of the COVID-19 epidemic and to the design of countermeasures at the global level. Amplicon-based NGS methods are widely used to sequence the SARS-CoV-2 genome and to identify novel variants that are emerging in rapid succession, harboring multiple deletions and amino acid changing mutations. To facilitate the analysis of NGS sequencing data obtained from amplicon-based sequencing methods, here we propose an easy-to-use SARS-CoV-2 genome Assembler: the ESCA pipeline. Results showed that ESCA can perform high quality genome assembly from Ion Torrent and Illumina raw data, and helps the user easily correct low-coverage regions. Moreover, ESCA includes the possibility to compare assembled genomes of multi-sample runs through an easy table format.

Raven: a de novo genome assembler for long reads

10.1101/2020.08.07.242461 ◽

2020 ◽

Cited By ~ 5

Author(s):

Robert Vaser ◽

Mile Šikić

Keyword(s):

Human Genome ◽

Genome Assembly ◽

De Novo ◽

De Novo Genome Assembly ◽

New Methods ◽

Long Reads ◽

Long Read ◽

Comparable Accuracy ◽

Genome Assembler ◽

Genome Dataset

We present new methods for the improvement of long-read de novo genome assembly incorporated into a straightforward tool called Raven (https://github.com/lbcb-sci/raven). Compared with other assemblers, Raven is one of the two fastest, reconstructs the sequenced genome in the fewest fragments, has better or comparable accuracy, and maintains similar performance for various genomes. Raven takes 500 CPU hours to assemble a 44x human genome dataset in only 259 fragments.

GALA: gap-free chromosome-scale assembly with long reads

10.1101/2020.05.15.097428 ◽

2020 ◽

Author(s):

Mohamed Awad ◽

Xiangchao Gan

Keyword(s):

Genome Assembly ◽

Reference Genome ◽

De Novo ◽

Genetic Maps ◽

Sequencing Data ◽

C Elegans ◽

Assembly Method ◽

Long Reads ◽

Long Read ◽

Assembly Technology

Abstract High-quality genome assembly has wide applications in genetics and medical studies. However, it is still very challenging to achieve gap-free chromosome-scale assemblies using current workflows for long-read platforms. Here we propose GALA (Gap-free long-read assembler), a chromosome-by-chromosome assembly method implemented through a multi-layer computer graph that identifies mis-assemblies within preliminary assemblies or chimeric raw reads and partitions the data into chromosome-scale linkage groups. The subsequent independent assembly of each linkage group generates a gap-free assembly free from the mis-assembly errors which usually hamper existing workflows. This flexible framework also allows us to integrate data from various technologies, such as Hi-C, genetic maps, a reference genome and even motif analyses, to generate gap-free chromosome-scale assemblies. We de novo assembled the C. elegans and A. thaliana genomes using combined PacBio and Nanopore sequencing data from publicly available datasets. We also demonstrated the new method’s applicability with a gap-free assembly of a human genome with the help of a reference genome. In addition, GALA showed promising performance for PacBio high-fidelity long reads. Thus, our method enables straightforward assembly of genomes with multiple data sources and overcomes barriers that at present restrict the application of de novo genome assembly technology.

Economic Genome Assembly from Low Coverage Illumina and Nanopore Data

10.1101/2020.02.07.939454 ◽

2020 ◽

Author(s):

Thomas Gatter ◽

Sarah von Löhneysen ◽

Polina Drozdova ◽

Tom Hartmann ◽

Peter F. Stadler

Keyword(s):

Genomic Sequence ◽

State Of The Art ◽

Fruit Fly ◽

Computational Effort ◽

Maximum Weight ◽

New Approach ◽

Short Read ◽

Long Reads ◽

Long Read ◽

Low Coverage

Abstract We describe a new approach to assemble genomes from a combination of low-coverage short and long reads. LazyBastard starts from a bipartite overlap graph between long reads and restrictively filtered short-read unitigs, which is then reduced to a long-read overlap graph G. Edges are removed from G to obtain first a consistent orientation and then a DAG. Using heuristics based on properties of proper interval graphs, contigs are extracted as maximum weight paths. These are translated into genomic sequence only in the final step. A prototype implementation of LazyBastard, entirely written in python, not only yields significantly more accurate assemblies of the yeast and fruit fly genomes compared to state-of-the-art pipelines but also requires much less computational effort. Funding: RSF / Helmholtz Association 18-44-06201; Deutscher Akademischer Austauschdienst; DFG STA 850/19-2 within SPP 1738; German Federal Ministry of Education and Research 031A538A, de.NBI-RBC
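
The reduction from a bipartite read/unitig graph to a long-read overlap graph can be pictured as a graph projection: two long reads become adjacent when they are anchored on at least one common unitig. The following is a minimal sketch of that idea only. The function name `project_to_read_graph`, the `(read, unitig)` pair input, the shared-unitig count used as edge weight, and the `min_shared` threshold are assumptions made for illustration; the abstract does not specify how LazyBastard weights or filters these edges.

```python
from collections import defaultdict
from itertools import combinations


def project_to_read_graph(read_unitig_hits, min_shared=1):
    """Project a bipartite (long read, unitig) graph onto the long reads.

    `read_unitig_hits` is an iterable of (read_id, unitig_id) pairs.
    Returns {(read_a, read_b): number_of_shared_unitigs} with read_a < read_b.
    Hypothetical helper; the weighting scheme is an assumption.
    """
    unitig_to_reads = defaultdict(set)
    for read, unitig in read_unitig_hits:
        unitig_to_reads[unitig].add(read)

    weights = defaultdict(int)
    for reads in unitig_to_reads.values():
        # Every pair of reads anchored on the same unitig gains one unit of support.
        for a, b in combinations(sorted(reads), 2):
            weights[(a, b)] += 1

    return {pair: w for pair, w in weights.items() if w >= min_shared}


# Made-up anchoring: L1 and L2 share two unitigs, L2 and L3 share one.
toy_hits = [
    ("L1", "u1"), ("L2", "u1"),
    ("L1", "u2"), ("L2", "u2"),
    ("L2", "u3"), ("L3", "u3"),
]
toy_graph = project_to_read_graph(toy_hits)
toy_graph_strict = project_to_read_graph(toy_hits, min_shared=2)
```

The projection is undirected and ignores read orientation; the orientation and DAG steps named in the abstract would have to operate on additional alignment information that this sketch leaves out.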

SLR-superscaffolder: a de novo scaffolding tool for synthetic long reads using a top-to-bottom scheme

BMC Bioinformatics ◽

10.1186/s12859-021-04081-z ◽

2021 ◽

Vol 22 (1) ◽

Author(s):

Lidong Guo ◽

Mengyang Xu ◽

Wenchao Wang ◽

Shengqiang Gu ◽

Xia Zhao ◽

...

Keyword(s):

High Efficiency ◽

De Novo ◽

Next Generation Sequencing Data ◽

Sequencing Data ◽

Draft Assembly ◽

Screening Algorithm ◽

Long Reads ◽

Hybrid Genome ◽

Genomics Research ◽

Negative Effect

Abstract Background Synthetic long reads (SLR) with long-range co-barcoding information are now widely applied in genomics research. Although several tools have been developed for each specific SLR technique, a robust standalone scaffolder with high efficiency is warranted for hybrid genome assembly. Results In this work, we developed a standalone scaffolding tool, SLR-superscaffolder, to link together contigs in draft assemblies using co-barcoding and paired-end read information. Our top-to-bottom scheme first builds a global scaffold graph based on Jaccard similarity to determine the order and orientation of contigs, and then locally improves the scaffolds with the aid of paired-end information. We also exploited a screening algorithm to reduce the negative effect of misassembled contigs in the input assembly. We applied SLR-superscaffolder to a human single tube long fragment read sequencing dataset and increased the scaffold NG50 of its corresponding draft assembly 1349-fold. Moreover, benchmarking on different input contigs showed that this approach overall outperformed existing SLR scaffolders, providing longer contiguity and fewer misassemblies, especially for short contigs assembled from next-generation sequencing data. The open-source code of SLR-superscaffolder is available at https://github.com/BGI-Qingdao/SLR-superscaffolder. Conclusions SLR-superscaffolder can dramatically improve the contiguity of a draft assembly by integrating a hybrid assembly strategy.
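
The Jaccard similarity underlying the scaffold graph compares the sets of barcodes observed on two contigs: the size of the intersection divided by the size of the union. The sketch below illustrates that measure and a simple thresholded pairing. The names `jaccard` and `candidate_links`, the dictionary input, and the `threshold` value are hypothetical; how SLR-superscaffolder bins contigs and selects edges is not described in the abstract.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard similarity |A ∩ B| / |A ∪ B| of two collections, 0.0 if both are empty."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def candidate_links(barcodes_by_contig, threshold=0.3):
    """Return [(similarity, contig_a, contig_b)] above `threshold`, best first.

    `barcodes_by_contig` maps a contig id to the barcodes seen on it.
    Hypothetical helper; the threshold is an arbitrary illustrative value.
    """
    links = []
    for a, b in combinations(sorted(barcodes_by_contig), 2):
        s = jaccard(barcodes_by_contig[a], barcodes_by_contig[b])
        if s >= threshold:
            links.append((s, a, b))
    return sorted(links, reverse=True)


# Made-up barcode sets for three contigs.
toy_barcodes = {
    "c1": {"b1", "b2", "b3"},
    "c2": {"b2", "b3", "b4"},
    "c3": {"b9"},
}
toy_links = candidate_links(toy_barcodes)
```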

Comprehensive evaluation of non-hybrid genome assembly tools for third-generation PacBio long-read sequence data

Briefings in Bioinformatics ◽

10.1093/bib/bbx147 ◽

2017 ◽

Vol 20 (3) ◽

pp. 866-876 ◽

Cited By ~ 30

Author(s):

Vasanthan Jayakumar ◽

Yasubumi Sakakibara

Keyword(s):

Genome Assembly ◽

Comprehensive Evaluation ◽

Sequence Data ◽

Third Generation ◽

Hybrid Genome ◽

Long Read

Rapid Mycobacterium tuberculosis spoligotyping from uncorrected long reads using Galru

10.1101/2020.05.31.126490 ◽

2020 ◽

Author(s):

Andrew J. Page ◽

Nabil-Fareed Alikhan ◽

Michael Strinden ◽

Thanh Le Viet ◽

Timofey Skvortsov

Keyword(s):

Mycobacterium Tuberculosis ◽

State Of The Art ◽

Sequence Data ◽

Human Pathogen ◽

Sequencing Data ◽

Short Read ◽

Short Read Sequencing ◽

Long Reads ◽

Long Read

Abstract Spoligotyping of Mycobacterium tuberculosis provides a subspecies classification of this major human pathogen. Spoligotypes can be predicted from short read genome sequencing data; however, no methods exist for long read sequence data such as from Nanopore or PacBio. We present a novel software package, Galru, which can rapidly detect the spoligotype of a Mycobacterium tuberculosis sample from as little as a single uncorrected long read. It allows for near real-time spoligotyping from long read data as it is being sequenced, giving rapid sample typing. We compare it to the existing state-of-the-art software and find that its results are identical to those obtained from short read sequencing data. Galru is freely available from https://github.com/quadram-institute-bioscience/galru under the GPLv3 open source licence.

Highly accurate long-read HiFi sequencing data for five complex genomes

Scientific Data ◽

10.1038/s41597-020-00743-4 ◽

2020 ◽

Vol 7 (1) ◽

Author(s):

Ting Hon ◽

Kristin Mars ◽

Greg Young ◽

Yu-Chih Tsai ◽

Joseph W. Karalius ◽

...

Keyword(s):

Sequence Data ◽

Genome Structure ◽

Data Sets ◽

Sequencing Data ◽

Complex Samples ◽

Bioinformatic Tools ◽

Long Reads ◽

Sequencing Method ◽

Sample Data ◽

Long Read

Abstract The PacBio® HiFi sequencing method yields highly accurate long-read sequencing datasets with read lengths averaging 10–25 kb and accuracies greater than 99.5%. These accurate long reads can be used to improve results for complex applications such as single nucleotide and structural variant detection, genome assembly, assembly of difficult polyploid or highly repetitive genomes, and assembly of metagenomes. Currently, there is a need for sample data sets to both evaluate the benefits of these long accurate reads as well as for development of bioinformatic tools including genome assemblers, variant callers, and haplotyping algorithms. We present deep coverage HiFi datasets for five complex samples including the two inbred model genomes Mus musculus and Zea mays, as well as two complex genomes, octoploid Fragaria × ananassa and the diploid anuran Rana muscosa. Additionally, we release sequence data from a mock metagenome community. The datasets reported here can be used without restriction to develop new algorithms and explore complex genome structure and evolution. Data were generated on the PacBio Sequel II System.
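
Read accuracies such as the 99.5% quoted above are often expressed on the Phred scale, Q = -10 · log10(1 - accuracy). The short sketch below converts between the two representations; the function names are illustrative, and the conversion is the standard Phred definition rather than anything specific to the PacBio tooling.

```python
import math


def accuracy_to_phred(accuracy):
    """Convert an accuracy in [0, 1) to a Phred-scaled quality value."""
    if not 0.0 <= accuracy < 1.0:
        raise ValueError("accuracy must lie in [0, 1)")
    return -10.0 * math.log10(1.0 - accuracy)


def phred_to_accuracy(q):
    """Convert a Phred-scaled quality value back to an accuracy."""
    return 1.0 - 10.0 ** (-q / 10.0)


# An accuracy of 99.5% corresponds to roughly Q23.
hifi_q = accuracy_to_phred(0.995)
```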

Fully phased human genome assembly without parental data using single-cell strand sequencing and long reads

Nature Biotechnology ◽

10.1038/s41587-020-0719-5 ◽

2020 ◽

Author(s):

David Porubsky ◽

Peter Ebert ◽

Peter A. Audano ◽

Mitchell R. Vollger ◽

...

Keyword(s):

Single Cell ◽

Genome Assembly ◽

De Novo ◽

Error Rates ◽

Sequencing Data ◽

Single Nucleotide Variants ◽

De Novo Genome Assembly ◽

Parental Data ◽

Human Genome Assembly ◽

Long Read

Abstract Human genomes are typically assembled as consensus sequences that lack information on parental haplotypes. Here we describe a reference-free workflow for diploid de novo genome assembly that combines the chromosome-wide phasing and scaffolding capabilities of single-cell strand sequencing [1,2] with continuous long-read or high-fidelity [3] sequencing data. Employing this strategy, we produced a completely phased de novo genome assembly for each haplotype of an individual of Puerto Rican descent (HG00733) in the absence of parental data. The assemblies are accurate (quality value > 40) and highly contiguous (contig N50 > 23 Mbp) with low switch error rates (0.17%), providing fully phased single-nucleotide variants, indels and structural variants. A comparison of Oxford Nanopore Technologies and Pacific Biosciences phased assemblies identified 154 regions that are preferential sites of contig breaks, irrespective of sequencing technology or phasing algorithms.
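
Contiguity figures such as the contig N50 quoted above follow a simple definition: N50 is the largest length L such that contigs of length at least L together cover at least half of the total assembly length. The sketch below computes it from a list of contig lengths; the function name `n50` and the toy lengths are illustrative and unrelated to the HG00733 assemblies.

```python
def n50(lengths):
    """Return the N50 of a collection of contig lengths (0 for an empty input)."""
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        # The first contig that pushes the running sum to half the total defines N50.
        if running >= half_total:
            return length
    return 0


# Toy contig lengths: total 20, and 6 + 5 = 11 >= 10, so N50 is 5.
toy_n50 = n50([2, 3, 4, 5, 6])
```

NG50, used in some of the other abstracts on this page, follows the same procedure but takes half of the estimated genome size instead of half of the assembly length as the target.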
