Integrating Hi-C links with assembly graphs for chromosome-scale assembly

AbstractLong-read sequencing and novel long-range assays have revolutionized de novo genome assembly by automating the reconstruction of reference-quality genomes. In particular, Hi-C sequencing is becoming an economical method for generating chromosome-scale scaffolds. Despite its increasing popularity, there are limited open-source tools available. Errors, particularly inversions and fusions across chromosomes, remain higher than alternate scaffolding technologies. We present a novel open-source Hi-C scaffolder that does not require an a priori estimate of chromosome number and minimizes errors by scaffolding with the assistance of an assembly graph. We demonstrate higher accuracy than the state-of-the-art methods across a variety of Hi-C library preparations and input assembly sizes. The Python and C++ code for our method is openly available at https://github.com/machinegun/SALSAAuthor summaryHi-C technology was originally proposed to study the 3D organization of a genome. Recently, it has also been applied to assemble large eukaryotic genomes into chromosome-scale scaffolds. Despite this, there are few open source methods to generate these assemblies. Existing methods are also prone to small inversion errors due to noise in the Hi-C data. In this work, we address these challenges and develop a method, named SALSA2. SALSA2 uses sequence overlap information from an assembly graph to correct inversion errors and provide accurate chromosome-scale assemblies.

Download Full-text

Efficient iterative Hi-C scaffolder based on N-best neighbors

BMC Bioinformatics ◽

10.1186/s12859-021-04453-5 ◽

2021 ◽

Vol 22 (1) ◽

Author(s):

Dengfeng Guan ◽

Shane A. McCarthy ◽

Zemin Ning ◽

Guohua Wang ◽

Yadong Wang ◽

...

Keyword(s):

De Novo ◽

A Priori ◽

Sequencing Technology ◽

Current Standard ◽

A Genome ◽

Eukaryotic Species ◽

Long Read ◽

Reference Quality ◽

Comparable Accuracy ◽

Chromosomal Profile

Abstract Background Efficient and effective genome scaffolding tools are still in high demand for generating reference-quality assemblies. While long read data itself is unlikely to create a chromosome-scale assembly for most eukaryotic species, the inexpensive Hi-C sequencing technology, capable of capturing the chromosomal profile of a genome, is now widely used to complete the task. However, the existing Hi-C based scaffolding tools either require a priori chromosome number as input, or lack the ability to build highly continuous scaffolds. Results We design and develop a novel Hi-C based scaffolding tool, pin_hic, which takes advantage of contact information from Hi-C reads to construct a scaffolding graph iteratively based on N-best neighbors of contigs. Subsequent to scaffolding, it identifies potential misjoins and breaks them to keep the scaffolding accuracy. Through our tests on three long read based de novo assemblies from three different species, we demonstrate that pin_hic is more efficient than current standard state-of-art tools, and it can generate much more continuous scaffolds, while achieving a higher or comparable accuracy. Conclusions Pin_hic is an efficient Hi-C based scaffolding tool, which can be useful for building chromosome-scale assemblies. As many sequencing projects have been launched in the recent years, we believe pin_hic has potential to be applied in these projects and makes a meaningful contribution.

Download Full-text

Genotyping structural variants in pangenome graphs using the vg toolkit

10.1101/654566 ◽

2019 ◽

Cited By ~ 7

Author(s):

Glenn Hickey ◽

David Heller ◽

Jean Monlong ◽

Jonas A. Sibbesen ◽

Jouni Sirén ◽

...

Keyword(s):

De Novo ◽

State Of The Art ◽

Effective Means ◽

Point Mutations ◽

Structural Variants ◽

Short Read ◽

Yeast Strains ◽

Sequencing Studies ◽

Long Read

AbstractStructural variants (SVs) remain challenging to represent and study relative to point mutations despite their demonstrated importance. We show that variation graphs, as implemented in the vg toolkit, provide an effective means for leveraging SV catalogs for short-read SV genotyping experiments. We benchmarked vg against state-of-the-art SV genotypers using three sequence-resolved SV catalogs generated by recent long-read sequencing studies. In addition, we use assemblies from 12 yeast strains to show that graphs constructed directly from aligned de novo assemblies improve genotyping compared to graphs built from intermediate SV catalogs in the VCF format.

Download Full-text

Highly Contiguous Nanopore Genome Assembly of Chlamydomonas reinhardtii CC-1690

Microbiology Resource Announcements ◽

10.1128/mra.00726-20 ◽

2020 ◽

Vol 9 (37) ◽

Author(s):

Samuel O’Donnell ◽

Frederic Chaux ◽

Gilles Fischer

Keyword(s):

Chlamydomonas Reinhardtii ◽

Genome Size ◽

Genome Assembly ◽

Reference Genome ◽

De Novo ◽

Content Type ◽

Oxford Nanopore ◽

A Genome ◽

Reference Quality

ABSTRACT The current Chlamydomonas reinhardtii reference genome remains fragmented due to gaps stemming from large repetitive regions. To overcome the vast majority of these gaps, publicly available Oxford Nanopore Technology data were used to create a new reference-quality de novo genome assembly containing only 21 contigs, 30/34 telomeric ends, and a genome size of 111 Mb.

Download Full-text

A chromosome-scale assembly of the sorghum genome using nanopore sequencing and optical mapping

10.1101/327817 ◽

2018 ◽

Cited By ~ 2

Author(s):

Stáphane Deschamps ◽

Yun Zhang ◽

Victor Llaca ◽

Liang Ye ◽

Gregory May ◽

...

Keyword(s):

Sorghum Bicolor ◽

De Novo ◽

Hybrid Assembly ◽

The Public ◽

Sequencing Technologies ◽

Oxford Nanopore ◽

Long Read ◽

Optical Maps ◽

Eukaryotic Genomes ◽

Direct Label

The advent of long-read sequencing technologies has greatly facilitated assemblies of large eukaryotic genomes. In this paper, Oxford Nanopore sequences generated on a MinION sequencer were combined with BioNano Genomics Direct Label and Stain (DLS) optical maps to generate a chromosome-scale de novo assembly of the repeat-rich Sorghum bicolor Tx430 genome. The final hybrid assembly consists of 29 scaffolds, encompassing in most cases entire chromosome arms. It has a scaffold N50 value of 33.28Mbps and covers >90% of Sorghum bicolor expected genome length. A sequence accuracy of 99.67% was obtained in unique regions after aligning contigs against Illumina Tx430 data. Alignments showed that 99.4% of the 34,211 public gene models are present in the assembly, including 94.2% mapping end-to-end. Comparisons of the DLS optical maps against the public Sorghum Bicolor v3.0.1 BTx623 genome assembly suggest the presence of substantial genomic rearrangements whose origin remains to be determined.

Download Full-text

Purge Haplotigs: Synteny Reduction for Third-gen Diploid Genome Assemblies

10.1101/286252 ◽

2018 ◽

Cited By ~ 7

Author(s):

Michael J Roach ◽

Simon Schmidt ◽

Anthony R Borneman

Keyword(s):

De Novo ◽

Haplotype Reconstruction ◽

Minimal Impact ◽

Variant Discovery ◽

Rapid Release ◽

Long Read ◽

Recent Developments ◽

Reference Quality ◽

Downstream Analysis ◽

Genome Assemblies

AbstractRecent developments in third-gen long read sequencing and diploid-aware assemblers have resulted in the rapid release of numerous reference-quality assemblies for diploid genomes. However, assembling highly heterozygous genomes is still facing a major problem where the two haplotypes for a region are highly polymorphic and the synteny is not recognised during assembly. This causes issues with downstream analysis, for example variant discovery using the haploid assembly, or haplotype reconstruction using the diploid assembly. A new pipeline—Purge Haplotigs—was developed specifically for third-gen assemblies to identify and reassign the duplicate contigs. The pipeline takes a draft haplotype-fused assembly or a diploid assembly, and read alignments to produce an improved assembly. The pipeline was tested on a simulated dataset and on four recent diploid (phased) de novo assemblies from third-generation long-read sequencing. All assemblies after processing with Purge Haplotigs were less duplicated with minimal impact on genome completeness. The software is available at https://bitbucket.org/mroachawri/purge_haplotigs under a permissive MIT licence.

Download Full-text

The Beginning of the End: A Chromosomal Assembly of the New World Malaria Mosquito Ends with a Novel Telomere

G3 Genes|Genome|Genetics ◽

10.1534/g3.120.401654 ◽

2020 ◽

Vol 10 (10) ◽

pp. 3811-3819 ◽

Cited By ~ 3

Author(s):

Austin Compton ◽

Jiangtao Liang ◽

Chujia Chen ◽

Varvara Lukyanchikova ◽

Yumin Qi ◽

...

Keyword(s):

New World ◽

Tandem Repeats ◽

De Novo ◽

Mosquito Species ◽

Repeat Unit ◽

Repetitive Sequences ◽

Malaria Mosquito ◽

A Genome ◽

Significant Step ◽

Reference Quality

Chromosome level assemblies are accumulating in various taxonomic groups including mosquitoes. However, even in the few reference-quality mosquito assemblies, a significant portion of the heterochromatic regions including telomeres remain unresolved. Here we produce a de novo assembly of the New World malaria mosquito, Anopheles albimanus by integrating Oxford Nanopore sequencing, Illumina, Hi-C and optical mapping. This 172.6 Mbps female assembly, which we call AalbS3, is obtained by scaffolding polished large contigs (contig N50 = 13.7 Mbps) into three chromosomes. All chromosome arms end with telomeric repeats, which is the first in mosquito assemblies and represents a significant step toward the completion of a genome assembly. These telomeres consist of tandem repeats of a novel 30-32 bp Telomeric Repeat Unit (TRU) and are confirmed by analyzing the termini of long reads and through both chromosomal in situ hybridization and a Bal31 sensitivity assay. The AalbS3 assembly included previously uncharacterized centromeric and rDNA clusters and more than doubled the content of transposable elements and other repetitive sequences. This telomere-to-telomere assembly, although still containing gaps, represents a significant step toward resolving biologically important but previously hidden genomic components. The comparison of different scaffolding methods will also inform future efforts to obtain reference-quality genomes for other mosquito species.

Download Full-text

Metassembler: Merging and optimizing de novo genome assemblies

10.1101/016352 ◽

2015 ◽

Author(s):

Alejandro Hernandez Wences ◽

Michael Schatz

Keyword(s):

Open Source ◽

Genome Assembly ◽

De Novo ◽

A Genome ◽

Genome Assemblies ◽

Multiple Algorithms

Genome assembly projects typically run multiple algorithms in an attempt to find the single best assembly, although those assemblies often have complementary, if untapped, strengths and weaknesses. We present our metassembler algorithm that merges multiple assemblies of a genome into a single superior sequence. We apply it to the four genomes from the Assemblathon competitions and show it consistently and substantially improves the contiguity and quality of each assembly. We also develop guidelines for metassembly by systematically evaluating 120 permutations of merging the top 5 assemblies of the first Assemblathon competition. The software is open-source at http://metassembler.sourceforge.net.

Download Full-text

Ultra-low input single tube linked-read library method enables short-read NGS systems to generate highly accurate and economical long-range sequencing information for de novo genome assembly and haplotype phasing

10.1101/852947 ◽

2019 ◽

Cited By ~ 3

Author(s):

Zhoutao Chen ◽

Long Pham ◽

Tsai-Chin Wu ◽

Guoya Mo ◽

Yu Xia ◽

...

Keyword(s):

Long Range ◽

De Novo ◽

Low Cost ◽

Cost Effective ◽

De Novo Genome Assembly ◽

Short Read ◽

Single Tube ◽

Haplotype Phasing ◽

A Genome ◽

Long Read

AbstractLong-range sequencing information is required for haplotype phasing, de novo assembly and structural variation detection. Current long-read sequencing technologies can provide valuable long-range information but at a high cost with low accuracy and high DNA input requirement. We have developed a single-tube Transposase Enzyme Linked Long-read Sequencing (TELL-Seq™) technology, which enables a low-cost, high-accuracy and high-throughput short-read next generation sequencer to routinely generate over 100 Kb long-range sequencing information with as little as 0.1 ng input material. In a PCR tube, millions of clonally barcoded beads are used to uniquely barcode long DNA molecules in an open bulk reaction without dilution and compartmentation. The barcode linked reads are used to successfully assemble genomes ranging from microbes to human. These linked-reads also generate mega-base-long phased blocks and provide a cost-effective tool for detecting structural variants in a genome, which are important to identify compound heterozygosity in recessive Mendelian diseases and discover genetic drivers and diagnostic biomarkers in cancers.

Download Full-text

MOSGA: Modular Open-Source Genome Annotator

Bioinformatics ◽

10.1093/bioinformatics/btaa1003 ◽

2020 ◽

Author(s):

Roman Martin ◽

Thomas Hackl ◽

Georges Hattab ◽

Matthias G Fischer ◽

Dominik Heider

Keyword(s):

Open Source ◽

Source Code ◽

Supplementary Information ◽

Web Interface ◽

Fully Integrated ◽

Sequencing Technologies ◽

A Genome ◽

Wide Range ◽

User Friendly ◽

Eukaryotic Genomes

Abstract Motivation The generation of high-quality assemblies, even for large eukaryotic genomes, has become a routine task for many biologists thanks to recent advances in sequencing technologies. However, the annotation of these assemblies—a crucial step toward unlocking the biology of the organism of interest—has remained a complex challenge that often requires advanced bioinformatics expertise. Results Here, we present MOSGA (Modular Open-Source Genome Annotator), a genome annotation framework for eukaryotic genomes with a user-friendly web-interface that generates and integrates annotations from various tools. The aggregated results can be analyzed with a fully integrated genome browser and are provided in a format ready for submission to NCBI. MOSGA is built on a portable, customizable and easily extendible Snakemake backend, and thus, can be tailored to a wide range of users and projects. Availability and implementation We provide MOSGA as a web service at https://mosga.mathematik.uni-marburg.de and as a docker container at registry.gitlab.com/mosga/mosga: latest. Source code can be found at https://gitlab.com/mosga/mosga Contact [email protected] Supplementary information Supplementary data are available at Bioinformatics online.

Download Full-text

De Novo Genome Sequence Assemblies of Gossypium raimondii and Gossypium turneri

G3 Genes|Genome|Genetics ◽

10.1534/g3.119.400392 ◽

2019 ◽

Vol 9 (10) ◽

pp. 3079-3085 ◽

Cited By ~ 8

Author(s):

Joshua A. Udall ◽

Evan Long ◽

Chris Hanson ◽

Daojun Yuan ◽

Thiruvarangan Ramaraj ◽

...

Keyword(s):

Genome Sequence ◽

De Novo ◽

Close Relative ◽

Sequencing Data ◽

Genome Sequences ◽

D Genome ◽

Cotton Species ◽

A Genome ◽

Long Read ◽

Gossypium Raimondii

Cotton is an agriculturally important crop. Because of its importance, a genome sequence of a diploid cotton species (Gossypium raimondii, D-genome) was first assembled using Sanger sequencing data in 2012. Improvements to DNA sequencing technology have improved accuracy and correctness of assembled genome sequences. Here we report a new de novo genome assembly of G. raimondii and its close relative G. turneri. The two genomes were assembled to a chromosome level using PacBio long-read technology, HiC, and Bionano optical mapping. This report corrects some minor assembly errors found in the Sanger assembly of G. raimondii. We also compare the genome sequences of these two species for gene composition, repetitive element composition, and collinearity. Most of the identified structural rearrangements between these two species are due to intra-chromosomal inversions. More inversions were found in the G. turneri genome sequence than the G. raimondii genome sequence. These findings and updates to the D-genome sequence will improve accuracy and translation of genomics to cotton breeding and genetics.

Download Full-text