Fast and accurate reference-guided scaffolding of draft genomes

Abstract

Background: As the number of new genome assemblies continues to grow, there is increasing demand for methods to coalesce contigs from draft assemblies into pseudomolecules. Most current methods use genetic maps, optical maps, chromatin conformation (Hi-C), or other long-range linking data. However, these data are expensive, and analysis methods often fail to accurately order and orient a high percentage of assembly contigs. Other approaches use alignments to a reference genome for ordering and orienting, but these tools rely on slow aligners and are not robust to repetitive contigs.

Results: We present RaGOO, an open-source reference-guided contig ordering and orienting tool that leverages the speed and sensitivity of Minimap2 to accurately achieve chromosome-scale assemblies in just minutes. With the pseudomolecules constructed, RaGOO identifies structural variants, including those spanning sequencing gaps that are not reported by alternative methods. We show that RaGOO accurately orders and orients contigs into nearly complete chromosomes based on de novo assemblies of Oxford Nanopore long-read sequencing from three wild and domesticated tomato genotypes, including the widely used M82 reference cultivar. We then demonstrate the scalability and utility of RaGOO with a pan-genome analysis of 103 Arabidopsis thaliana accessions by examining the structural variants detected in the newly assembled pseudomolecules. RaGOO is available open-source with an MIT license at https://github.com/malonge/RaGOO.

Conclusions: We demonstrate that with a highly contiguous assembly and a structurally accurate reference genome, reference-guided scaffolding with RaGOO outperforms error-prone reference-free methods and enables rapid pan-genome analysis.


RaGOO: fast and accurate reference-guided scaffolding of draft genomes

Genome Biology ◽

10.1186/s13059-019-1829-6 ◽

2019 ◽

Vol 20 (1) ◽

Cited By ~ 56

Author(s):

Michael Alonge ◽

Sebastian Soyk ◽

Srividya Ramakrishnan ◽

Xingang Wang ◽

Sara Goodwin ◽

...

Keyword(s):

Arabidopsis Thaliana ◽

Open Source ◽

Genome Analysis ◽

De Novo ◽

Structural Variants ◽

Tomato Genome ◽

Pan Genome ◽

Link Type ◽

Genome Assemblies

Abstract We present RaGOO, a reference-guided contig ordering and orienting tool that leverages the speed and sensitivity of Minimap2 to accurately achieve chromosome-scale assemblies in minutes. After the pseudomolecules are constructed, RaGOO identifies structural variants, including those spanning sequencing gaps. We show that RaGOO accurately orders and orients 3 de novo tomato genome assemblies, including the widely used M82 reference cultivar. We then demonstrate the scalability and utility of RaGOO with a pan-genome analysis of 103 Arabidopsis thaliana accessions by examining the structural variants detected in the newly assembled pseudomolecules. RaGOO is available open source at https://github.com/malonge/RaGOO.
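The ordering and orienting step described in this abstract can be illustrated with a minimal Python sketch. It is not RaGOO's actual implementation: the `Aln` record, the `scaffold` function, and the placement rules (assign each contig to the chromosome it covers most, orient by the strand with the most aligned bases, order by the start of the longest alignment) are simplifying assumptions made for illustration only.

```python
from collections import defaultdict, namedtuple

# Hypothetical alignment record: one contig-to-reference alignment block.
Aln = namedtuple("Aln", "contig chrom ref_start ref_end strand")


def scaffold(alignments):
    """Group, orient, and order contigs along reference chromosomes.

    Returns {chrom: [(contig, strand), ...]} sorted by reference position.
    """
    by_contig = defaultdict(list)
    for aln in alignments:
        by_contig[aln.contig].append(aln)

    placements = defaultdict(list)
    for contig, alns in by_contig.items():
        # 1. Assign the contig to the chromosome it covers with the most bases.
        covered = defaultdict(int)
        for aln in alns:
            covered[aln.chrom] += aln.ref_end - aln.ref_start
        chrom = max(covered, key=covered.get)
        on_chrom = [aln for aln in alns if aln.chrom == chrom]

        # 2. Orient by the strand that carries the most aligned bases.
        strand_bases = defaultdict(int)
        for aln in on_chrom:
            strand_bases[aln.strand] += aln.ref_end - aln.ref_start
        strand = max(strand_bases, key=strand_bases.get)

        # 3. Use the start of the longest alignment as the ordering key.
        longest = max(on_chrom, key=lambda a: a.ref_end - a.ref_start)
        placements[chrom].append((longest.ref_start, contig, strand))

    return {
        chrom: [(contig, strand) for _, contig, strand in sorted(items)]
        for chrom, items in placements.items()
    }


example = [
    Aln("ctgA", "chr1", 5000, 9000, "-"),
    Aln("ctgA", "chr2", 100, 600, "+"),
    Aln("ctgB", "chr1", 0, 4000, "+"),
    Aln("ctgC", "chr2", 1000, 3000, "+"),
    Aln("ctgC", "chr2", 3000, 3500, "-"),
]
layout = scaffold(example)
```

In practice the alignment records would come from a whole-genome aligner such as Minimap2, and a real tool would also need to handle repetitive or chimeric contigs, which this sketch ignores.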


GALA: gap-free chromosome-scale assembly with long reads

10.1101/2020.05.15.097428 ◽

2020 ◽

Author(s):

Mohamed Awad ◽

Xiangchao Gan

Keyword(s):

Genome Assembly ◽

Reference Genome ◽

De Novo ◽

Genetic Maps ◽

Sequencing Data ◽

C Elegans ◽

Assembly Method ◽

Long Reads ◽

Long Read ◽

Assembly Technology

Abstract High-quality genome assembly has wide applications in genetics and medical studies. However, it is still very challenging to achieve gap-free chromosome-scale assemblies using current workflows for long-read platforms. Here we propose GALA (Gap-free long-read assembler), a chromosome-by-chromosome assembly method implemented through a multi-layer computer graph that identifies mis-assemblies within preliminary assemblies or chimeric raw reads and partitions the data into chromosome-scale linkage groups. The subsequent independent assembly of each linkage group generates a gap-free assembly free from the mis-assembly errors which usually hamper existing workflows. This flexible framework also allows us to integrate data from various technologies, such as Hi-C, genetic maps, a reference genome and even motif analyses, to generate gap-free chromosome-scale assemblies. We de novo assembled the C. elegans and A. thaliana genomes using combined PacBio and Nanopore sequencing data from publicly available datasets. We also demonstrated the new method's applicability with a gap-free assembly of a human genome with the help of a reference genome. In addition, GALA showed promising performance for PacBio high-fidelity long reads. Thus, our method enables straightforward assembly of genomes with multiple data sources and overcomes barriers that at present restrict the application of de novo genome assembly technology.


svviz: a read viewer for validating structural variants

10.1101/016063 ◽

2015 ◽

Cited By ~ 1

Author(s):

Noah Spies ◽

Justin M Zook ◽

Marc Salit ◽

Arend Sidow

Keyword(s):

Open Source ◽

High Throughput Sequencing ◽

Reference Genome ◽

Structural Variants ◽

Insert Size ◽

Reference Allele ◽

Sequencing Platform ◽

Mate Pair ◽

Oxford Nanopore ◽

Long Read

Visualizing read alignments is the most effective way to validate candidate SVs with existing data. We present svviz, a sequencing read visualizer for structural variants (SVs) that sorts and displays only reads relevant to a candidate SV. svviz works by searching the input BAM file(s) for potentially relevant reads, realigning them against the inferred sequence of the putative variant allele as well as the reference allele, and identifying reads that match one allele better than the other. Reads are assigned to the proper allele based on alignment score, read pair orientation and insert size. Separate views of the two alleles are then displayed in a scrollable web browser view, enabling a more intuitive visualization of each allele than the single reference genome-based view common to most current read browsers. The web view facilitates examining the evidence for or against a putative variant, estimating zygosity, visualizing affected genomic annotations, and manual refinement of breakpoints. An optional command-line-only interface allows summary statistics and graphics to be exported directly to standard graphics file formats. svviz requires as input only structural variant coordinates (called using any other software package), reads in BAM format, and a reference genome. Reads from any high-throughput sequencing platform are supported, including Illumina short-read, mate-pair, synthetic long-read (assembled), Pacific Biosciences, and Oxford Nanopore. svviz is open source and freely available from https://github.com/svviz/svviz.
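The allele-assignment idea in this abstract can be sketched in a few lines of Python. This is an illustrative simplification, not svviz's code: it uses only the alignment score, whereas the abstract says svviz also considers read pair orientation and insert size. The function names and the `min_margin` threshold are hypothetical.

```python
def assign_allele(ref_score, alt_score, min_margin=1.0):
    """Assign a read to the allele it aligns to better.

    min_margin is a hypothetical threshold: score differences smaller
    than this are treated as uninformative.
    """
    diff = alt_score - ref_score
    if diff >= min_margin:
        return "alt"
    if -diff >= min_margin:
        return "ref"
    return "ambiguous"


def tally_support(read_scores, min_margin=1.0):
    """Count reads supporting each allele.

    read_scores maps read name -> (ref_score, alt_score).
    """
    counts = {"ref": 0, "alt": 0, "ambiguous": 0}
    for ref_score, alt_score in read_scores.values():
        counts[assign_allele(ref_score, alt_score, min_margin)] += 1
    return counts


support = tally_support({
    "read1": (50, 80),  # aligns better to the variant allele
    "read2": (90, 60),  # aligns better to the reference allele
    "read3": (70, 70),  # equally good on both: uninformative
})
```

The ratio of `alt` to `ref` counts is the kind of evidence a reviewer would inspect when judging a candidate variant or estimating zygosity.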


Genotyping structural variants in pangenome graphs using the vg toolkit

10.1101/654566 ◽

2019 ◽

Cited By ~ 7

Author(s):

Glenn Hickey ◽

David Heller ◽

Jean Monlong ◽

Jonas A. Sibbesen ◽

Jouni Sirén ◽

...

Keyword(s):

De Novo ◽

State Of The Art ◽

Effective Means ◽

Point Mutations ◽

Structural Variants ◽

Short Read ◽

Yeast Strains ◽

Sequencing Studies ◽

Long Read

Abstract Structural variants (SVs) remain challenging to represent and study relative to point mutations despite their demonstrated importance. We show that variation graphs, as implemented in the vg toolkit, provide an effective means for leveraging SV catalogs for short-read SV genotyping experiments. We benchmarked vg against state-of-the-art SV genotypers using three sequence-resolved SV catalogs generated by recent long-read sequencing studies. In addition, we use assemblies from 12 yeast strains to show that graphs constructed directly from aligned de novo assemblies improve genotyping compared to graphs built from intermediate SV catalogs in the VCF format.


Genome Assembly of the Popular Korean Soybean Cultivar Hwangkeum

10.1101/2021.04.19.440529 ◽

2021 ◽

Author(s):

Myung-Shin Kim ◽

Taeyoung Lee ◽

Jeonghun Baek ◽

Ji Hong Kim ◽

Changhoon Kim ◽

...

Keyword(s):

Reference Genome ◽

De Novo ◽

Anthocyanin Biosynthesis ◽

Genetic Maps ◽

Soybean Cultivar ◽

Gene Arrangement ◽

Nucleotide Polymorphisms ◽

Sequencing Data ◽

Crop Species ◽

Satellite Repeats

Abstract Massive resequencing efforts have been undertaken to catalog allelic variants in major crop species including soybean, but the scope of the information for genetic variation often depends on short sequence reads mapped to the extant reference genome. Additional de novo assembled genome sequences provide a unique opportunity to explore a dispensable genome fraction in the pan-genome of a species. Here, we report the de novo assembly and annotation of Hwangkeum, a popular soybean cultivar in Korea. The assembly was constructed using PromethION nanopore sequencing data and two genetic maps, and was then error-corrected using Illumina short reads and PacBio SMRT reads. The 933.12 Mb assembly was annotated with 79,870 transcripts for 58,550 genes using RNA-Seq data and the public soybean annotation set. Comparison of the Hwangkeum assembly with the Williams 82 soybean reference genome sequence revealed 1.8 million single-nucleotide polymorphisms, 0.5 million indels, and 25 thousand putative structural variants. However, there was no natural megabase-scale chromosomal rearrangement. Incidentally, by adding two novel groups, we found that soybean contains four clearly separated groups of centromeric satellite repeats. Analyses of satellite repeats and gene content suggested that the Hwangkeum assembly is a high-quality assembly. This was further supported by comparison of the marker arrangement of anthocyanin biosynthesis genes and of gene arrangement at the Rsv3 locus. Therefore, the results indicate that the de novo assembly of Hwangkeum is a valuable additional reference genome resource for characterizing traits for the improvement of this important crop species.


Mapping and phasing of structural variation in patient genomes using nanopore sequencing

10.1101/129379 ◽

2017 ◽

Cited By ~ 4

Author(s):

Mircea Cretu Stancu ◽

Markus J. van Roosmalen ◽

Ivo Renkens ◽

Marleen Nieboer ◽

Sjors Middelkamp ◽

...

Keyword(s):

Single Molecule ◽

De Novo ◽

Structural Variants ◽

Human Genetic Disease ◽

Structural Genomic ◽

Short Read ◽

Sequencing Technologies ◽

Genome Wide ◽

Long Read ◽

Complex Structural

Abstract Structural genomic variants form a common type of genetic alteration underlying human genetic disease and phenotypic variation. Despite major improvements in genome sequencing technology and data analysis, the detection of structural variants still poses challenges, particularly when variants are of high complexity. Emerging long-read single-molecule sequencing technologies provide new opportunities for detection of structural variants. Here, we demonstrate sequencing of the genomes of two patients with congenital abnormalities using the ONT MinION at 11x and 16x mean coverage, respectively. We developed a bioinformatic pipeline - NanoSV - to efficiently map genomic structural variants (SVs) from the long-read data. We demonstrate that the nanopore data are superior to corresponding short-read data with regard to detection of de novo rearrangements originating from complex chromothripsis events in the patients. Additionally, genome-wide surveillance of SVs revealed 3,253 (33%) novel variants that were missed in short-read data of the same sample, the majority of which are duplications < 200 bp in size. Long sequencing reads enabled efficient phasing of genetic variations, allowing the construction of genome-wide maps of phased SVs and SNVs. We employed read-based phasing to show that all de novo chromothripsis breakpoints occurred on paternal chromosomes and we resolved the long-range structure of the chromothripsis. This work demonstrates the value of long-read sequencing for screening whole genomes of patients for complex structural variants.


De novo assembly of a Tibetan genome and identification of novel structural variants associated with high-altitude adaptation

National Science Review ◽

10.1093/nsr/nwz160 ◽

2019 ◽

Vol 7 (2) ◽

pp. 391-402 ◽

Cited By ~ 3

Author(s):

Yaoxi He ◽

Haiyi Lou ◽

Chaoying Cui ◽

Lian Deng ◽

Yang Gao ◽

...

Keyword(s):

High Altitude ◽

De Novo Assembly ◽

De Novo ◽

Population Analysis ◽

Extreme Environments ◽

Structural Variants ◽

Base Pairs ◽

High Quality ◽

Long Read ◽

Altitude Adaptation

Abstract Structural variants (SVs) may play important roles in human adaptation to extreme environments such as high altitude but have been under-investigated. Here, combining long-read sequencing with multiple scaffolding techniques, we assembled a high-quality Tibetan genome (ZF1), with a contig N50 length of 24.57 mega-base pairs (Mb) and a scaffold N50 length of 58.80 Mb. The ZF1 assembly filled 80 remaining N-gaps (0.25 Mb in total length) in the reference human genome (GRCh38). Markedly, we detected 17 900 SVs, among which the ZF1-specific SVs are enriched in GTPase activity that is required for activation of the hypoxic pathway. Further population analysis uncovered a 163-bp intronic deletion in the MKL1 gene showing large divergence between highland Tibetans and lowland Han Chinese. This deletion is significantly associated with lower systolic pulmonary arterial pressure, one of the key adaptive physiological traits in Tibetans. Moreover, with the use of the high-quality de novo assembly, we observed a much higher rate of genome-wide archaic hominid (Altai Neanderthal and Denisovan) shared non-reference sequences in ZF1 (1.32%–1.53%) compared to other East Asian genomes (0.70%–0.98%), reflecting a unique genomic composition of Tibetans. One such archaic hominid shared sequence—a 662-bp intronic insertion in the SCUBE2 gene—is enriched and associated with better lung function (the FEV1/FVC ratio) in Tibetans. Collectively, we generated the first high-resolution Tibetan reference genome, and the identified SVs may serve as valuable resources for future evolutionary and medical studies.
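The contig and scaffold N50 values quoted in this abstract follow the standard definition: the length L such that sequences of length L or longer together contain at least half of the assembled bases. A minimal Python sketch of that calculation is given here for readers unfamiliar with the metric; the function name and the toy lengths are illustrative and are not taken from the paper.

```python
def n50(lengths):
    """Return the N50 of a collection of contig or scaffold lengths."""
    if not lengths:
        raise ValueError("n50() requires at least one length")
    total = sum(lengths)
    running = 0
    # Walk from the longest sequence down until half the bases are covered.
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length


toy_assembly = [2, 3, 4, 5, 6, 7, 8, 9, 10]  # total = 54
toy_n50 = n50(toy_assembly)  # 10 + 9 + 8 = 27, which is half of 54
```

A higher N50 indicates a more contiguous assembly, which is why the abstract reports it separately for contigs and for scaffolds.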


Trio deep-sequencing does not reveal unexpected off-target and on-target mutations in Cas9-edited rhesus monkeys

Nature Communications ◽

10.1038/s41467-019-13481-y ◽

2019 ◽

Vol 10 (1) ◽

Cited By ~ 8

Author(s):

Xin Luo ◽

Yaoxi He ◽

Chao Zhang ◽

Xiechao He ◽

Lanzhen Yan ◽

...

Keyword(s):

Rhesus Monkeys ◽

De Novo ◽

Preclinical Model ◽

Structural Variants ◽

Sequencing Data ◽

De Novo Mutations ◽

Target Region ◽

Long Read ◽

Target Effect

Abstract CRISPR-Cas9 is a widely used genome editing tool, but its off-target effects and complex on-target mutations remain a concern, especially in view of future clinical applications. Non-human primates (NHPs) share close genetic and physiological similarities with humans, making them an ideal preclinical model for developing Cas9-based therapies. However, to our knowledge no comprehensive in vivo off-target and on-target assessment has been conducted in NHPs. Here, we perform whole genome trio sequencing of Cas9-treated rhesus monkeys. We find only a small number of de novo mutations that can be explained by expected spontaneous mutations, and no unexpected off-target mutations (OTMs) were detected. Furthermore, the long-read sequencing data does not detect large structural variants in the target region.


Closing Human Reference Genome Gaps: Identifying and Characterizing Gap-Closing Sequences

G3 Genes|Genome|Genetics ◽

10.1534/g3.120.401280 ◽

2020 ◽

Vol 10 (8) ◽

pp. 2801-2809 ◽

Cited By ~ 1

Author(s):

Tingting Zhao ◽

Zhongqu Duan ◽

Georgi Z. Genchev ◽

Hui Lu

Keyword(s):

Reference Genome ◽

De Novo ◽

Sequence Length ◽

Sequencing Data ◽

Human Reference Genome ◽

Satellite Sequences ◽

Long Read ◽

Data Gap ◽

Simple Repeats ◽

Gap Closing

Despite continuous updates of the human reference genome, there are still hundreds of unresolved gaps which account for about 5% of the total sequence length. Given the availability of whole genome de novo assemblies, especially those derived from long-read sequencing data, gap-closing sequences can be determined. By comparing 17 de novo long-read sequencing assemblies with the human reference genome, we identified a total of 1,125 gap-closing sequences for 132 (16.9% of 783) gaps and added up to 2.2 Mb of novel sequence to the human reference genome. More than 90% of the non-redundant sequences could be verified by unmapped reads from the Simons Genome Diversity Project dataset. In addition, 15.6% of the non-reference sequences were found in at least one of four non-human primate genomes. We further demonstrated that the non-redundant sequences had high content of simple repeats and satellite sequences. Moreover, 43 (32.6%) of the 132 closed gaps were shown to be polymorphic; such sequences may play an important biological role and can be useful in the investigation of human genetic diversity.
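The two percentages quoted in this abstract can be reproduced directly from the counts it reports. The short Python check below uses only those counts; the helper name `percent` is an illustrative choice.

```python
def percent(part, whole, digits=1):
    """Return part/whole as a percentage rounded to `digits` decimals."""
    return round(100 * part / whole, digits)


total_gaps = 783        # gaps in the reference, as stated in the abstract
closed_gaps = 132       # gaps with at least one gap-closing sequence
polymorphic_gaps = 43   # closed gaps shown to be polymorphic

closed_pct = percent(closed_gaps, total_gaps)             # share of gaps closed
polymorphic_pct = percent(polymorphic_gaps, closed_gaps)  # share of closed gaps
```

Both values agree with the figures given in the abstract (16.9% and 32.6%).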


Long-read assembly of the Chinese rhesus macaque genome and identification of ape-specific structural variants

Nature Communications ◽

10.1038/s41467-019-12174-w ◽

2019 ◽

Vol 10 (1) ◽

Cited By ~ 12

Author(s):

Yaoxi He ◽

Xin Luo ◽

Bin Zhou ◽

Ting Hu ◽

Xiaoyu Meng ◽

...

Keyword(s):

Rhesus Macaque ◽

Genome Assembly ◽

De Novo ◽

Gene Annotation ◽

Large Body ◽

Phenotypic Traits ◽

Structural Variants ◽

De Novo Genome Assembly ◽

Chinese Rhesus Macaque ◽

Long Read

Abstract We present a high-quality de novo genome assembly (rheMacS) of the Chinese rhesus macaque (Macaca mulatta) using long-read sequencing and multiplatform scaffolding approaches. Compared to the current Indian rhesus macaque reference genome (rheMac8), rheMacS increases sequence contiguity 75-fold, closing 21,940 of the remaining assembly gaps (60.8 Mbp). We improve gene annotation by generating more than two million full-length transcripts from ten different tissues by long-read RNA sequencing. We sequence resolve 53,916 structural variants (96% novel) and identify 17,000 ape-specific structural variants (ASSVs) based on comparison to ape genomes. Many ASSVs map within ChIP-seq predicted enhancer regions where apes and macaque show diverged enhancer activity and gene expression. We further characterize a subset that may contribute to ape- or great-ape-specific phenotypic traits, including taillessness, brain volume expansion, improved manual dexterity, and large body size. The rheMacS genome assembly serves as an ideal reference for future biomedical and evolutionary studies.
