Long-read based de novo assembly of low-complexity metagenome samples results in finished genomes and reveals insights into strain diversity and an active phage system

Abstract Long-read de novo genome assembly continues to advance rapidly. However, there is a lack of effective tools to accurately evaluate the assembly results, especially for structural errors. We present Inspector, a reference-free long-read de novo assembly evaluator which faithfully reports types of errors and their precise locations. Notably, Inspector can correct the assembly errors based on consensus sequences derived from raw reads covering erroneous regions. Based on in silico and long-read assembly results from multiple long-read data and assemblers, we demonstrate that in addition to providing generic metrics, Inspector can accurately identify both large-scale and small-scale assembly errors.

Oxford Nanopore Sequencing, Hybrid Error Correction, and de novo Assembly of a Eukaryotic Genome

10.1101/013490 ◽

2015 ◽

Cited By ~ 23

Author(s):

Sara Goodwin ◽

James Gurtowski ◽

Scott Ethe-Sayers ◽

Panchajanya Deshpande ◽

Michael Schatz ◽

...

Keyword(s):

Error Correction ◽

De Novo Assembly ◽

De Novo ◽

Correction Algorithm ◽

Membrane Pore ◽

Complete Representation ◽

Oxford Nanopore ◽

Long Read ◽

Error Correction Algorithm ◽

Sequencing Instrument

Monitoring the progress of DNA molecules through a membrane pore has been postulated as a method for sequencing DNA for several decades. Recently, a nanopore-based sequencing instrument, the Oxford Nanopore MinION, has become available that we used for sequencing the S. cerevisiae genome. To make use of these data, we developed a novel open-source hybrid error correction algorithm Nanocorr (https://github.com/jgurtowski/nanocorr) specifically for Oxford Nanopore reads, as existing packages were incapable of assembling the long read lengths (5-50 kbp) at such high error rates (between ~5 and 40% error). With this new method we were able to perform a hybrid error correction of the nanopore reads using complementary MiSeq data and produce a de novo assembly that is highly contiguous and accurate: the contig N50 length is more than ten times greater than that of an Illumina-only assembly (678 kbp versus 59.9 kbp), and the assembly has greater than 99.88% consensus identity when compared to the reference. Furthermore, the assembly with the long nanopore reads presents a much more complete representation of the features of the genome and correctly assembles gene cassettes, rRNAs, transposable elements, and other genomic features that were almost entirely absent in the Illumina-only assembly.
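Several abstracts in this listing compare assemblies by contig N50. As a reminder of what that metric measures, here is a minimal sketch in Python using the common definition: the length of the contig at which the running total of contig lengths, sorted longest first, reaches half of the total assembly length. The function name `n50` and the toy contig lengths are illustrative assumptions and are not taken from any of the papers listed here.

```python
def n50(lengths):
    """Return the N50 of a collection of contig lengths.

    Contigs are sorted longest first; N50 is the length of the contig
    at which the running total first reaches half of the assembly size.
    """
    if not lengths:
        raise ValueError("need at least one contig length")
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        # Compare 2 * running with total to avoid fractional halves.
        if 2 * running >= total:
            return length


# Hypothetical toy assembly (lengths in kbp), not real data.
toy_contigs = [80, 70, 50, 40, 30, 20, 10]
print(n50(toy_contigs))
```

For the toy input the total is 300, and the two longest contigs (80 and 70) together reach 150, so the function returns 70. A higher N50 means that more of the assembly sits in long contigs, which is why the figure is reported as a contiguity measure.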

De novo assembly of a Tibetan genome and identification of novel structural variants associated with high-altitude adaptation

National Science Review ◽

10.1093/nsr/nwz160 ◽

2019 ◽

Vol 7 (2) ◽

pp. 391-402 ◽

Cited By ~ 3

Author(s):

Yaoxi He ◽

Haiyi Lou ◽

Chaoying Cui ◽

Lian Deng ◽

Yang Gao ◽

...

Keyword(s):

High Altitude ◽

De Novo Assembly ◽

De Novo ◽

Population Analysis ◽

Extreme Environments ◽

Structural Variants ◽

Base Pairs ◽

High Quality ◽

Long Read ◽

Altitude Adaptation

Abstract Structural variants (SVs) may play important roles in human adaptation to extreme environments such as high altitude but have been under-investigated. Here, combining long-read sequencing with multiple scaffolding techniques, we assembled a high-quality Tibetan genome (ZF1), with a contig N50 length of 24.57 mega-base pairs (Mb) and a scaffold N50 length of 58.80 Mb. The ZF1 assembly filled 80 remaining N-gaps (0.25 Mb in total length) in the reference human genome (GRCh38). Markedly, we detected 17 900 SVs, among which the ZF1-specific SVs are enriched in GTPase activity that is required for activation of the hypoxic pathway. Further population analysis uncovered a 163-bp intronic deletion in the MKL1 gene showing large divergence between highland Tibetans and lowland Han Chinese. This deletion is significantly associated with lower systolic pulmonary arterial pressure, one of the key adaptive physiological traits in Tibetans. Moreover, with the use of the high-quality de novo assembly, we observed a much higher rate of genome-wide archaic hominid (Altai Neanderthal and Denisovan) shared non-reference sequences in ZF1 (1.32%–1.53%) compared to other East Asian genomes (0.70%–0.98%), reflecting a unique genomic composition of Tibetans. One such archaic hominid shared sequence—a 662-bp intronic insertion in the SCUBE2 gene—is enriched and associated with better lung function (the FEV1/FVC ratio) in Tibetans. Collectively, we generated the first high-resolution Tibetan reference genome, and the identified SVs may serve as valuable resources for future evolutionary and medical studies.

High contiguity long read assembly of Brassica nigra allows localization of active centromeres and provides insights into the ancestral Brassica genome

10.1101/2020.02.03.932665 ◽

2020 ◽

Cited By ~ 5

Author(s):

Sampath Perumal ◽

Chu Shin Koh ◽

Lingling Jin ◽

Miles Buchwaldt ◽

Erin Higgins ◽

...

Keyword(s):

De Novo ◽

Low Complexity ◽

Error Rates ◽

Brassica Nigra ◽

Genome Integrity ◽

Ancestral Genome ◽

Genomic Distance ◽

Long Read ◽

Genome Assemblies ◽

Technology Comparison

Abstract High-quality nanopore genome assemblies were generated for two genotypes (Ni100 and CN115125) of Brassica nigra, a member of the agronomically important Brassica species. The N50 contig lengths for the two assemblies were 17.1 Mb (58 contigs) and 0.29 Mb (963 contigs), respectively, reflecting recent improvements in the technology. Comparison with a de novo short-read assembly for Ni100 corroborated genome integrity and quantified sequence-related error rates (0.002%). The contiguity and coverage allowed unprecedented access to low-complexity regions of the genome. Pericentromeric regions and coincidence of hypo-methylation enabled localization of active centromeres and identified a novel centromere-associated ALE class I element which appears to have proliferated through relatively recent nested transposition events (<1 million years ago). Computational abstraction was used to define a post-triplication Brassica-specific ancestral genome and to calculate the extensive rearrangements that define the genomic distance separating B. nigra from its diploid relatives.

De novo assembly of Dekkera bruxellensis: a multi technology approach using short and long-read sequencing and optical mapping

GigaScience ◽

10.1186/s13742-015-0094-1 ◽

2015 ◽

Vol 4 (1) ◽

Cited By ~ 17

Author(s):

Remi-Andre Olsen ◽

Ignas Bunikis ◽

Ievgeniia Tiukova ◽

Kicki Holmberg ◽

Britta Lötstedt ◽

...

Keyword(s):

De Novo Assembly ◽

De Novo ◽

Optical Mapping ◽

Dekkera Bruxellensis ◽

Long Read

De novo diploid genome assembly for genome-wide structural variant detection

NAR Genomics and Bioinformatics ◽

10.1093/nargab/lqz018 ◽

2019 ◽

Vol 2 (1) ◽

Author(s):

Lu Zhang ◽

Xin Zhou ◽

Ziming Weng ◽

Arend Sidow

Keyword(s):

De Novo Assembly ◽

De Novo ◽

Pairwise Alignment ◽

Cost Effective ◽

Difficult Problem ◽

Ancestral State ◽

Fundamental Limitations ◽

Human Genomes ◽

Genome Wide ◽

Long Read

Abstract Detection of structural variants (SVs) on the basis of read alignment to a reference genome remains a difficult problem. De novo assembly, traditionally used to generate reference genomes, offers an alternative for SV detection. However, it has not been applied broadly to human genomes because of fundamental limitations of short-fragment approaches and high cost of long-read technologies. We here show that 10× linked-read sequencing supports accurate SV detection. We examined variants in six de novo 10× assemblies with diverse experimental parameters from two commonly used human cell lines: NA12878 and NA24385. The assemblies are effective for detecting mid-size SVs, which were discovered by simple pairwise alignment of the assemblies’ contigs to the reference (hg38). Our study also shows that the base-pair level SV breakpoint accuracy is high, with a majority of SVs having precisely correct sizes and breakpoints. Setting the ancestral state of SV loci by comparing to ape orthologs allows inference of the actual molecular mechanism (insertion or deletion) causing the mutation. In about half of cases, the mechanism is the opposite of the reference-based call. We uncover 214 SVs that may have been maintained as polymorphisms in the human lineage since before our divergence from chimp. Overall, we show that de novo assembly of 10× linked-read data can achieve cost-effective SV detection for personal genomes.
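The abstract's point about ancestral state can be made concrete with a small sketch. A reference-based call only says whether the sample has more or less sequence than the reference; whether the mutation was an insertion or a deletion depends on which allele the outgroup carries. The function `polarize`, its argument encoding, and its return values below are hypothetical illustrations of that logic, not the authors' implementation.

```python
def polarize(ref_based_call, ancestral_has_sequence):
    """Infer the mutational mechanism of an SV from the ancestral state.

    ref_based_call: "insertion" if the sample carries sequence that the
        reference lacks, "deletion" if the sample lacks sequence that the
        reference carries.
    ancestral_has_sequence: True if the outgroup (e.g. an ape ortholog)
        carries the sequence at this locus.

    Returns (mechanism, derived_allele), where derived_allele names the
    genome ("sample" or "reference") that carries the mutated state.
    """
    if ref_based_call not in ("insertion", "deletion"):
        raise ValueError("ref_based_call must be 'insertion' or 'deletion'")

    sample_has_sequence = ref_based_call == "insertion"
    # If the ancestor had the sequence, the event was a loss (deletion);
    # otherwise it was a gain (insertion).
    mechanism = "deletion" if ancestral_has_sequence else "insertion"
    # The derived allele is the one that differs from the ancestral state.
    derived = "reference" if sample_has_sequence == ancestral_has_sequence else "sample"
    return mechanism, derived


# A reference-based "insertion" whose sequence is present in the outgroup
# is really a deletion carried by the reference allele.
print(polarize("insertion", True))
```

In this encoding the mechanism depends only on the ancestral state, while the reference-based call decides which genome carries the derived allele. That is how a call can end up opposite to the inferred mechanism.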

De novo assembly of the olive fruit fly (Bactrocera oleae) genome with linked-reads and long-read technologies minimizes gaps and provides exceptional Y chromosome assembly

BMC Genomics ◽

10.1186/s12864-020-6672-3 ◽

2020 ◽

Vol 21 (1) ◽

Cited By ~ 2

Author(s):

Anthony Bayega ◽

Haig Djambazian ◽

Konstantina T. Tsoumani ◽

Maria-Eleni Gregoriou ◽

Efthimia Sagri ◽

...

Keyword(s):

Y Chromosome ◽

De Novo Assembly ◽

De Novo ◽

Fruit Fly ◽

Bactrocera Oleae ◽

Olive Fruit Fly ◽

Olive Fruit ◽

Long Read

Tools and Strategies for Long-Read Sequencing and De Novo Assembly of Plant Genomes

Trends in Plant Science ◽

10.1016/j.tplants.2019.05.003 ◽

2019 ◽

Vol 24 (8) ◽

pp. 700-724 ◽

Cited By ~ 23

Author(s):

Hyungtaek Jung ◽

Christopher Winefield ◽

Aureliano Bombarely ◽

Peter Prentis ◽

Peter Waterhouse

Keyword(s):

De Novo Assembly ◽

De Novo ◽

Plant Genomes ◽

Long Read

Detection and assembly of novel sequence insertions using Linked-Read technology

10.1101/551028 ◽

2019 ◽

Cited By ~ 3

Author(s):

Dmitry Meleshko ◽

Patrick Marks ◽

Stephen Williams ◽

Iman Hajirasouliha

Keyword(s):

Dna Sequences ◽

De Novo Assembly ◽

De Novo ◽

Supplementary Information ◽

Computational Techniques ◽

Whole Genome ◽

Structural Variations ◽

Short Read ◽

Link Type ◽

Long Read

Abstract Motivation: Emerging Linked-Read (aka read-cloud) technologies such as the 10x Genomics Chromium system have great potential for accurate detection and phasing of large-scale human genome structural variations (SVs). By leveraging the long-range information encoded in Linked-Read sequencing, computational techniques are able to detect and characterize complex structural variations that were previously undetectable by short-read methods. However, there is no available Linked-Read method for detection and assembly of novel sequence insertions, DNA sequences present in a given sequenced sample but missing in the reference genome, without requiring whole genome de novo assembly. In this paper, we propose a novel integrated alignment-based and local-assembly-based algorithm, Novel-X, that effectively uses the barcode information encoded in Linked-Read sequencing datasets to improve detection of such events without the need for whole genome de novo assembly. We evaluated our method on two haploid human genomes, CHM1 and CHM13, sequenced on the 10x Genomics Chromium system. These genomes have also recently been characterized with high-coverage PacBio long reads. We also tested our method on NA12878, the well-known HapMap CEPH diploid genome, and the child genome in a Yoruba trio (NA19240), which was recently studied on multiple sequencing platforms. Detecting insertion events is very challenging using short reads, and the only viable available solution is long-read sequencing (e.g. PacBio or ONT). Our experiments, however, show that Novel-X finds many insertions that cannot be found by state-of-the-art tools using short-read sequencing data but are present in PacBio data. Since Linked-Read sequencing is significantly cheaper than long-read sequencing, our method using Linked-Reads enables routine large-scale screenings of sequenced genomes for novel sequence insertions. Availability: Software is freely available at https://github.com/1dayac/[email protected] Supplementary information: Supplementary data are available at https://github.com/1dayac/novel_insertions_supplementary
