AsmMix: A pipeline for high quality diploid de novo assembly

2021 ◽  
Author(s):  
Pei Wu ◽  
Chao Liu ◽  
Ou Wang ◽  
Xia Zhao ◽  
Fang Chen ◽  
...  

Abstract In this paper, we report a pipeline, AsmMix, that is capable of producing diploid genome assemblies that are both contiguous and high-quality. The pipeline consists of two steps. In the first step, two sets of assemblies are generated: one is based on co-barcoded reads and is highly accurate and haplotype-resolved but contains many gaps; the other is based on single-molecule sequencing reads and is contiguous but error-prone. In the second step, the two sets of assemblies are compared and integrated into a haplotype-resolved assembly with fewer errors. We test our pipeline on a dataset from the human genome NA24385, call variants from the resulting assemblies, and compare them against the GIAB benchmark. We show that the AsmMix pipeline can produce highly contiguous, accurate, and haplotype-resolved assemblies. In particular, the assembly mixing process effectively reduces small-scale errors in the long-read assembly.

2018 ◽  
Author(s):  
Ou Wang ◽  
Robert Chin ◽  
Xiaofang Cheng ◽  
Michelle Ka Wu ◽  
Qing Mao ◽  
...  

Obtaining accurate sequences from long DNA molecules is very important for genome assembly and other applications. Here we describe single tube long fragment read (stLFR), a technology that enables this at low cost. It is based on adding the same barcode sequence to sub-fragments of the original long DNA molecule (DNA co-barcoding). To achieve this efficiently, stLFR uses the surface of microbeads to create millions of miniaturized barcoding reactions in a single tube. Using a combinatorial process, up to 3.6 billion unique barcode sequences were generated on beads, enabling practically non-redundant co-barcoding with 50 million barcodes per sample. Using stLFR, we demonstrate efficient unique co-barcoding of over 8 million 20-300 kb genomic DNA fragments. Analysis of the human genome NA12878 with stLFR demonstrated high quality variant calling and phasing into contigs up to N50 34 Mb. We also demonstrate detection of complex structural variants and complete diploid de novo assembly of NA12878. These analyses were all performed using single stLFR libraries, and their construction did not significantly add to the time or cost of whole genome sequencing (WGS) library preparation. stLFR represents an easily automatable solution that enables high quality sequencing, phasing, SV detection, scaffolding, cost-effective diploid de novo genome assembly, and other long DNA sequencing applications.


Gigabyte ◽  
2021 ◽  
Vol 2021 ◽  
pp. 1-9
Author(s):  
Hailin Liu ◽  
Shigang Wu ◽  
Alun Li ◽  
Jue Ruan

Long-read single-molecule sequencing has revolutionized de novo genome assembly and enabled the automated reconstruction of reference-quality genomes. It has also been widely used to study structural variants, phase haplotypes and more. Here, we introduce the assembler SMARTdenovo, a single-molecule sequencing (SMS) assembler that follows the overlap-layout-consensus (OLC) paradigm. SMARTdenovo (RRID: SCR_017622) was designed to be a rapid assembler, which, unlike contemporaneous SMS assemblers, does not require highly accurate raw reads for error correction. It has performed well in the evaluation of congeneric assemblers and has been successfully used for various assembly projects. It is compatible with Canu for assembling high-quality genomes, and several of the assembly strategies in this program have been incorporated into subsequent popular assemblers. The assembler has been in use since 2015; here we provide information on the development of SMARTdenovo and how to implement its algorithms into current projects.


Author(s):  
Hailin Liu ◽  
Shigang Wu ◽  
Alun Li ◽  
Jue Ruan

Long-read single-molecule sequencing has revolutionized de novo genome assembly and enabled the automated reconstruction of reference-quality genomes. It has also been widely used to study structural variants, phase haplotypes and more. Here, we introduce the assembler SMARTdenovo, a single-molecule sequencing (SMS) assembler that follows the overlap-layout-consensus (OLC) paradigm. SMARTdenovo (RRID: SCR_017622) was designed to be a fast assembler that, unlike other contemporaneous SMS assemblers, does not require highly accurate raw reads for error correction. It has performed well in evaluations of congeneric assemblers and has been used successfully in a variety of assembly projects. It is compatible with Canu for assembling high-quality genomes, and several of the assembly strategies in this program have been incorporated into subsequent popular assemblers. The assembler has been in use since 2015, and here we provide information on the development of SMARTdenovo and how to implement its algorithms into current projects.


F1000Research ◽  
2021 ◽  
Vol 10 ◽  
pp. 961
Author(s):  
Kevin McKernan ◽  
Liam Kane ◽  
Yvonne Helbert ◽  
Lei Zhang ◽  
Nathan Houde ◽  
...  

The Psilocybe genus is well known for the synthesis of valuable psychoactive compounds such as Psilocybin, Psilocin, Baeocystin and Aeruginascin. The ubiquity of Psilocybin synthesis in Psilocybe has been attributed to a horizontal gene transfer mechanism of a ~20Kb gene cluster. A recently published highly contiguous reference genome derived from long read single molecule sequencing has underscored interesting variation in this Psilocybin synthesis gene cluster. This reference genome has also enabled the shotgun sequencing of spores from many Psilocybe strains to better catalog the genomic diversity in the Psilocybin synthesis pathway. Here we present the de novo assembly of 81 Psilocybe genomes compared to the P.envy reference genome. Surprisingly, the genomes of Psilocybe galindoi, Psilocybe tampanensis and Psilocybe azurescens lack sequence coverage over the previously described Psilocybin synthesis pathway but do demonstrate amino acid sequence homology to a less contiguous gene cluster and may illuminate the previously proposed evolution of psilocybin synthesis.


mBio ◽  
2016 ◽  
Vol 7 (1) ◽  
Author(s):  
Yu-Chih Tsai ◽  
Sean Conlan ◽  
Clayton Deming ◽  
Julia A. Segre ◽  
Heidi H. Kong ◽  
...  

ABSTRACT Deep metagenomic shotgun sequencing has emerged as a powerful tool to interrogate composition and function of complex microbial communities. Computational approaches to assemble genome fragments have been demonstrated to be an effective tool for de novo reconstruction of genomes from these communities. However, the resultant “genomes” are typically fragmented and incomplete due to the limited ability of short-read sequence data to assemble complex or low-coverage regions. Here, we use single-molecule, real-time (SMRT) sequencing to reconstruct a high-quality, closed genome of a previously uncharacterized Corynebacterium simulans and its companion bacteriophage from a skin metagenomic sample. Considerable improvement in assembly quality occurs in hybrid approaches incorporating short-read data, with even relatively small amounts of long-read data being sufficient to improve metagenome reconstruction. Using short-read data to evaluate strain variation of this C. simulans in its skin community at single-nucleotide resolution, we observed a dominant C. simulans strain with moderate allelic heterozygosity throughout the population. We demonstrate the utility of SMRT sequencing and hybrid approaches in metagenome quantitation, reconstruction, and annotation. IMPORTANCE The species comprising a microbial community are often difficult to deconvolute due to technical limitations inherent to most short-read sequencing technologies. Here, we leverage new advances in sequencing technology, single-molecule sequencing, to significantly improve reconstruction of a complex human skin microbial community. With this long-read technology, we were able to reconstruct and annotate a closed, high-quality genome of a previously uncharacterized skin species. We demonstrate that hybrid approaches with short-read technology are sufficiently powerful to reconstruct even single-nucleotide polymorphism level variation of species in this community.


2021 ◽  
Author(s):  
Ying Chen ◽  
Jiajun Qiu ◽  
Yingwei Wu ◽  
Huan Jia ◽  
Yi Jiang ◽  
...  

Abstract Background: POU3F4 is the causative gene for X-linked deafness-2 (DFNX2), characterized by incomplete partition type III (IP-III) malformation of the inner ear. The aim of this study was to investigate the clinical characteristics and molecular findings by Sanger or Nanopore single-molecule sequencing in IP-III patients. Methods: Diagnosis of IP-III was mainly based on clinical characteristics including radiological and audiological findings. Sanger sequencing of POU3F4 was carried out for these IP-III patients. For patients with negative POU3F4 Sanger sequencing results, Nanopore long-read single-molecule sequencing was used to identify possible pathogenic variants. Hearing intervention outcomes of hearing aid fitting and cochlear implantation were also analyzed. Patients were grouped by the location of their POU3F4 variants, and aided PTA was compared between patients whose variants were located in the exon region and those whose variants were located in the upstream region. Results: In total, 18 male patients from 14 unrelated families were diagnosed with IP-III. Ten variants were identified in POU3F4 by Sanger sequencing, and 9 of these were novel (p.Val321Gly, p.Gln181*, p.Cys233*, p.Val215Gly, p.Arg282Gln, p.Trp57*, p.Gln316*, c.903_912 delins TGCCA and p.Arg205del). Four different deletions (DELs) that varied from 80 to 486 kb were identified 876-1503 kb upstream of POU3F4 by Nanopore long-read single-molecule sequencing. De novo genetic mutations occurred in 21.4% (3/14) of patients with POU3F4 mutations. Of these 18 patients, 7 had bilateral hearing aids (HAs) and 10 received unilateral cochlear implantation (CI). The mean aided pure tone average (PTA) for HA and CI users was 41.1±5.18 and 40.3±7.59 dB HL, respectively. The mean PTAs of patients whose variants were located in the exon and upstream regions were 39.6±6.31 and 43.0±7.10 dB HL, a difference that was not significant (p=0.342). Conclusions: Among IP-III patients, 28.6% (4/14) had no definite mutation in the exon region of POU3F4; however, possible pathogenic deletions were identified in the upstream region of this gene. De novo genetic mutations occurred in 21.4% (3/14) of patients with POU3F4 mutations. Hearing intervention outcomes of IP-III patients did not differ according to whether the variants were located in the exon or upstream regions.


2014 ◽  
Author(s):  
Konstantin Berlin ◽  
Sergey Koren ◽  
Chen-Shan Chin ◽  
James Drake ◽  
Jane M Landolin ◽  
...  

We report reference-grade de novo assemblies of four model organisms and the human genome from single-molecule, real-time (SMRT) sequencing. Long-read SMRT sequencing is routinely used to finish microbial genomes, but the available assembly methods have not scaled well to larger genomes. Here we introduce the MinHash Alignment Process (MHAP) for efficient overlapping of noisy, long reads using probabilistic, locality-sensitive hashing. Together with Celera Assembler, MHAP was used to reconstruct the genomes of Escherichia coli, Saccharomyces cerevisiae, Arabidopsis thaliana, Drosophila melanogaster, and human from high-coverage SMRT sequencing. The resulting assemblies include fully resolved chromosome arms and close persistent gaps in these important reference genomes, including heterochromatic and telomeric transition sequences. For D. melanogaster, MHAP achieved a 600-fold speedup relative to prior methods and a cloud computing cost of a few hundred dollars. These results demonstrate that single-molecule sequencing alone can produce near-complete eukaryotic genomes at modest cost.
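
The MinHash idea behind MHAP can be illustrated with a minimal sketch. This is not the MHAP implementation: the k-mer size, the number of hash functions, and the use of seeded BLAKE2b as the hash family are all assumptions chosen for illustration. Each read is reduced to a short sketch of minimum hash values, and the fraction of matching sketch positions estimates the Jaccard similarity of the two reads' k-mer sets.

```python
import hashlib


def kmers(seq, k):
    """Return the set of all k-mers in seq (seq must be at least k long)."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def _hash(kmer, seed):
    """Seeded 64-bit hash; the seed selects one member of the hash family."""
    data = f"{seed}:{kmer}".encode()
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def minhash_sketch(seq, k=5, num_hashes=64):
    """For each hash function, keep only the minimum hash over all k-mers."""
    ks = kmers(seq, k)
    return [min(_hash(km, seed) for km in ks) for seed in range(num_hashes)]


def estimate_jaccard(sketch_a, sketch_b):
    """Fraction of sketch positions that agree, an estimate of Jaccard similarity."""
    matches = sum(1 for x, y in zip(sketch_a, sketch_b) if x == y)
    return matches / len(sketch_a)


def exact_jaccard(seq_a, seq_b, k=5):
    """Exact Jaccard similarity of the two k-mer sets, for comparison."""
    a, b = kmers(seq_a, k), kmers(seq_b, k)
    return len(a & b) / len(a | b)
```

Comparing two fixed-size sketches costs time proportional to the sketch length rather than the read length, which is what makes this kind of estimate attractive for all-against-all overlap detection among noisy long reads.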


2018 ◽  
Author(s):  
Jay Ghurye ◽  
Sergey Koren ◽  
Scott T Small ◽  
Seth Redmond ◽  
Paul Howell ◽  
...  

Background: Anopheles funestus is one of the three most consequential and widespread vectors of human malaria in tropical Africa. However, the lack of a high-quality reference genome has hindered the association of phenotypic traits with their genetic basis in this important mosquito. Findings: Here we present a new high-quality An. funestus reference genome (AfunF3) assembled using 240x coverage of long-read single-molecule sequencing for contigging, combined with 100x coverage of short-read Hi-C data for chromosome scaffolding. The assembled contigs total 446 Mbp of sequence and contain substantial duplication due to alternative alleles present in the sequenced pool of mosquitos from the FUMOZ colony. Using alignment and depth-of-coverage information, these contigs were deduplicated to a 211 Mbp primary assembly, which is closer to the expected haploid genome size of 250 Mbp. This primary assembly consists of 1,053 contigs organized into 3 chromosome-scale scaffolds with an N50 contig size of 632 kbp and an N50 scaffold size of 93.811 Mbp, representing a 100-fold improvement in continuity versus the current reference assembly, AfunF1. Conclusion: This highly contiguous and complete An. funestus reference genome assembly will serve as an improved basis for future studies of genomic variation and organization in this important disease vector.
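
The N50 statistic quoted above can be computed with a short sketch like the following. It uses the common definition (the length of the shortest sequence in the smallest set of longest sequences that together cover at least half of the total assembly length); the function name and the example lengths are illustrative and are not taken from the AfunF3 assembly.

```python
def n50(lengths):
    """Return the N50 of a list of contig or scaffold lengths.

    Sequences are visited from longest to shortest; the N50 is the length
    at which the running total first reaches half of the overall total.
    """
    if not lengths:
        raise ValueError("n50 requires at least one length")
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length


# Illustrative lengths only: the total is 20, and 6 + 5 = 11 covers half of it.
print(n50([2, 3, 4, 5, 6]))
```

Because N50 depends on the total assembled length, deduplicating alternative-allele contigs, as described in the abstract, changes the value even when no contig is joined or split.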


2019 ◽  
Vol 10 (1) ◽  
Author(s):  
Peter Edge ◽  
Vikas Bansal

Abstract Whole-genome sequencing using sequencing technologies such as Illumina enables the accurate detection of small-scale variants but provides limited information about haplotypes and variants in repetitive regions of the human genome. Single-molecule sequencing (SMS) technologies such as Pacific Biosciences and Oxford Nanopore generate long reads that can potentially address the limitations of short-read sequencing. However, the high error rate of SMS reads makes it challenging to detect small-scale variants in diploid genomes. We introduce a variant calling method, Longshot, which leverages the haplotype information present in SMS reads to accurately detect and phase single-nucleotide variants (SNVs) in diploid genomes. We demonstrate that Longshot achieves very high accuracy for SNV detection using whole-genome Pacific Biosciences data, outperforms existing variant calling methods, and enables variant detection in duplicated regions of the genome that cannot be mapped using short reads.


2021 ◽  
Vol 17 (6) ◽  
pp. e1009078
Author(s):  
Jingwen Ren ◽  
Mark J. P. Chaisson

It is computationally challenging to detect variation by aligning single-molecule sequencing (SMS) reads, or contigs from SMS assemblies. One approach to efficiently align SMS reads is sparse dynamic programming (SDP), where optimal chains of exact matches are found between the sequence and the genome. While straightforward implementations of SDP penalize gaps with a cost that is a linear function of gap length, biological variation is more accurately represented when gap cost is a concave function of gap length. We have developed a method, lra, that uses SDP with a concave-cost gap penalty, and used lra to align long-read sequences from PacBio and Oxford Nanopore (ONT) instruments as well as de novo assembly contigs. This alignment approach increases sensitivity and specificity for SV discovery, particularly for variants above 1 kb and when discovering variation from ONT reads, while having runtimes that are comparable (1.05-3.76×) to current methods. When applied to calling variation from de novo assembly contigs, there is a 3.2% increase in Truvari F1 score compared to minimap2+htsbox. lra is available in bioconda (https://anaconda.org/bioconda/lra) and github (https://github.com/ChaissonLab/LRA).
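
The difference between linear and concave gap costs can be seen in a small chaining sketch. This is a naive quadratic-time illustration, not lra's sparse dynamic programming algorithm, and both cost functions (`linear_gap`, `concave_gap`) and their constants are assumptions chosen for illustration rather than lra's actual penalties. Anchors are exact matches given as `(read_pos, ref_pos, length)`.

```python
import math


def linear_gap(d):
    """Gap cost that grows in proportion to gap length (assumed constant 0.5)."""
    return 0.5 * d


def concave_gap(d):
    """Gap cost that grows logarithmically, so long gaps are penalized less steeply."""
    return 2.0 * math.log2(1 + d)


def chain_score(anchors, gap_cost):
    """Best chain score over anchors, by naive O(n^2) dynamic programming."""
    anchors = sorted(anchors)
    best = []
    for i, (qi, ri, li) in enumerate(anchors):
        score = float(li)  # a chain may start at this anchor
        for j in range(i):
            qj, rj, lj = anchors[j]
            # The predecessor must end before this anchor on both sequences.
            if qj + lj <= qi and rj + lj <= ri:
                gap = abs((ri - rj) - (qi - qj))  # diagonal shift = indel size
                score = max(score, best[j] + li - gap_cost(gap))
        best.append(score)
    return max(best, default=0.0)


# Two 50 bp anchors separated by a 1000 bp deletion in the read.
anchors = [(0, 0, 50), (50, 1050, 50)]
print(chain_score(anchors, linear_gap))   # the chain breaks: a single anchor scores best
print(chain_score(anchors, concave_gap))  # both anchors are chained across the deletion
```

Under the linear cost the 1000 bp gap outweighs the second anchor, so the chain is split; under the concave cost the two anchors stay in one chain, which is the behaviour that helps an aligner span a large structural variant.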

