ABySS 2.0: Resource-Efficient Assembly of Large Genomes using a Bloom Filter

2016 ◽  
Author(s):  
Shaun D Jackman ◽  
Benjamin P Vandervalk ◽  
Hamid Mohamadi ◽  
Justin Chu ◽  
Sarah Yeo ◽  
...  

Abstract
The assembly of DNA sequences de novo is fundamental to genomics research. It is the first of many steps towards elucidating and characterizing whole genomes. Downstream applications, including analysis of genomic variation between species and between or within individuals, critically depend on robustly assembled sequences. In the span of a single decade, the sequence throughput of leading DNA sequencing instruments has increased drastically. Coupled with established and planned large-scale, personalized-medicine initiatives to sequence genomes in the thousands and even millions, the development of efficient, scalable and accurate bioinformatics tools for producing high-quality draft reference genomes is timely.

With ABySS 1.0, we originally showed that assembling the human genome from short 50 bp sequencing reads was possible by aggregating the half terabyte of compute memory needed over several computers using a standardized message-passing interface (MPI). We present here its re-design, which departs from MPI and instead implements algorithms that employ a Bloom filter, a probabilistic data structure, to represent the de Bruijn graph and reduce memory requirements.

We present assembly benchmarks of human Genome in a Bottle 250 bp Illumina paired-end and 6 kbp mate-pair libraries from a single individual, yielding an NG50 (NGA50) scaffold contiguity of 3.5 (3.0) Mbp using less than 35 GB of RAM, a modest memory requirement by today's standards that is often available on a single computer. We also investigate the use of BioNano Genomics and 10x Genomics Chromium data to further improve the scaffold contiguity of this assembly to 42 (15) Mbp.
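To illustrate the core idea of the abstract above (this is a minimal sketch, not the ABySS 2.0 implementation, which is written in C++), a Bloom filter can represent a de Bruijn graph implicitly: only the k-mers are inserted, and graph edges are recovered on demand by querying the four possible successors of a node. The class, sizes, and hash scheme below are illustrative assumptions.

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter: membership queries may give false positives,
    never false negatives, using far less memory than an explicit set."""
    def __init__(self, size_bits=1 << 20, num_hashes=3):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits // 8)

    def _positions(self, item):
        # Derive several bit positions by salting one cryptographic hash.
        for i in range(self.num_hashes):
            h = hashlib.blake2b(item.encode(), salt=i.to_bytes(8, "little")).digest()
            yield int.from_bytes(h[:8], "little") % self.size

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))

def neighbors(bf, kmer):
    """Implicit de Bruijn graph: successors are queried, never stored."""
    return [kmer[1:] + b for b in "ACGT" if (kmer[1:] + b) in bf]

# Insert all 5-mers of a toy read; edges remain implicit in the filter.
bf = BloomFilter()
read, k = "ACGTACGTAC", 5
for i in range(len(read) - k + 1):
    bf.add(read[i:i + k])
print(neighbors(bf, "ACGTA"))  # expected ['CGTAC'], barring false positives
```

The memory saving comes from never materializing edges: a query costs a few hash evaluations, and the false-positive rate is tuned by the bit-array size and hash count.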

2017 ◽  
Author(s):  
Patrick Marks ◽  
Sarah Garcia ◽  
Alvaro Martinez Barrio ◽  
Kamila Belhocine ◽  
Jorge Bernate ◽  
...  

Abstract
Large-scale population-based analyses coupled with advances in technology have demonstrated that the human genome is more diverse than originally thought. To date, this diversity has largely been uncovered using short-read whole-genome sequencing. However, standard short-read approaches, used primarily for their accuracy, throughput and cost, fail to give a complete picture of a genome: they struggle to identify large, balanced structural events, cannot access repetitive regions of the genome, and fail to resolve the human genome into its two haplotypes. Here we describe an approach that retains long-range information while harnessing the advantages of short reads. Starting from only ∼1 ng of DNA, we produce barcoded short-read libraries. Novel informatic approaches allow the barcoded short reads to be associated with their long molecules of origin, producing a novel data type known as 'Linked-Reads'. This approach allows simultaneous detection of small and large variants from a single Linked-Read library. We have previously demonstrated the utility of whole-genome Linked-Reads (lrWGS) for performing diploid, de novo assembly of individual genomes (Weisenfeld et al. 2017). In this manuscript, we show the advantages of Linked-Reads over standard short-read approaches for reference-based analysis. We demonstrate the ability of Linked-Reads to reconstruct megabase-scale haplotypes and to recover parts of the genome that are typically inaccessible to short reads, including phenotypically important genes such as STRC, SMN1 and SMN2. We demonstrate the ability of both lrWGS and Linked-Read whole-exome sequencing (lrWES) to identify complex structural variations, including balanced events, single-exon deletions, and single-exon duplications. The data presented here show that Linked-Reads provide a scalable approach for comprehensive genome analysis that is not possible using short reads alone.
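To sketch how barcoded reads can be re-associated with their long molecules of origin, the toy function below groups aligned reads by barcode and splits each group wherever consecutive mapping positions are farther apart than a gap threshold. The function name, the 50 kb threshold, and the data are hypothetical illustrations, not the 10x Genomics algorithm.

```python
from collections import defaultdict

def infer_molecules(alignments, max_gap=50_000):
    """Group aligned reads by barcode, then split each barcode's reads into
    putative source molecules wherever the gap between consecutive mapping
    positions exceeds max_gap (a hypothetical threshold).
    alignments: iterable of (barcode, position); returns (barcode, start, end)."""
    by_barcode = defaultdict(list)
    for barcode, pos in alignments:
        by_barcode[barcode].append(pos)
    molecules = []
    for barcode, positions in by_barcode.items():
        positions.sort()
        start = prev = positions[0]
        for pos in positions[1:]:
            if pos - prev > max_gap:
                # Gap too large to come from one molecule: close the current one.
                molecules.append((barcode, start, prev))
                start = pos
            prev = pos
        molecules.append((barcode, start, prev))
    return molecules

# Toy alignments: (barcode, mapped position) on one chromosome.
reads = [("AAA", 100), ("AAA", 5_000), ("AAA", 900_000), ("TTT", 2_000)]
print(infer_molecules(reads))
# [('AAA', 100, 5000), ('AAA', 900000, 900000), ('TTT', 2000, 2000)]
```

The recovered molecule extents are what give Linked-Reads their long-range information for phasing and structural-variant detection.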


2000 ◽  
Vol 113 (18) ◽  
pp. 3207-3216 ◽  
Author(s):  
E. Csonka ◽  
I. Cserpan ◽  
K. Fodor ◽  
G. Hollo ◽  
R. Katona ◽  
...  

An in vivo approach has been developed for the generation of artificial chromosomes, based on the induction of intrinsic, large-scale amplification mechanisms of mammalian cells. Here, we describe the successful generation of prototype human satellite DNA-based artificial chromosomes via amplification-dependent de novo chromosome formation induced by integration of exogenous DNA sequences into the centromeric/rDNA regions of human acrocentric chromosomes. Subclones with mitotically stable de novo chromosomes were established, which allowed the initial characterization and purification of these artificial chromosomes. Because of the low complexity of their DNA content, they may serve as a useful tool to study the structure and function of higher eukaryotic chromosomes. Human satellite DNA-based artificial chromosomes containing amplified satellite DNA, rDNA, and exogenous DNA sequences were heterochromatic; nevertheless, they provided a suitable chromosomal environment for the expression of the integrated exogenous genetic material. We demonstrate that induced de novo chromosome formation is a reproducible and effective methodology for generating artificial chromosomes from predictable sequences in different mammalian species. Satellite DNA-based artificial chromosomes formed by induced large-scale amplification on the short arm of human acrocentric chromosomes may become safe or low-risk vectors in gene therapy.


Genetics ◽  
2021 ◽  
Author(s):  
Leslie A Mitchell ◽  
Laura H McCulloch ◽  
Sudarshan Pinglay ◽  
Henri Berger ◽  
Nazario Bosco ◽  
...  

Abstract
Design and large-scale synthesis of DNA have been applied to the functional study of viral and microbial genomes. New and expanded technology development is required to unlock the transformative potential of such bottom-up approaches for the study of larger mammalian genomes. Two major challenges are assembling and delivering long DNA sequences. Here we describe a workflow for de novo DNA assembly and delivery that enables functional evaluation of mammalian genes on the length scale of 100 kilobase pairs (kb). The DNA assembly step is supported by an integrated robotic workcell. We demonstrate assembly of the 101 kb human HPRT1 gene in yeast from 3 kb building blocks, precision delivery of the resulting construct to mouse embryonic stem cells, and subsequent expression of the human protein from its full-length human gene in mouse cells. This workflow provides a framework for mammalian genome writing. We envision utility in producing designer variants of human genes linked to disease and their delivery and functional analysis in cell culture or animal models.


2013 ◽  
Author(s):  
Giuseppe Narzisi ◽  
Jason A. O’Rawe ◽  
Ivan Iossifov ◽  
Han Fang ◽  
Yoon-ha Lee ◽  
...  

Abstract
We present a new open-source algorithm, Scalpel, for sensitive and specific discovery of INDELs in exome-capture data. By combining the power of mapping and assembly, Scalpel searches the de Bruijn graph for sequence paths (contigs) that span each exon. The algorithm creates a single path for exons with no INDEL, two paths for an exon with a heterozygous mutation, and multiple paths for more exotic variations. A detailed repeat-composition analysis coupled with a self-tuning k-mer strategy allows Scalpel to outperform other state-of-the-art approaches for INDEL discovery. We extensively compared Scalpel on a battery of >10,000 simulated and >1,000 experimentally validated INDELs between 1 and 100 bp against two recent algorithms for INDEL discovery: GATK HaplotypeCaller and SOAPindel. We report anomalies in these tools' ability to detect INDELs, especially in regions containing near-perfect repeats, which contribute to high false-positive rates. In contrast, Scalpel demonstrates superior specificity while maintaining high sensitivity. We also present a large-scale application of Scalpel for detecting de novo and transmitted INDELs in 593 families with autistic children from the Simons Simplex Collection. Scalpel demonstrates enhanced power to detect long (≥20 bp) transmitted events, and strengthens previous reports of enrichment for de novo likely gene-disrupting INDEL mutations in children with autism, with many new candidate genes. The source code and documentation for the algorithm are available at http://scalpel.sourceforge.net.
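The path-search idea described above can be sketched in a few lines: build a de Bruijn graph from the reads covering a region, then enumerate all simple source-to-sink paths; one path suggests no variant, two paths a heterozygous site. This is a toy sketch under simplifying assumptions (error-free k-mers, no repeat analysis, hypothetical function names), not Scalpel itself.

```python
from collections import defaultdict

def build_graph(reads, k):
    """de Bruijn graph: nodes are (k-1)-mers, edges are observed k-mers."""
    edges = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            edges[kmer[:-1]].add(kmer[1:])
    return edges

def enumerate_paths(edges, source, sink):
    """Depth-first enumeration of all simple paths from source to sink;
    each path is spelled out as a contig string."""
    paths, stack = [], [(source, [source])]
    while stack:
        node, path = stack.pop()
        if node == sink:
            paths.append(path[0] + "".join(p[-1] for p in path[1:]))
            continue
        for nxt in edges.get(node, ()):
            if nxt not in path:  # keep paths simple (no cycles)
                stack.append((nxt, path + [nxt]))
    return paths

# Two reads differing at one base yield two paths: a heterozygous site.
edges = build_graph(["AAACGTTT", "AAATGTTT"], k=4)
print(sorted(enumerate_paths(edges, "AAA", "TTT")))
# ['AAACGTTT', 'AAATGTTT']
```

In a real caller, the spelled contigs would then be aligned back to the reference to type the variant; here the two contigs directly expose the C/T difference.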


2012 ◽  
Vol 23 (02) ◽  
pp. 249-259
Author(s):  
COSTAS S. ILIOPOULOS ◽  
MIRKA MILLER ◽  
SOLON P. PISSIS

One of the most ambitious trends in current biomedical research is the large-scale genomic sequencing of patients. Novel high-throughput (or next-generation) sequencing technologies have redefined the way genome sequencing is performed. They are able to produce millions of short sequences (reads) in a single experiment, at a much lower cost than previously possible. Because of this massive amount of data, efficient algorithms for mapping these sequences to a reference genome are in great demand, and many such algorithms have recently been published. One important feature of these algorithms is support for multithreaded parallel computing to speed up the mapping process. In this paper, we design parallel algorithms, based on the message-passing parallelism model, to address this problem efficiently. The proposed algorithms also take into consideration the probability scores assigned to each base for occurring in a specific position of a sequence. In particular, we present parallel algorithms for mapping short degenerate and weighted DNA sequences to a reference genome.
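As a minimal stand-in for the parallel-mapping idea (using Python's multiprocessing pool rather than message passing, and plain exact matching rather than the paper's degenerate and weighted matching), reads can be partitioned across workers that each scan the reference independently. The reference string, read set, and function names here are toy assumptions.

```python
from multiprocessing import Pool

# Toy reference; a real mapper would use an index, not linear scans.
REF = "ACGTACGTTAGC"

def map_read(read):
    """Naive exact matcher: report all occurrences of read in REF.
    In the message-passing model, chunks of reads would instead be
    scattered to ranks and hit lists gathered back at the root."""
    hits, start = [], 0
    while (pos := REF.find(read, start)) != -1:
        hits.append(pos)
        start = pos + 1
    return read, hits

if __name__ == "__main__":
    reads = ["ACGT", "TAGC"]
    # Each worker maps its share of the reads independently.
    with Pool(2) as pool:
        for read, hits in pool.map(map_read, reads):
            print(read, hits)  # ACGT [0, 4] / TAGC [8]
```

The work partitions cleanly because mapping one read never depends on another, which is what makes both thread- and message-based parallelization effective here.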


2014 ◽  
Author(s):  
Rebecca R Murphy ◽  
Jared M O'Connell ◽  
Anthony J Cox ◽  
Ole B Schulz-Trieglaff

Scaffolding errors and incorrect traversals of the de Bruijn graph during de novo assembly can result in large scale misassemblies in draft genomes. Nextera mate pair sequencing data provide additional information to resolve assembly ambiguities during scaffolding. Here, we introduce NxRepair, an open source toolkit for error correction in de novo assemblies that uses Nextera mate pair libraries to identify and correct large-scale errors. We show that NxRepair can identify and correct large scaffolding errors, without use of a reference sequence, resulting in quantitative improvements in the assembly quality. NxRepair can be downloaded from GitHub; a tutorial and user documentation are also available.


2014 ◽  
Author(s):  
Lin Huang ◽  
Bo Wang ◽  
Ruitang Chen ◽  
Sivan Bercovici ◽  
Serafim Batzoglou

Population low-coverage whole-genome sequencing is rapidly emerging as a prominent approach for discovering genomic variation and genotyping a cohort. This approach combines substantially lower cost than full-coverage sequencing with whole-genome discovery of low-allele-frequency variants, to an extent that is not possible with array genotyping or exome sequencing. However, a challenging computational problem arises when attempting to discover variants and genotype the entire cohort. Variant discovery and genotyping are relatively straightforward on a single individual that has been sequenced at high coverage, because the inference decomposes into the independent genotyping of each genomic position for which a sufficient number of confidently mapped reads are available. However, in cases where low-coverage population data are given, the joint inference requires leveraging the complex linkage disequilibrium patterns in the cohort to compensate for sparse and missing data in each individual. The potentially massive computation time for such inference, as well as the missing data that confound low-frequency allele discovery, need to be overcome for this approach to become practical. Here, we present Reveel, a novel method for single nucleotide variant calling and genotyping of large cohorts that have been sequenced at low coverage. Reveel introduces a novel technique for leveraging linkage disequilibrium that deviates from previous Markov-based models. We evaluate Reveel's performance through extensive simulations as well as real data from the 1000 Genomes Project, and show that it achieves higher accuracy in low-frequency allele discovery and substantially lower computation cost than previous state-of-the-art methods.
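To see why low coverage confounds per-individual calling, motivating the joint, LD-aware inference described above, consider binomial genotype likelihoods at a single site: with one read, a homozygous-reference and a heterozygous genotype are nearly indistinguishable. This is a generic textbook sketch, not Reveel's model; the error rate and function name are assumptions.

```python
from math import comb

def genotype_likelihoods(ref_count, alt_count, err=0.01):
    """Binomial likelihoods at one site for genotypes RR, RA, AA,
    given counts of reference- and alternate-supporting reads and a
    per-base error rate (assumed 1% here)."""
    n = ref_count + alt_count
    likelihoods = {}
    for name, p_alt in (("RR", err), ("RA", 0.5), ("AA", 1 - err)):
        likelihoods[name] = comb(n, alt_count) * p_alt**alt_count * (1 - p_alt)**ref_count
    return likelihoods

# One reference read: RR and RA likelihoods differ by less than 2x,
# so the genotype is ambiguous without borrowing information from LD.
print(genotype_likelihoods(1, 0))

# Thirty reads: RA dominates decisively.
print(genotype_likelihoods(15, 14))
```

Joint methods resolve the low-coverage ambiguity by combining these flat per-individual likelihoods with haplotype-sharing (linkage disequilibrium) evidence across the cohort.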



2018 ◽  
Author(s):  
Leslie A. Mitchell ◽  
Laura H. McCulloch ◽  
Sudarshan Pinglay ◽  
Henri Berger ◽  
Nazario Bosco ◽  
...  

Abstract
Design and large-scale synthesis of DNA have been applied to the functional study of viral and microbial genomes. New and expanded technology development is required to unlock the transformative potential of such bottom-up approaches for the study of larger mammalian genomes. Two major challenges include assembling and delivering long DNA sequences. Here we describe a pipeline for de novo DNA assembly and delivery that enables functional evaluation of mammalian genes on the length scale of 100 kb. The DNA assembly step is supported by an integrated robotic workcell. We assembled the 101 kb human HPRT1 gene in yeast, delivered it to mouse embryonic stem cells, and showed expression of the human protein from its full-length gene. This pipeline provides a framework for producing systematic, designer variants of any mammalian gene locus for functional evaluation in cells.

Significance Statement
Mammalian genomes consist of a tiny proportion of relatively well-characterized coding regions and vast swaths of poorly characterized "dark matter" containing critical but much less well-defined regulatory sequences. Given the dominant role of noncoding DNA in common human diseases and traits, the interconnectivity of regulatory elements, and the importance of genomic context, de novo design, assembly, and delivery can enable large-scale manipulation of these elements on a locus scale. Here we outline a pipeline for de novo assembly, delivery and expression of mammalian genes replete with native regulatory sequences. We expect this pipeline will be useful for dissecting the function of non-coding sequence variation in mammalian genomes.


2016 ◽  
Vol 44 (3) ◽  
pp. 702-708 ◽  
Author(s):  
Nicola J. Patron

Synthetic biology aims to apply engineering principles to the design and modification of biological systems and to the construction of biological parts and devices. The ability to programme cells by providing new instructions written in DNA is a foundational technology of the field. Large-scale de novo DNA synthesis has accelerated synthetic biology by offering custom-made molecules at ever decreasing costs. However, for large fragments, and for experiments in which libraries of DNA sequences are assembled in different combinations, assembly in the laboratory is still desirable. Biological assembly standards allow DNA parts, even those from multiple laboratories and experiments, to be assembled together using the same reagents and protocols. The adoption of such standards for plant synthetic biology has had a cohesive effect on the plant science community, facilitating the application of genome-editing technologies to plant systems and streamlining progress in large-scale, multi-laboratory bioengineering projects.

