Fast and efficient Rmap assembly using the Bi-labelled de Bruijn graph

Abstract Genome-wide optical maps are high-resolution restriction maps that give a unique numeric representation to a genome. They are produced by assembling hundreds of thousands of single-molecule optical maps, which are called Rmaps. Unfortunately, there are very few choices for assembling Rmap data: only one publicly available non-proprietary method and one proprietary tool that is available as an executable. Furthermore, the publicly available method, by Valouev et al. (Proc Natl Acad Sci USA 103(43):15770–15775, 2006), follows the overlap-layout-consensus (OLC) paradigm and is therefore unable to scale to relatively large genomes. The algorithm behind the proprietary method, Bionano Genomics’ Solve, is largely unknown. In this paper, we extend the definition of bi-labels in the paired de Bruijn graph to the context of optical mapping data, and present the first de Bruijn graph-based method for Rmap assembly. We implement our approach, which we refer to as rmapper, and compare its performance against the assembler of Valouev et al. (Proc Natl Acad Sci USA 103(43):15770–15775, 2006) and Solve by Bionano Genomics on data from three genomes: E. coli, human, and climbing perch fish (Anabas testudineus). Our method ran successfully on all three genomes, whereas the method of Valouev et al. (Proc Natl Acad Sci USA 103(43):15770–15775, 2006) ran successfully only on E. coli. Moreover, on the human genome rmapper was at least 130 times faster than Bionano Solve, used five times less memory, and produced the highest genome fraction with zero mis-assemblies. Our software, rmapper, is written in C++ and is publicly available under the GNU General Public License at https://github.com/kingufl/Rmapper.

Fast and Efficient Rmap Assembly using the Bi-labelled de Bruijn Graph

10.21203/rs.3.rs-151901/v1 ◽

2021 ◽

Author(s):

Kingshuk Mukherjee ◽

Massimiliano Rossi ◽

Leena Salmela ◽

Christina Boucher

Keyword(s):

Single Molecule ◽

De Bruijn Graph ◽

Anabas Testudineus ◽

E Coli ◽

Genome Wide ◽

A Genome ◽

De Bruijn ◽

Optical Maps ◽

Definition Of ◽

Numeric Representation

Abstract Genome-wide optical maps are high-resolution restriction maps that give a unique numeric representation to a genome. They are produced by assembling hundreds of thousands of single-molecule optical maps, which are called Rmaps. Unfortunately, there are very few choices for assembling Rmap data: only one publicly available non-proprietary method and one proprietary method that is available as an executable. Furthermore, the publicly available method, by Valouev et al. (2006), follows the overlap-layout-consensus (OLC) paradigm and is therefore unable to scale to relatively large genomes. The algorithm behind the proprietary method, Bionano Genomics' Solve, is largely unknown. In this paper, we extend the definition of bi-labels in the paired de Bruijn graph to the context of optical mapping data, and present the first de Bruijn graph-based method for Rmap assembly. We implement our approach, which we refer to as Rmapper, and compare its performance against the assembler of Valouev et al. (2006) and Solve by Bionano Genomics on data from three genomes: E. coli, human, and climbing perch fish (Anabas testudineus). Our method ran successfully on all three genomes, whereas the method of Valouev et al. (2006) ran successfully only on E. coli. Moreover, on the human genome Rmapper was at least 130 times faster than Bionano Solve, used five times less memory, and produced the highest genome fraction with zero mis-assemblies. Our software, Rmapper, is written in C++ and is publicly available under the GNU General Public License at https://github.com/kingufl/Rmapper.
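To make the idea of a de Bruijn graph over Rmap data more concrete, here is a minimal Python sketch. It is not the Rmapper algorithm: it does not model bi-labels or any error correction, and the function names, the `bin_size` quantization, and the toy fragment sizes are all assumptions introduced for illustration. It only shows how ordered fragment sizes, once rounded to a common grid, can be cut into k-mers that link (k-1)-mer nodes.

```python
from collections import defaultdict


def quantize(rmap, bin_size=1000):
    """Round each fragment size (in bp) to the nearest multiple of bin_size.

    Hypothetical helper: real Rmap sizing error needs a more careful model.
    """
    return tuple(int(round(size / bin_size)) * bin_size for size in rmap)


def build_de_bruijn(rmaps, k=3, bin_size=1000):
    """Map each (k-1)-mer of quantized fragment sizes to its successor (k-1)-mers."""
    graph = defaultdict(set)
    for rmap in rmaps:
        q = quantize(rmap, bin_size)
        for i in range(len(q) - k + 1):
            kmer = q[i:i + k]
            # Edge from the k-mer's prefix node to its suffix node.
            graph[kmer[:-1]].add(kmer[1:])
    return dict(graph)


# Two toy Rmaps (fragment sizes in bp) that overlap after quantization.
rmaps = [
    [4980, 12100, 7050, 3010],
    [12020, 6950, 2990, 9100],
]
graph = build_de_bruijn(rmaps, k=3)
for node, successors in sorted(graph.items()):
    print(node, "->", sorted(successors))
```

In this toy input the two Rmaps share the quantized k-mer (12000, 7000, 3000), so they contribute the same edge and a walk through the graph spans both molecules. Simple rounding like this breaks down when a size falls near a bin boundary, which is one reason a practical method needs more than plain quantization.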

Aligning optical maps to de Bruijn graphs

Bioinformatics ◽

10.1093/bioinformatics/btz069 ◽

2019 ◽

Vol 35 (18) ◽

pp. 3250-3256 ◽

Cited By ~ 1

Author(s):

Kingshuk Mukherjee ◽

Bahar Alipanahi ◽

Tamer Kahveci ◽

Leena Salmela ◽

Christina Boucher

Keyword(s):

Single Molecule ◽

Genome Assembly ◽

Sequence Data ◽

Supplementary Information ◽

De Bruijn Graph ◽

Structural Variations ◽

Regular Feature ◽

A Genome ◽

De Bruijn ◽

Optical Maps

Abstract Motivation: Optical maps are high-resolution restriction maps (Rmaps) that give a unique numeric representation to a genome. Used in concert with sequence reads, they provide a useful tool for genome assembly and for discovering structural variations and rearrangements. Although they have been a regular feature of modern genome assembly projects, optical maps have mainly been used in a post-processing step and not in the genome assembly process itself. Several methods have been proposed for pairwise alignment of single-molecule optical maps (called Rmaps) or for aligning optical maps to assembled reads. However, the problem of aligning an Rmap to a graph representing the sequence data of the same genome has not been studied before. Such an alignment provides a mapping between two sets of data, optical maps and sequence data, which will facilitate the use of optical maps in the sequence assembly step itself. Results: We define the problem of aligning an Rmap to a de Bruijn graph and present the first algorithm for solving this problem, which is based on a seed-and-extend approach. We demonstrate that our method is capable of aligning 73% of Rmaps generated from the Escherichia coli genome to the de Bruijn graph constructed from short reads generated from the same genome. We validate the alignments and show that our method achieves an accuracy of 99.6%. We also show that our method scales to larger genomes. In particular, we show that 76% of Rmaps can be aligned to the de Bruijn graph in the case of human data. Availability and implementation: The software for aligning optical maps to de Bruijn graphs, omGraph, is written in C++ and is publicly available under the GNU General Public License at https://github.com/kingufl/omGraph. Supplementary information: Supplementary data are available at Bioinformatics online.
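The "numeric representation" mentioned above can be illustrated with a small in silico digest: a sequence is reduced to the ordered list of fragment lengths between occurrences of a restriction site. The sketch below is an illustration only and is not part of omGraph. The function name `in_silico_digest` and the toy sequence are hypothetical, the recognition site GAATTC (EcoRI) is just an example choice, and for simplicity the cut is placed at the start of each site rather than at the enzyme's true cut position. Real Rmaps additionally contain sizing errors and missing or spurious cut sites, none of which are modeled here.

```python
def in_silico_digest(sequence, site="GAATTC"):
    """Return the ordered fragment lengths obtained by cutting at each site start.

    Simplification (assumption): the cut is placed at the first base of every
    occurrence of `site`, and the output is error-free.
    """
    cuts = []
    pos = sequence.find(site)
    while pos != -1:
        cuts.append(pos)
        pos = sequence.find(site, pos + 1)
    boundaries = [0] + cuts + [len(sequence)]
    # Skip zero-length fragments, e.g. when the sequence starts with the site.
    return [b - a for a, b in zip(boundaries, boundaries[1:]) if b > a]


# Toy sequence with two recognition sites.
genome = "AAAA" + "GAATTC" + "CCCCCCCC" + "GAATTC" + "TT"
fragments = in_silico_digest(genome)
print(fragments)
```

Applying the same digest to the sequences spelled by paths in a de Bruijn graph would give fragment lists that can be compared with an Rmap, which is the kind of correspondence an Rmap-to-graph alignment has to establish.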

Kohdista: an efficient method to index and query possible Rmap alignments

Algorithms for Molecular Biology ◽

10.1186/s13015-019-0160-9 ◽

2019 ◽

Vol 14 (1) ◽

Cited By ~ 1

Author(s):

Martin D. Muggli ◽

Simon J. Puglisi ◽

Christina Boucher

Keyword(s):

Single Molecule ◽

Restriction Enzymes ◽

Consensus Approach ◽

E Coli ◽

Genome Wide ◽

Alignment Problem ◽

Optical Map ◽

Optical Maps ◽

Map Data ◽

Genomic Regions

Abstract Background: Genome-wide optical maps are ordered high-resolution restriction maps that give the positions of restriction cut sites corresponding to one or more restriction enzymes. These genome-wide optical maps are assembled with an overlap-layout-consensus approach from raw optical map data, which are referred to as Rmaps. Due to the high error rate of Rmap data, finding the overlap between Rmaps remains challenging. Results: We present Kohdista, an index-based algorithm for finding pairwise alignments between single-molecule maps (Rmaps). The novelty of our approach is the formulation of the alignment problem as automaton path matching, and the application of modern index-based data structures. In particular, we combine the Generalized Compressed Suffix Array (GCSA) index with the wavelet tree in order to build Kohdista. We validate Kohdista on simulated E. coli data, showing that the approach successfully finds alignments between Rmaps simulated from overlapping genomic regions. Conclusion: We demonstrate that Kohdista is the only method capable of finding a significant number of high-quality pairwise Rmap alignments for large eukaryotic organisms in reasonable time.

Assembly of Long Error-Prone Reads Using Repeat Graphs

10.1101/247148 ◽

2018 ◽

Cited By ~ 25

Author(s):

Mikhail Kolmogorov ◽

Jeffrey Yuan ◽

Yu Lin ◽

Pavel A. Pevzner

Keyword(s):

Single Molecule ◽

State Of The Art ◽

De Bruijn Graph ◽

Single Molecule Sequencing ◽

Short Read ◽

Initial Stage ◽

A Genome ◽

De Bruijn ◽

Initial Assembly

Abstract The problem of genome assembly is ultimately linked to the problem of characterizing all repeat families in a genome as a repeat graph. The key reason the de Bruijn graph emerged as a popular short-read assembly approach is that it offered an elegant representation of all repeats in a genome that reveals their mosaic structure. However, most algorithms for assembling long error-prone reads use an alternative overlap-layout-consensus (OLC) approach that does not provide a repeat characterization. We present the Flye algorithm, which constructs the A-Bruijn (assembly) graph from long error-prone reads and, in contrast to approaches based on the k-mer-based de Bruijn graph, assembles genomes using an alignment-based A-Bruijn graph. Unlike existing assemblers, Flye does not attempt to construct accurate contigs (at least at the initial assembly stage) but instead simply generates arbitrary paths in the (unknown) assembly graph and then constructs an assembly graph from these paths. Counter-intuitively, this fast but seemingly reckless approach results in the same graph as the assembly graph constructed from accurate contigs. Flye constructs (overlapping) contigs with possible assembly errors at the initial stage, combines them into an accurate assembly graph, resolves repeats in the assembly graph using small variations between repeat instances that were left unresolved during the initial assembly stage, constructs a new, less tangled assembly graph based on the resolved repeats, and finally outputs accurate contigs as paths in this graph. We benchmark Flye against several state-of-the-art single-molecule sequencing assemblers and demonstrate that it generates better or comparable assemblies for all analyzed datasets.

SMRT-Cappable-seq reveals complex operon variants in bacteria

10.1101/262964 ◽

2018 ◽

Cited By ~ 1

Author(s):

Bo Yan ◽

Matthew Boitano ◽

Tyson Clark ◽

Laurence Ettwiller

Keyword(s):

Single Molecule ◽

Gene Network ◽

Accurate Identification ◽

E Coli ◽

Short Read Sequencing ◽

Genome Wide ◽

Long Read ◽

Definition Of ◽

Bacterial Genes

Abstract Current methods for genome-wide analysis of gene expression require shredding original transcripts into small fragments for short-read sequencing. In bacteria, the resulting fragmented information hides operon complexity. Additionally, in vivo processing of transcripts confounds the accurate identification of the 5’ and 3’ ends of operons. Here we developed a novel methodology called SMRT-Cappable-seq that combines the isolation of unfragmented primary transcripts with single-molecule long-read sequencing. Applied to E. coli, this technology results in an unprecedented definition of the transcriptome, with 34% of the known operons being extended by at least one gene. Furthermore, 40% of transcription termination sites have read-through that alters the gene content of the operons. As a result, most of the bacterial genes are present in multiple operon variants, reminiscent of eukaryotic splicing. By providing an unprecedented granularity in the operon structure, this study represents an important resource for the study of prokaryotic gene networks and regulation.

Elucidating acetate tolerance in E. coli using a genome-wide approach

Metabolic Engineering ◽

10.1016/j.ymben.2010.12.001 ◽

2011 ◽

Vol 13 (2) ◽

pp. 214-224 ◽

Cited By ~ 49

Author(s):

Nicholas R. Sandoval ◽

Tirzah Y. Mills ◽

Min Zhang ◽

Ryan T. Gill

Keyword(s):

E Coli ◽

Genome Wide ◽

A Genome

sPepFinder expedites genome-wide identification of small proteins in bacteria

10.1101/2020.05.05.079178 ◽

2020 ◽

Author(s):

Lei Li ◽

Yanjie Chao

Keyword(s):

De Novo ◽

Bacterial Species ◽

Computational Prediction ◽

Ribosome Profiling ◽

Support Vector ◽

Initiation Rate ◽

E Coli ◽

Small Proteins ◽

Genome Wide ◽

A Genome

Abstract Small proteins shorter than 50 amino acids have long been overlooked. A number of small proteins have been identified in several model bacteria using experimental approaches and assigned important functions in diverse cellular processes. The recent development of ribosome profiling technologies has allowed genome-wide identification of small proteins and small ORFs (smORFs), but our incomplete understanding of small proteins hinders de novo computational prediction of smORFs in non-model bacterial species. Here, we have identified several sequence features of smORFs through a systematic analysis of all the known small proteins in E. coli, among which the translation initiation rate is the strongest determinant. By integrating these features into a support vector machine learning model, we have developed a novel algorithm, sPepFinder, that can predict conserved smORFs in bacterial genomes with a high accuracy of 92.8%. De novo prediction in E. coli has revealed several novel smORFs with evidence of translation supported by ribosome profiling. Further application of sPepFinder to 549 bacterial species has led to the identification of more than 100,000 novel smORFs, many of which are conserved at the amino acid and nucleotide levels under purifying selection. Overall, we have established sPepFinder as a valuable tool to identify novel smORFs in both model and non-model bacterial organisms, and provided a large resource of small proteins for functional characterization.

Each of 3,323 metabolic innovations in the evolution of E. coli arose through the horizontal transfer of a single DNA segment

Proceedings of the National Academy of Sciences ◽

10.1073/pnas.1718997115 ◽

2018 ◽

Vol 116 (1) ◽

pp. 187-192 ◽

Cited By ~ 11

Author(s):

Tin Yau Pang ◽

Martin J. Lercher

Keyword(s):

E Coli ◽

New Genes ◽

Metabolic Adaptations ◽

Genome Wide ◽

A Genome ◽

Individual Strain ◽

History Of ◽

Phenotype Space ◽

Genome Content ◽

Metabolic Models

Even closely related prokaryotes often show an astounding diversity in their ability to grow in different nutritional environments. It has been hypothesized that complex metabolic adaptations (those requiring the independent acquisition of multiple new genes) can evolve via selectively neutral intermediates. However, it is unclear whether this neutral exploration of phenotype space occurs in nature, or what fraction of metabolic adaptations is indeed complex. Here, we reconstruct metabolic models for the ancestors of a phylogeny of 53 Escherichia coli strains, linking genotypes to phenotypes on a genome-wide, macroevolutionary scale. Based on the ancestral and extant metabolic models, we identify 3,323 phenotypic innovations in the history of the E. coli clade that arose through changes in accessory genome content. Of these innovations, 1,998 allow growth in previously inaccessible environments, while 1,325 increase biomass yield. Strikingly, every observed innovation arose through the horizontal acquisition of a single DNA segment less than 30 kb long. Although we found no evidence for the contribution of selectively neutral processes, 10.6% of metabolic innovations were facilitated by horizontal gene transfers on earlier phylogenetic branches, consistent with a stepwise adaptation to successive environments. Ninety-eight percent of metabolic phenotypes accessible to the combined E. coli pangenome can be bestowed on any individual strain by transferring a single DNA segment from one of the extant strains. These results demonstrate an amazing ability of the E. coli lineage to adapt to novel environments through single horizontal gene transfers (followed by regulatory adaptations), an ability likely mirrored in other clades of generalist bacteria.

Live-cell single particle imaging reveals the role of RNA polymerase II in histone H2A.Z eviction

eLife ◽

10.7554/elife.55667 ◽

2020 ◽

Vol 9 ◽

Cited By ~ 5

Author(s):

Anand Ranjan ◽

Vu Q Nguyen ◽

Sheng Liu ◽

Jan Wisniewski ◽

Jee Min Kim ◽

...

Keyword(s):

Rna Polymerase ◽

Rna Polymerase Ii ◽

Single Molecule ◽

Transcription Initiation ◽

General Mechanism ◽

Pol Ii ◽

Promoter Escape ◽

Genome Wide ◽

A Genome ◽

Particle Imaging

The H2A.Z histone variant, a genome-wide hallmark of permissive chromatin, is enriched near transcription start sites in all eukaryotes. H2A.Z is deposited by the SWR1 chromatin remodeler and evicted by unclear mechanisms. We tracked H2A.Z in living yeast at single-molecule resolution, and found that H2A.Z eviction is dependent on RNA Polymerase II (Pol II) and the Kin28/Cdk7 kinase, which phosphorylates Serine 5 of heptapeptide repeats on the carboxy-terminal domain of the largest Pol II subunit Rpb1. These findings link H2A.Z eviction to transcription initiation, promoter escape and early elongation activities of Pol II. Because passage of Pol II through +1 nucleosomes genome-wide would obligate H2A.Z turnover, we propose that global transcription at yeast promoters is responsible for eviction of H2A.Z. Such usage of yeast Pol II suggests a general mechanism coupling eukaryotic transcription to erasure of the H2A.Z epigenetic signal.

Evidence of evolutionary selection for co-translational folding

10.1101/121871 ◽

2017 ◽

Author(s):

William M. Jacobs ◽

Eugene I. Shakhnovich

Keyword(s):

Self Assembly ◽

E Coli ◽

Evolutionary Selection ◽

Fitness Effects ◽

Open Questions ◽

Genome Wide ◽

A Genome ◽

Evolutionarily Conserved ◽

Domain Boundaries ◽

Selection For

Recent experiments and simulations have demonstrated that proteins can fold on the ribosome. However, the extent and generality of fitness effects resulting from co-translational folding remain open questions. Here we report a genome-wide analysis that uncovers evidence of evolutionary selection for co-translational folding. We describe a robust statistical approach to identify loci within genes that are both significantly enriched in slowly translated codons and evolutionarily conserved. Surprisingly, we find that domain boundaries can explain only a small fraction of these conserved loci. Instead, we propose that regions enriched in slowly translated codons are associated with co-translational folding intermediates, which may be smaller than a single domain. We show that the intermediates predicted by a native-centric model of co-translational folding account for the majority of these loci across more than 500 E. coli proteins. By making a direct connection to protein folding, this analysis provides strong evidence that many synonymous substitutions have been selected to optimize translation rates at specific locations within genes. More generally, our results indicate that kinetics, and not just thermodynamics, can significantly alter the efficiency of self-assembly in a biological context.
