Kohdista: an efficient method to index and query possible Rmap alignments

Abstract Background: Genome-wide optical maps are ordered high-resolution restriction maps that give the positions of restriction cut sites corresponding to one or more restriction enzymes. These genome-wide optical maps are assembled using an overlap-layout-consensus approach from raw optical map data, which are referred to as Rmaps. Due to the high error rate of Rmap data, finding the overlap between Rmaps remains challenging. Results: We present Kohdista, an index-based algorithm for finding pairwise alignments between single-molecule maps (Rmaps). The novelty of our approach is the formulation of the alignment problem as automaton path matching and the application of modern index-based data structures. In particular, we combine the Generalized Compressed Suffix Array (GCSA) index with the wavelet tree to build Kohdista. We validate Kohdista on simulated E. coli data, showing that the approach successfully finds alignments between Rmaps simulated from overlapping genomic regions. Conclusion: We demonstrate that Kohdista is the only method capable of finding a significant number of high-quality pairwise Rmap alignments for large eukaryotic organisms in reasonable time.
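To make the overlap problem concrete, the following minimal sketch treats an Rmap as an ordered list of cut-site positions, converts it to fragment sizes, and looks for a suffix of one Rmap that matches a prefix of another within a relative sizing tolerance. This is a naive brute-force baseline for illustration only, not Kohdista's automaton-based index; the function names, the 10% tolerance, and the minimum of three matching fragments are all assumptions, and the sketch ignores missing or spurious cut sites, which real Rmap data also contain.

```python
from typing import List, Optional, Tuple


def fragment_sizes(cut_sites: List[float]) -> List[float]:
    """Convert ordered cut-site positions (kbp) into fragment sizes (kbp)."""
    return [b - a for a, b in zip(cut_sites, cut_sites[1:])]


def sizes_match(a: float, b: float, rel_tol: float = 0.1) -> bool:
    """Hypothetical sizing-error model: sizes agree within a relative tolerance."""
    return abs(a - b) <= rel_tol * max(a, b)


def best_overlap(
    x: List[float],
    y: List[float],
    min_frags: int = 3,
    rel_tol: float = 0.1,
) -> Optional[Tuple[int, int]]:
    """Find the longest suffix of x that matches a prefix of y.

    Returns (offset into x, number of matched fragments), or None.
    Offsets are tried in increasing order, so the first hit is the longest.
    """
    for offset in range(len(x)):
        n = min(len(x) - offset, len(y))
        if n < min_frags:
            break  # remaining overlaps would be too short to trust
        if all(sizes_match(x[offset + i], y[i], rel_tol) for i in range(n)):
            return offset, n
    return None


if __name__ == "__main__":
    x = fragment_sizes([0, 10, 25, 31, 50, 62])
    y = [6.2, 18.5, 12.4, 30.0]
    print(best_overlap(x, y))
```

Comparing every pair of Rmaps this way takes time quadratic in the number of Rmaps, which is the cost an index-based approach is meant to avoid.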


Fast and efficient Rmap assembly using the Bi-labelled de Bruijn graph

Algorithms for Molecular Biology ◽

10.1186/s13015-021-00182-9 ◽

2021 ◽

Vol 16 (1) ◽

Author(s):

Kingshuk Mukherjee ◽

Massimiliano Rossi ◽

Leena Salmela ◽

Christina Boucher

Keyword(s):

Single Molecule ◽

De Bruijn Graph ◽

Anabas Testudineus ◽

E Coli ◽

Genome Wide ◽

A Genome ◽

De Bruijn ◽

Optical Maps ◽

Definition Of ◽

Numeric Representation

Abstract Genome-wide optical maps are high-resolution restriction maps that give a unique numeric representation to a genome. They are produced by assembling hundreds of thousands of single-molecule optical maps, which are called Rmaps. Unfortunately, there are very few choices for assembling Rmap data. There exists only one publicly available non-proprietary method for assembly and one proprietary software package that is available as an executable. Furthermore, the publicly available method, by Valouev et al. (Proc Natl Acad Sci USA 103(43):15770–15775, 2006), follows the overlap-layout-consensus (OLC) paradigm and therefore cannot scale to relatively large genomes. The algorithm behind the proprietary method, Bionano Genomics’ Solve, is largely unknown. In this paper, we extend the definition of bi-labels in the paired de Bruijn graph to the context of optical mapping data, and present the first de Bruijn graph-based method for Rmap assembly. We implement our approach, which we refer to as rmapper, and compare its performance against the assembler of Valouev et al. (Proc Natl Acad Sci USA 103(43):15770–15775, 2006) and Solve by Bionano Genomics on data from three genomes: E. coli, human, and climbing perch fish (Anabas testudineus). Our method ran successfully on all three genomes. The method of Valouev et al. (Proc Natl Acad Sci USA 103(43):15770–15775, 2006) ran successfully only on E. coli. Moreover, on the human genome rmapper was at least 130 times faster than Bionano Solve, used five times less memory, and produced the highest genome fraction with zero mis-assemblies. Our software, rmapper, is written in C++ and is publicly available under the GNU General Public License at https://github.com/kingufl/Rmapper.


Fast and Efficient Rmap Assembly using the Bi-labelled de Bruijn Graph

10.21203/rs.3.rs-151901/v1 ◽

2021 ◽

Author(s):

Kingshuk Mukherjee ◽

Massimiliano Rossi ◽

Leena Salmela ◽

Christina Boucher

Keyword(s):

Single Molecule ◽

De Bruijn Graph ◽

Anabas Testudineus ◽

E Coli ◽

Genome Wide ◽

A Genome ◽

De Bruijn ◽

Optical Maps ◽

Definition Of ◽

Numeric Representation

Abstract Genome-wide optical maps are high-resolution restriction maps that give a unique numeric representation to a genome. They are produced by assembling hundreds of thousands of single-molecule optical maps, which are called Rmaps. Unfortunately, there are very few choices for assembling Rmap data. There exists only one publicly available non-proprietary method for assembly and one proprietary method that is available as an executable. Furthermore, the publicly available method, by Valouev et al. (2006), follows the overlap-layout-consensus (OLC) paradigm and therefore cannot scale to relatively large genomes. The algorithm behind the proprietary method, Bionano Genomics' Solve, is largely unknown. In this paper, we extend the definition of bi-labels in the paired de Bruijn graph to the context of optical mapping data, and present the first de Bruijn graph-based method for Rmap assembly. We implement our approach, which we refer to as Rmapper, and compare its performance against the assembler of Valouev et al. (2006) and Solve by Bionano Genomics on data from three genomes: E. coli, human, and climbing perch fish (Anabas testudineus). Our method ran successfully on all three genomes. The method of Valouev et al. (2006) ran successfully only on E. coli. Moreover, on the human genome Rmapper was at least 130 times faster than Bionano Solve, used five times less memory, and produced the highest genome fraction with zero mis-assemblies. Our software, Rmapper, is written in C++ and is publicly available under the GNU General Public License at https://github.com/kingufl/Rmapper.


Genome-Wide Identification of 5-Methylcytosine Sites in Bacterial Genomes By High-Throughput Sequencing of MspJI Restriction Fragments

10.1101/2021.02.10.430591 ◽

2021 ◽

Author(s):

Brian P. Anton ◽

Alexey Fomenkov ◽

Victoria Wu ◽

Richard J. Roberts

Keyword(s):

Single Molecule ◽

Dna Sequences ◽

High Throughput Sequencing ◽

Cost Effective ◽

Restriction Enzymes ◽

Specific Sequence ◽

Genome Wide ◽

Cost Effective Alternative ◽

Simple Column ◽

Sequencing Platforms

Abstract Single-molecule Real-Time (SMRT) sequencing can easily identify sites of N6-methyladenine and N4-methylcytosine within DNA sequences, but similar identification of 5-methylcytosine sites is not as straightforward. In prokaryotic DNA, methylation typically occurs within specific sequence contexts, or motifs, that are a property of the methyltransferases that “write” these epigenetic marks. We present here a straightforward, cost-effective alternative to both SMRT and bisulfite sequencing for the determination of prokaryotic 5-methylcytosine methylation motifs. The method, called MFRE-Seq, relies on excision and isolation of fully methylated fragments of predictable size using MspJI-Family Restriction Enzymes (MFREs), which depend on the presence of 5-methylcytosine for cleavage. We demonstrate that MFRE-Seq is compatible with both Illumina and Ion Torrent sequencing platforms and requires only a digestion step and simple column purification of size-selected digest fragments prior to standard library preparation procedures. We applied MFRE-Seq to numerous bacterial and archaeal genomic DNA preparations and successfully confirmed known motifs and identified novel ones. This method should be a useful complement to existing methodologies for studying prokaryotic methylomes and characterizing the contributing methyltransferases.


Genome-wide identification of 5-methylcytosine sites in bacterial genomes by high-throughput sequencing of MspJI restriction fragments

PLoS ONE ◽

10.1371/journal.pone.0247541 ◽

2021 ◽

Vol 16 (5) ◽

pp. e0247541

Author(s):

Brian P. Anton ◽

Alexey Fomenkov ◽

Victoria Wu ◽

Richard J. Roberts

Keyword(s):

Single Molecule ◽

Dna Sequences ◽

High Throughput Sequencing ◽

Cost Effective ◽

Restriction Enzymes ◽

Specific Sequence ◽

Genome Wide ◽

Cost Effective Alternative ◽

Simple Column ◽

Sequencing Platforms

Single-molecule Real-Time (SMRT) sequencing can easily identify sites of N6-methyladenine and N4-methylcytosine within DNA sequences, but similar identification of 5-methylcytosine sites is not as straightforward. In prokaryotic DNA, methylation typically occurs within specific sequence contexts, or motifs, that are a property of the methyltransferases that “write” these epigenetic marks. We present here a straightforward, cost-effective alternative to both SMRT and bisulfite sequencing for the determination of prokaryotic 5-methylcytosine methylation motifs. The method, called MFRE-Seq, relies on excision and isolation of fully methylated fragments of predictable size using MspJI-Family Restriction Enzymes (MFREs), which depend on the presence of 5-methylcytosine for cleavage. We demonstrate that MFRE-Seq is compatible with both Illumina and Ion Torrent sequencing platforms and requires only a digestion step and simple column purification of size-selected digest fragments prior to standard library preparation procedures. We applied MFRE-Seq to numerous bacterial and archaeal genomic DNA preparations and successfully confirmed known motifs and identified novel ones. This method should be a useful complement to existing methodologies for studying prokaryotic methylomes and characterizing the contributing methyltransferases.


Finding Overlapping Rmaps via Gaussian Mixture Model Clustering

10.1101/2021.07.16.452722 ◽

2021 ◽

Author(s):

Kingshuk Mukherjee ◽

Massimiliano Rossi ◽

Daniel Dole-Muinos ◽

Ayomide Ajayi ◽

Mattia Prosperi ◽

...

Keyword(s):

Single Molecule ◽

Optical Mapping ◽

Gaussian Mixture ◽

Restriction Maps ◽

Entire Genome ◽

Cpu Time ◽

Genome Wide ◽

Genomic Location ◽

Optical Maps ◽

Mixture Model Clustering

Optical mapping is a method for creating high-resolution restriction maps of an entire genome. Optical mapping has been largely automated, and first produces single-molecule restriction maps, called Rmaps, which are assembled to generate genome-wide optical maps. Since the location and orientation of each Rmap is unknown, the first problem in the analysis of this data is finding related Rmaps, i.e., pairs of Rmaps that share the same orientation and have significant overlap in their genomic location. Although heuristics for identifying related Rmaps exist, they all require quantization of the data, which leads to a loss of precision. In this paper, we propose a method based on Gaussian mixture model clustering, which we refer to as OMclust, that finds overlapping Rmaps without quantization. Using both simulated and real datasets, we show that OMclust substantially improves the precision (from 48.3% to 73.3%) over the state-of-the-art methods while also reducing CPU time and memory consumption. Further, we integrated OMclust into the error correction methods Elmeri and cOMet to demonstrate the increase in the performance of these methods. When OMclust was combined with cOMet to error correct Rmap data generated from human DNA, it was able to error correct close to 3x more Rmaps and reduced the CPU time by more than 35x. Our software is written in C++ and is publicly available under the GNU General Public License at https://github.com/kingufl/OMclust
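The precision loss attributed to quantization can be illustrated with a small sketch. It bins fragment sizes into fixed-width buckets and shows that two nearly equal sizes can land in different bins while two clearly different sizes share one. The bin width, tolerance, and function names are hypothetical choices for illustration; this is not the quantization scheme of any particular tool, nor OMclust's clustering method.

```python
def quantize(size_kbp: float, bin_width: float = 1.0) -> int:
    """Map a fragment size to the index of its fixed-width bin (hypothetical scheme)."""
    return int(size_kbp // bin_width)


def within_tolerance(a: float, b: float, abs_tol: float = 0.1) -> bool:
    """Direct comparison of the unquantized sizes."""
    return abs(a - b) <= abs_tol


if __name__ == "__main__":
    # Nearly identical sizes that straddle a bin boundary.
    a, b = 1.98, 2.02
    print(quantize(a) == quantize(b), within_tolerance(a, b))

    # Clearly different sizes that fall into the same bin.
    c, d = 2.02, 2.95
    print(quantize(c) == quantize(d), within_tolerance(c, d))
```

In the first pair the binned comparison misses a true match, and in the second it reports a false one, which is the kind of error a method that works on the raw sizes avoids.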


Use of NAD tagSeq II to identify growth phase-dependent alterations in E. coli RNA NAD+ capping

Proceedings of the National Academy of Sciences ◽

10.1073/pnas.2026183118 ◽

2021 ◽

Vol 118 (14) ◽

pp. e2026183118

Author(s):

Hailei Zhang ◽

Huan Zhong ◽

Xufeng Wang ◽

Shoudong Zhang ◽

Xiaojian Shao ◽

...

Keyword(s):

Single Molecule ◽

Copper Ions ◽

Stationary Phases ◽

Regulate Gene Expression ◽

E Coli ◽

Genome Wide ◽

Highly Expressed Genes ◽

Transcript Profiles ◽

Rna Capping ◽

Free Strain

Recent findings regarding nicotinamide adenine dinucleotide (NAD+)-capped RNAs (NAD-RNAs) indicate that prokaryotes and eukaryotes employ noncanonical RNA capping to regulate gene expression. Two methods for transcriptome-wide analysis of NAD-RNAs, NAD captureSeq and NAD tagSeq, are based on copper-catalyzed azide-alkyne cycloaddition (CuAAC) click chemistry to label NAD-RNAs. However, copper ions can fragment/degrade RNA, interfering with the analyses. Here we report development of NAD tagSeq II, which uses copper-free, strain-promoted azide-alkyne cycloaddition (SPAAC) for labeling NAD-RNAs, followed by identification of tagged RNA by single-molecule direct RNA sequencing. We used this method to compare NAD-RNA and total transcript profiles of Escherichia coli cells in the exponential and stationary phases. We identified hundreds of NAD-RNA species in E. coli and revealed genome-wide alterations of NAD-RNA profiles in the different growth phases. Although no or few NAD-RNAs were detected from some of the most highly expressed genes, the transcripts of some genes were found to be primarily NAD-RNAs. Our study suggests that NAD-RNAs play roles in linking nutrient cues with gene regulation in E. coli.


Detecting Large Indels Using Optical Map Data

10.1101/382986 ◽

2018 ◽

Author(s):

Xian Fan ◽

Jie Xu ◽

Luay Nakhleh

Keyword(s):

Complete Characterization ◽

False Discovery ◽

Lower False Discovery Rate ◽

Optical Map ◽

Local Assembly ◽

Complex Events ◽

Large Indels ◽

Optical Maps ◽

Map Data

Abstract Optical Maps (OM) provide reads that are very long, and thus can be used to detect large indels not detectable by the shorter reads provided by sequence-based technologies such as Illumina and PacBio. Two existing tools for detecting large indels from OM data are BioNano Solve and OMSV. However, these two tools may miss indels with weak signals. We propose a local-assembly-based approach, OMIndel, to detect large indels with OM data. The results of applying OMIndel to empirical data demonstrate that it is able to detect indels with weak signals. Furthermore, compared with the other two OM-based methods, OMIndel has a lower false discovery rate. We also investigated the indels that can be detected only by OM and not by Illumina, PacBio or 10X, and we found that they mostly fall into two categories: complex events or indels in repetitive regions. This implies that adding OM data to sequence-based technologies can provide significant progress towards a more complete characterization of structural variants (SVs). The algorithm has been implemented in Perl and is publicly available at https://bitbucket.org/xianfan/optmethod.


Genome-wide epigenetic profiling of 5-hydroxymethylcytosine by long-read optical mapping

10.1101/260166 ◽

2018 ◽

Cited By ~ 1

Author(s):

Tslil Gabrieli ◽

Hila Sharim ◽

Gil Nifker ◽

Jonathan Jeffet ◽

Tamar Shahal ◽

...

Keyword(s):

Long Range ◽

Single Molecule ◽

Human Peripheral Blood ◽

Read Length ◽

Epigenetic Mark ◽

Sequencing Data ◽

Chromosomal Dna ◽

Genome Wide ◽

Long Read ◽

Genomic Regions

Abstract The epigenetic mark 5-hydroxymethylcytosine (5-hmC) is a distinct product of active enzymatic demethylation that is linked to gene regulation, development and disease. Genome-wide 5-hmC profiles generated by short-read next-generation sequencing are limited in providing long-range epigenetic information relevant to highly variable genomic regions, such as the 3.7 Mbp disease-related Human Leukocyte Antigen (HLA) region. We present a long-read, single-molecule mapping technology that generates hybrid genetic/epigenetic profiles of native chromosomal DNA. The genome-wide distribution of 5-hmC in human peripheral blood cells correlates well with 5-hmC DNA immunoprecipitation (hMeDIP) sequencing. However, the long read length of 100 kbp to 1 Mbp produces 5-hmC profiles across variable genomic regions that failed to show up in the sequencing data. In addition, optical 5-hmC mapping shows strong correlation between the 5-hmC density in gene bodies and the corresponding level of gene expression. The single-molecule concept provides information on the distribution and coexistence of 5-hmC signals at multiple genomic loci on the same genomic DNA molecule, revealing long-range correlations and cell-to-cell epigenetic variation.


SMRT-Cappable-seq reveals complex operon variants in bacteria

10.1101/262964 ◽

2018 ◽

Cited By ~ 1

Author(s):

Bo Yan ◽

Matthew Boitano ◽

Tyson Clark ◽

Laurence Ettwiller

Keyword(s):

Single Molecule ◽

Gene Network ◽

Accurate Identification ◽

E Coli ◽

Short Read Sequencing ◽

Genome Wide ◽

Long Read ◽

Definition Of ◽

Bacterial Genes

Abstract Current methods for genome-wide analysis of gene expression require shredding original transcripts into small fragments for short-read sequencing. In bacteria, the resulting fragmented information hides operon complexity. Additionally, in vivo processing of transcripts confounds the accurate identification of the 5’ and 3’ ends of operons. Here we developed a novel methodology called SMRT-Cappable-seq that combines the isolation of unfragmented primary transcripts with single-molecule long-read sequencing. Applied to E. coli, this technology results in an unprecedented definition of the transcriptome, with 34% of the known operons being extended by at least one gene. Furthermore, 40% of transcription termination sites have read-through that alters the gene content of the operons. As a result, most of the bacterial genes are present in multiple operon variants, reminiscent of eukaryotic splicing. By providing an unprecedented granularity in the operon structure, this study represents an important resource for the study of prokaryotic gene networks and regulation.


Faculty Opinions recommendation of Genome-wide mapping of methylated adenine residues in pathogenic Escherichia coli using single-molecule real-time sequencing.

Faculty Opinions – Post-Publication Peer Review of the Biomedical Literature ◽

10.3410/f.717963699.793465590 ◽

2012 ◽

Author(s):

Martin Marinus

Keyword(s):

Escherichia Coli ◽

Real Time ◽

Single Molecule ◽

Genome Wide ◽

Pathogenic Escherichia Coli
