Finding Overlapping Rmaps via Gaussian Mixture Model Clustering

2021 ◽  
Author(s):  
Kingshuk Mukherjee ◽  
Massimiliano Rossi ◽  
Daniel Dole-Muinos ◽  
Ayomide Ajayi ◽  
Mattia Prosperi ◽  
...  

Optical mapping is a method for creating high resolution restriction maps of an entire genome. Optical mapping has been largely automated, and first produces single molecule restriction maps, called Rmaps, which are assembled to generate genome wide optical maps. Since the location and orientation of each Rmap is unknown, the first problem in the analysis of this data is finding related Rmaps, i.e., pairs of Rmaps that share the same orientation and have significant overlap in their genomic location. Although heuristics for identifying related Rmaps exist, they all require quantization of the data, which leads to a loss of precision. In this paper, we propose a method based on Gaussian mixture model clustering, which we refer to as OMclust, that finds overlapping Rmaps without quantization. Using both simulated and real datasets, we show that OMclust substantially improves the precision (from 48.3% to 73.3%) over the state-of-the-art methods while also reducing CPU time and memory consumption. Further, we integrated OMclust into the error correction methods (Elmeri and cOMet) to demonstrate the increase in the performance of these methods. When OMclust was combined with cOMet to error correct Rmap data generated from human DNA, it was able to error correct close to 3x more Rmaps, and reduced the CPU time by more than 35x. Our software is written in C++ and is publicly available under GNU General Public License at https://github.com/kingufl/OMclust
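The abstract does not say how OMclust applies the mixture model, so the following is only a generic illustration of the underlying idea: clustering continuous fragment sizes with a one-dimensional Gaussian mixture fitted by expectation-maximization, instead of rounding them into bins. The function names, the toy fragment sizes (in kb), and the choice of two components are all assumptions made for this sketch; it is not the OMclust implementation.

```python
import math


def gaussian_pdf(x, mean, var):
    """Density of a 1-D normal distribution with the given mean and variance."""
    return math.exp(-((x - mean) ** 2) / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


def fit_gmm_1d(values, initial_means, n_iter=50, min_var=1e-3):
    """Fit a 1-D Gaussian mixture with EM, starting from the given means."""
    k = len(initial_means)
    means = list(initial_means)
    variances = [1.0] * k
    weights = [1.0 / k] * k
    for _ in range(n_iter):
        # E-step: responsibility of each component for each value.
        resp = []
        for x in values:
            dens = [w * gaussian_pdf(x, m, v)
                    for w, m, v in zip(weights, means, variances)]
            total = sum(dens) or 1e-300  # guard against total underflow
            resp.append([d / total for d in dens])
        # M-step: re-estimate weight, mean and variance of each component.
        for j in range(k):
            nj = sum(r[j] for r in resp)
            if nj < 1e-12:
                continue  # leave an empty component unchanged
            weights[j] = nj / len(values)
            means[j] = sum(r[j] * x for r, x in zip(resp, values)) / nj
            var = sum(r[j] * (x - means[j]) ** 2 for r, x in zip(resp, values)) / nj
            variances[j] = max(var, min_var)  # floor avoids variance collapse
    return weights, means, variances


def assign_clusters(values, weights, means, variances):
    """Label each value with the index of its most likely component."""
    labels = []
    for x in values:
        dens = [w * gaussian_pdf(x, m, v)
                for w, m, v in zip(weights, means, variances)]
        labels.append(max(range(len(dens)), key=dens.__getitem__))
    return labels


# Hypothetical fragment sizes in kb: two groups, around 2 kb and 10 kb.
sizes = [2.1, 2.3, 1.9, 2.0, 9.8, 10.1, 10.4, 9.9]
w, m, v = fit_gmm_1d(sizes, [min(sizes), max(sizes)])
print(assign_clusters(sizes, w, m, v))  # [0, 0, 0, 0, 1, 1, 1, 1]
```

Because the sizes are never rounded, two fragments of 2.1 kb and 2.3 kb land in the same cluster by likelihood rather than by falling on the same side of a bin boundary.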

2021 ◽  
Vol 16 (1) ◽  
Author(s):  
Kingshuk Mukherjee ◽  
Massimiliano Rossi ◽  
Leena Salmela ◽  
Christina Boucher

Abstract Genome wide optical maps are high resolution restriction maps that give a unique numeric representation to a genome. They are produced by assembling hundreds of thousands of single molecule optical maps, which are called Rmaps. Unfortunately, there are very few choices for assembling Rmap data. There exists only one publicly-available non-proprietary method for assembly and one proprietary software that is available via an executable. Furthermore, the publicly-available method, by Valouev et al. (Proc Natl Acad Sci USA 103(43):15770–15775, 2006), follows the overlap-layout-consensus (OLC) paradigm, and therefore, is unable to scale for relatively large genomes. The algorithm behind the proprietary method, Bionano Genomics’ Solve, is largely unknown. In this paper, we extend the definition of bi-labels in the paired de Bruijn graph to the context of optical mapping data, and present the first de Bruijn graph based method for Rmap assembly. We implement our approach, which we refer to as rmapper, and compare its performance against the assembler of Valouev et al. (Proc Natl Acad Sci USA 103(43):15770–15775, 2006) and Solve by Bionano Genomics on data from three genomes: E. coli, human, and climbing perch fish (Anabas testudineus). Our method was able to successfully run on all three genomes. The method of Valouev et al. (Proc Natl Acad Sci USA 103(43):15770–15775, 2006) only successfully ran on E. coli. Moreover, on the human genome rmapper was at least 130 times faster than Bionano Solve, used five times less memory and produced the highest genome fraction with zero mis-assemblies. Our software, rmapper, is written in C++ and is publicly available under GNU General Public License at https://github.com/kingufl/Rmapper.


2019 ◽  
Vol 14 (1) ◽  
Author(s):  
Martin D. Muggli ◽  
Simon J. Puglisi ◽  
Christina Boucher

Abstract Background Genome-wide optical maps are ordered high-resolution restriction maps that give the position of occurrence of restriction cut sites corresponding to one or more restriction enzymes. These genome-wide optical maps are assembled with an overlap-layout-consensus approach from raw optical map data, which are referred to as Rmaps. Due to the high error-rate of Rmap data, finding the overlap between Rmaps remains challenging. Results We present Kohdista, which is an index-based algorithm for finding pairwise alignments between single molecule maps (Rmaps). The novelty of our approach is the formulation of the alignment problem as automaton path matching, and the application of modern index-based data structures. In particular, we combine the use of the Generalized Compressed Suffix Array (GCSA) index with the wavelet tree in order to build Kohdista. We validate Kohdista on simulated E. coli data, showing the approach successfully finds alignments between Rmaps simulated from overlapping genomic regions. Conclusion We demonstrate Kohdista is the only method that is capable of finding a significant number of high quality pairwise Rmap alignments for large eukaryote organisms in reasonable time.


2021 ◽  
Author(s):  
Kingshuk Mukherjee ◽  
Massimiliano Rossi ◽  
Leena Salmela ◽  
Christina Boucher

Abstract Genome wide optical maps are high resolution restriction maps that give a unique numeric representation to a genome. They are produced by assembling hundreds of thousands of single molecule optical maps, which are called Rmaps. Unfortunately, there exist very few choices for assembling Rmap data. There exists only one publicly-available non-proprietary method for assembly and one proprietary method that is available via an executable. Furthermore, the publicly-available method, by Valouev et al. (2006), follows the overlap-layout-consensus (OLC) paradigm, and therefore, is unable to scale for relatively large genomes. The algorithm behind the proprietary method, Bionano Genomics' Solve, is largely unknown. In this paper, we extend the definition of bi-labels in the paired de Bruijn graph to the context of optical mapping data, and present the first de Bruijn graph based method for Rmap assembly. We implement our approach, which we refer to as Rmapper, and compare its performance against the assembler of Valouev et al. (2006) and Solve by Bionano Genomics on data from three genomes: E. coli, human, and climbing perch fish (Anabas testudineus). Our method was able to successfully run on all three genomes. The method of Valouev et al. (2006) only successfully ran on E. coli. Moreover, on the human genome Rmapper was at least 130 times faster than Bionano Solve, used five times less memory and produced the highest genome fraction with zero mis-assemblies. Our software, Rmapper, is written in C++ and is publicly available under GNU General Public License at https://github.com/kingufl/Rmapper.


Blood ◽  
2012 ◽  
Vol 120 (21) ◽  
pp. 2444-2444
Author(s):  
Aditya Gupta ◽  
Jaehyup Kim ◽  
Chelsea Hope ◽  
Jeff Jensen ◽  
Natalie Callander ◽  
...  

Abstract 2444 Multiple myeloma genomes are characterized by complex structural and numerical abnormalities. Proteasome inhibitors are routinely used to treat multiple myeloma. Despite significant clinical success with these agents, development of resistance often limits therapeutic benefit. However, many questions remain unanswered regarding the molecular mechanisms underlying acquired resistance to proteasome inhibitors. In order to understand the dynamics of structural evolution of the multiple myeloma genome under selective pressure afforded by proteasome inhibition and to identify targets to overcome acquired resistance, we derived global optical maps of two myeloma cancer genomes (DNA extracted from CD138+ tumor cells), obtained sequentially from the same patient before and after development of resistance to bortezomib, the prototypical therapeutic proteasome inhibitor. Optical Mapping is a high throughput, single molecule, whole genome analysis that offers the highest rate of discovery of structural and numerical variants, free of the confounders associated with hybridization-based approaches. Briefly, the Optical Mapping System assembles entire genomes from large datasets of Rmaps (Rmap = a restriction-mapped individual genomic DNA molecule; see Figure 1) from which novel balanced and complex structural variants (2 kb – entire chromosomes) are discovered and tabulated by our pipeline (Figure 2). We identified multiple structural variants including single nucleotide variations (SNVs), deletions, insertions, inversions, and loss of heterozygosity regions across the entire genome. Some of these variants are common to both bortezomib-sensitive and bortezomib-resistant genomes. We also discovered variants that were unique to the bortezomib resistant genome, implicating a role in acquisition of drug resistance.
Many of these structural variants encompass genes, some of which have not been previously associated with multiple myeloma and bortezomib resistance, thus providing a rationale for further interrogation of these novel targets. In addition to novel potential targets, known recurrent events including del(13q) and a deletion spanning the CDKN2C/FAF1 locus on chromosome 1 were detected. Future efforts are directed towards integration and correlation of optical maps with whole genome sequencing and transcriptional profiling as well as establishing the frequency of prioritized genomic perturbations in bortezomib-sensitive and -resistant patient populations. The integration of structural optical maps with base-pair sequence information and transcriptomic tracks will generate an entirely new view of the multiple myeloma cancer genome at a previously unseen resolution.
Fig. 1. Overview of the Optical Mapping platform. Bulk microscope cover glass is cleaned with a strong acid, then treated with a silane mixture to make positively charged Optical Mapping surfaces (i). A silicon wafer is patterned with standard photolithography techniques, and then replicated into a flexible PDMS microfluidic device (ii) using soft lithography. Finally, pure, high molecular-weight DNA (iii) is isolated from cultured eukaryotic cells using a gentle detergent-based lysis protocol. The microfluidic device is adhered to the Optical Mapping surface, and the DNA solution is pumped through the microchannels, wherein the DNA is elongated and attached to the Optical Mapping surface via electrostatic interaction (iv). The DNA is incubated with a restriction endonuclease (v), which cleaves the DNA at its cognate sites. The cleaved DNA is stained and imaged on an epifluorescence microscope (vi) illuminated by an argon-ion laser (vii) and controlled by a computer workstation (viii).
Fig. 2. Overview of the map assembly pipeline. Reference maps are generated in silico from the NCBI Build 35 human genome reference sequence, and used to seed an iterative process of pairwise alignment (which clusters together similar single-molecule maps) and local assembly (which generates a consensus optical map from a cluster of single-molecule maps). After several iterations of alignment and assembly, the consensus maps are aligned back to the reference map and analyzed for places where the consensus map differs significantly from the reference.
Disclosures: No relevant conflicts of interest to declare.
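The in silico reference maps mentioned in the Fig. 2 caption can be pictured with a small sketch: scan a sequence for a restriction enzyme's recognition site and report the ordered fragment lengths between cuts. The function names and the toy sequence below are hypothetical, and the abstract does not say which enzyme the authors used. The example uses the well-known EcoRI site GAATTC, which is cut after the first base, and scans the forward strand only, which is sufficient for a palindromic site.

```python
def find_cut_sites(sequence, motif):
    """Return 0-based start positions of every occurrence of motif."""
    sites = []
    start = sequence.find(motif)
    while start != -1:
        sites.append(start)
        start = sequence.find(motif, start + 1)  # allow overlapping hits
    return sites


def fragment_lengths(sequence, motif, cut_offset=0):
    """Ordered fragment lengths after cutting cut_offset bases into each site."""
    cuts = [site + cut_offset for site in find_cut_sites(sequence, motif)]
    boundaries = [0] + cuts + [len(sequence)]
    # Skip zero-length pieces that arise when a cut falls on a sequence end.
    return [b - a for a, b in zip(boundaries, boundaries[1:]) if b > a]


# Toy sequence with two EcoRI sites (G^AATTC, so the cut offset is 1).
toy = "AA" + "GAATTC" + "TTTT" + "GAATTC" + "CC"
print(fragment_lengths(toy, "GAATTC", cut_offset=1))  # [3, 10, 7]
```

The resulting list of lengths is the kind of ordered numeric representation against which single-molecule maps can be aligned; a real pipeline would also model sizing error and missing or spurious cuts, which this sketch ignores.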


2021 ◽  
Author(s):  
Guohua Gao ◽  
Jeroen Vink ◽  
Fredrik Saaf ◽  
Terence Wells

Abstract When formulating history matching within the Bayesian framework, we may quantify the uncertainty of model parameters and production forecasts using conditional realizations sampled from the posterior probability density function (PDF). It is quite challenging to sample such a posterior PDF. Some methods, e.g., Markov chain Monte Carlo (MCMC), are very expensive, while others are cheaper but may generate biased samples. In this paper, we propose an unconstrained Gaussian Mixture Model (GMM) fitting method to approximate the posterior PDF and investigate new strategies to further enhance its performance. To reduce the CPU time of handling bound constraints, we reformulate the GMM fitting problem such that an unconstrained optimization algorithm can be applied to find the optimal solution of unknown GMM parameters. To obtain a sufficiently accurate GMM approximation with the lowest number of Gaussian components, we generate random initial guesses, remove components with very small or very large mixture weights after each GMM fitting iteration and prevent their reappearance using a dedicated filter. To prevent overfitting, we only add a new Gaussian component if the quality of the GMM approximation on a (large) set of blind-test data sufficiently improves. The unconstrained GMM fitting method with the new strategies proposed in this paper is validated using nonlinear toy problems and then applied to a synthetic history matching example. It can construct a GMM approximation of the posterior PDF that is comparable to the MCMC method, and it is significantly more efficient than the constrained GMM fitting formulation, e.g., reducing the CPU time by a factor of 800 to 7300 for problems we tested, which makes it quite attractive for large scale history matching problems.
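The abstract does not spell out the reformulation, so the sketch below shows one common way to remove bound constraints from GMM fitting, which may differ from the authors' approach: mixture weights are expressed through a softmax of free logits, and variances through the exponential of free log-variances. Any real-valued parameter vector then maps to a valid one-dimensional mixture. The parameter layout and function names are assumptions made for illustration.

```python
import math


def softmax(logits):
    """Map arbitrary real numbers to positive weights that sum to one."""
    shift = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(z - shift) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def unpack(theta, k):
    """Split an unconstrained vector of length 3k into valid GMM parameters.

    Assumed layout: k weight logits, then k means, then k log-variances.
    """
    logits, means, log_vars = theta[:k], theta[k:2 * k], theta[2 * k:3 * k]
    return softmax(logits), list(means), [math.exp(s) for s in log_vars]


def gmm_pdf(x, theta, k):
    """Evaluate the 1-D mixture density defined by unconstrained theta."""
    weights, means, variances = unpack(theta, k)
    return sum(
        w * math.exp(-((x - m) ** 2) / (2.0 * v)) / math.sqrt(2.0 * math.pi * v)
        for w, m, v in zip(weights, means, variances)
    )


# Any real vector is admissible: no bounds on weights or variances are needed.
theta = [-5.0, 3.0, 0.0, 1.0, -2.0, 4.0]
weights, means, variances = unpack(theta, k=2)
print(weights, variances)
```

With this parameterization a general-purpose unconstrained optimizer can update theta freely, since positivity of the variances and the sum-to-one condition on the weights hold by construction.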


2021 ◽  
Author(s):  
Brian P. Anton ◽  
Alexey Fomenkov ◽  
Victoria Wu ◽  
Richard J. Roberts

ABSTRACT Single-molecule Real-Time (SMRT) sequencing can easily identify sites of N6-methyladenine and N4-methylcytosine within DNA sequences, but similar identification of 5-methylcytosine sites is not as straightforward. In prokaryotic DNA, methylation typically occurs within specific sequence contexts, or motifs, that are a property of the methyltransferases that “write” these epigenetic marks. We present here a straightforward, cost-effective alternative to both SMRT and bisulfite sequencing for the determination of prokaryotic 5-methylcytosine methylation motifs. The method, called MFRE-Seq, relies on excision and isolation of fully methylated fragments of predictable size using MspJI-Family Restriction Enzymes (MFREs), which depend on the presence of 5-methylcytosine for cleavage. We demonstrate that MFRE-Seq is compatible with both Illumina and Ion Torrent sequencing platforms and requires only a digestion step and simple column purification of size-selected digest fragments prior to standard library preparation procedures. We applied MFRE-Seq to numerous bacterial and archaeal genomic DNA preparations and successfully confirmed known motifs and identified novel ones. This method should be a useful complement to existing methodologies for studying prokaryotic methylomes and characterizing the contributing methyltransferases.


PLoS ONE ◽  
2021 ◽  
Vol 16 (5) ◽  
pp. e0247541
Author(s):  
Brian P. Anton ◽  
Alexey Fomenkov ◽  
Victoria Wu ◽  
Richard J. Roberts

Single-molecule Real-Time (SMRT) sequencing can easily identify sites of N6-methyladenine and N4-methylcytosine within DNA sequences, but similar identification of 5-methylcytosine sites is not as straightforward. In prokaryotic DNA, methylation typically occurs within specific sequence contexts, or motifs, that are a property of the methyltransferases that “write” these epigenetic marks. We present here a straightforward, cost-effective alternative to both SMRT and bisulfite sequencing for the determination of prokaryotic 5-methylcytosine methylation motifs. The method, called MFRE-Seq, relies on excision and isolation of fully methylated fragments of predictable size using MspJI-Family Restriction Enzymes (MFREs), which depend on the presence of 5-methylcytosine for cleavage. We demonstrate that MFRE-Seq is compatible with both Illumina and Ion Torrent sequencing platforms and requires only a digestion step and simple column purification of size-selected digest fragments prior to standard library preparation procedures. We applied MFRE-Seq to numerous bacterial and archaeal genomic DNA preparations and successfully confirmed known motifs and identified novel ones. This method should be a useful complement to existing methodologies for studying prokaryotic methylomes and characterizing the contributing methyltransferases.


2017 ◽  
Author(s):  
Mircea Cretu Stancu ◽  
Markus J. van Roosmalen ◽  
Ivo Renkens ◽  
Marleen Nieboer ◽  
Sjors Middelkamp ◽  
...  

Abstract Structural genomic variants form a common type of genetic alteration underlying human genetic disease and phenotypic variation. Despite major improvements in genome sequencing technology and data analysis, the detection of structural variants still poses challenges, particularly when variants are of high complexity. Emerging long-read single-molecule sequencing technologies provide new opportunities for detection of structural variants. Here, we demonstrate sequencing of the genomes of two patients with congenital abnormalities using the ONT MinION at 11x and 16x mean coverage, respectively. We developed a bioinformatic pipeline, NanoSV, to efficiently map genomic structural variants (SVs) from the long-read data. We demonstrate that the nanopore data are superior to corresponding short-read data with regard to detection of de novo rearrangements originating from complex chromothripsis events in the patients. Additionally, genome-wide surveillance of SVs revealed 3,253 (33%) novel variants that were missed in short-read data of the same sample, the majority of which are duplications < 200 bp in size. Long sequencing reads enabled efficient phasing of genetic variations, allowing the construction of genome-wide maps of phased SVs and SNVs. We employed read-based phasing to show that all de novo chromothripsis breakpoints occurred on paternal chromosomes and we resolved the long-range structure of the chromothripsis. This work demonstrates the value of long-read sequencing for screening whole genomes of patients for complex structural variants.

