Genome-wide epigenetic profiling of 5-hydroxymethylcytosine by long-read optical mapping

AbstractThe epigenetic mark 5-hydroxymethylcytosine (5-hmC) is a distinct product of active enzymatic demethylation that is linked to gene regulation, development and disease. Genome-wide 5-hmC profiles generated by short-read next-generation sequencing are limited in providing long-range epigenetic information relevant to highly variable genomic regions, such as the 3.7 Mbp disease-related Human Leukocyte Antigen (HLA) region. We present a long-read, single-molecule mapping technology that generates hybrid genetic/epigenetic profiles of native chromosomal DNA. The genome-wide distribution of 5- hmC in human peripheral blood cells correlates well with 5-hmC DNA immunoprecipitation (hMeDIP) sequencing. However, the long read length of 100 kbp-1Mbp produces 5-hmC profiles across variable genomic regions that failed to showup in the sequencing data. In addition, optical 5-hmC mapping shows strong correlation between the 5-hmC density in gene bodies and the corresponding level of gene expression. The single molecule concept provides information on the distribution and coexistence of 5-hmC signals at multiple genomic loci on the same genomic DNA molecule, revealing long-range correlations and cell-to-cell epigenetic variation.

Download Full-text

Cas9-Assisted Targeting of CHromosome segments (CATCH) for targeted nanopore sequencing and optical genome mapping

10.1101/110163 ◽

2017 ◽

Cited By ~ 5

Author(s):

Tslil Gabrieli ◽

Hila Sharim ◽

Yael Michaeli ◽

Yuval Ebenstein

Keyword(s):

Single Molecule ◽

Genome Mapping ◽

Single Point ◽

Read Length ◽

Whole Genome ◽

Sequencing Analysis ◽

Nanopore Sequencing ◽

Sequencing Data ◽

Whole Genome Analysis ◽

Long Read

ABSTRACTVariations in the genetic code, from single point mutations to large structural or copy number alterations, influence susceptibility, onset, and progression of genetic diseases and tumor transformation. Next-generation sequencing analysis is unable to reliably capture aberrations larger than the typical sequencing read length of several hundred bases. Long-read, single-molecule sequencing methods such as SMRT and nanopore sequencing can address larger variations, but require costly whole genome analysis. Here we describe a method for isolation and enrichment of a large genomic region of interest for targeted analysis based on Cas9 excision of two sites flanking the target region and isolation of the excised DNA segment by pulsed field gel electrophoresis. The isolated target remains intact and is ideally suited for optical genome mapping and long-read sequencing at high coverage. In addition, analysis is performed directly on native genomic DNA that retains genetic and epigenetic composition without amplification bias. This method enables detection of mutations and structural variants as well as detailed analysis by generation of hybrid scaffolds composed of optical maps and sequencing data at a fraction of the cost of whole genome sequencing.

Download Full-text

Genome-wide integration site detection using Cas9 enriched amplification-free long-range sequencing

Nucleic Acids Research ◽

10.1093/nar/gkaa1152 ◽

2020 ◽

Author(s):

Joost van Haasteren ◽

Altar M Munis ◽

Deborah R Gill ◽

Stephen C Hyde

Keyword(s):

Long Range ◽

Insertional Mutagenesis ◽

Lentiviral Vector ◽

Integration Site ◽

Low Cost ◽

Safety Evaluation ◽

Read Length ◽

Chromosomal Dna ◽

Wide Range

Abstract The gene and cell therapy fields are advancing rapidly, with a potential to treat and cure a wide range of diseases, and lentivirus-based gene transfer agents are the vector of choice for many investigators. Early cases of insertional mutagenesis caused by gammaretroviral vectors highlighted that integration site (IS) analysis was a major safety and quality control checkpoint for lentiviral applications. The methods established to detect lentiviral integrations using next-generation sequencing (NGS) are limited by short read length, inadvertent PCR bias, low yield, or lengthy protocols. Here, we describe a new method to sequence IS using Amplification-free Integration Site sequencing (AFIS-Seq). AFIS-Seq is based on amplification-free, Cas9-mediated enrichment of high-molecular-weight chromosomal DNA suitable for long-range Nanopore MinION sequencing. This accessible and low-cost approach generates long reads enabling IS mapping with high certainty within a single day. We demonstrate proof-of-concept by mapping IS of lentiviral vectors in a variety of cell models and report up to 1600-fold enrichment of the signal. This method can be further extended to sequencing of Cas9-mediated integration of genes and to in vivo analysis of IS. AFIS-Seq uses long-read sequencing to facilitate safety evaluation of preclinical lentiviral vector gene therapies by providing IS analysis with improved confidence.

Download Full-text

Long-range single-molecule mapping of chromatin modification in eukaryotes

10.1101/2021.07.08.451578 ◽

2021 ◽

Author(s):

Zhe Weng ◽

Fengying Ruan ◽

Weitian Chen ◽

Zhe Xie ◽

Yeming Xie ◽

...

Keyword(s):

Histone Modification ◽

Long Range ◽

Single Molecule ◽

Protein A ◽

Chromatin Modification ◽

Protein Binding Sites ◽

Short Read Sequencing ◽

Third Generation Sequencing ◽

Long Read ◽

Generation Sequencing

The epigenetic modifications of histones are essential marks related to the development and disease pathogenesis, including human cancers. Mapping histone modification has emerged as the widely used tool for studying epigenetic regulation. However, existing approaches limited by fragmentation and short-read sequencing cannot provide information about the long-range chromatin states and represent the average chromatin status in samples. We leveraged the advantage of long read sequencing to develop a method "BIND&MODIFY" for profiling the histone modification of individual DNA fiber. Our approach is based on the recombinant fused protein A-EcoGII, which tethers the methyltransferase EcoGII to the protein binding sites and locally labels the neighboring DNA regions through artificial methylations. We demonstrate that the aggregated BIND&MODIFY signal matches the bulk-level ChIP-seq and CUT&TAG, observe the single-molecule heterogenous histone modification status, and quantify the correlation between distal elements. This method could be an essential tool in the future third-generation sequencing ages.

Download Full-text

Mapping and phasing of structural variation in patient genomes using nanopore sequencing

10.1101/129379 ◽

2017 ◽

Cited By ~ 4

Author(s):

Mircea Cretu Stancu ◽

Markus J. van Roosmalen ◽

Ivo Renkens ◽

Marleen Nieboer ◽

Sjors Middelkamp ◽

...

Keyword(s):

Single Molecule ◽

De Novo ◽

Structural Variants ◽

Human Genetic Disease ◽

Structural Genomic ◽

Short Read ◽

Sequencing Technologies ◽

Genome Wide ◽

Long Read ◽

Complex Structural

AbstractStructural genomic variants form a common type of genetic alteration underlying human genetic disease and phenotypic variation. Despite major improvements in genome sequencing technology and data analysis, the detection of structural variants still poses challenges, particularly when variants are of high complexity. Emerging long-read single-molecule sequencing technologies provide new opportunities for detection of structural variants. Here, we demonstrate sequencing of the genomes of two patients with congenital abnormalities using the ONT MinION at 11x and 16x mean coverage, respectively. We developed a bioinformatic pipeline - NanoSV - to efficiently map genomic structural variants (SVs) from the long-read data. We demonstrate that the nanopore data are superior to corresponding short-read data with regard to detection of de novo rearrangements originating from complex chromothripsis events in the patients. Additionally, genome-wide surveillance of SVs, revealed 3,253 (33%) novel variants that were missed in short-read data of the same sample, the majority of which are duplications < 200bp in size. Long sequencing reads enabled efficient phasing of genetic variations, allowing the construction of genome-wide maps of phased SVs and SNVs. We employed read-based phasing to show that all de novo chromothripsis breakpoints occurred on paternal chromosomes and we resolved the long-range structure of the chromothripsis. This work demonstrates the value of long-read sequencing for screening whole genomes of patients for complex structural variants.

Download Full-text

Improved haplotype inference by exploiting long-range linking and allelic imbalance in RNA-seq datasets

Nature Communications ◽

10.1038/s41467-020-18320-z ◽

2020 ◽

Vol 11 (1) ◽

Cited By ~ 2

Author(s):

Emily Berger ◽

Deniz Yorukoglu ◽

Lillian Zhang ◽

Sarah K. Nyquist ◽

Alex K. Shalek ◽

...

Keyword(s):

Long Range ◽

Genetic Variants ◽

Read Length ◽

Rna Seq ◽

Sequencing Data ◽

Specific Expression ◽

Integrative Framework ◽

Whole Exome ◽

Allele Specific ◽

Diverse Data

Abstract Haplotype reconstruction of distant genetic variants remains an unsolved problem due to the short-read length of common sequencing data. Here, we introduce HapTree-X, a probabilistic framework that utilizes latent long-range information to reconstruct unspecified haplotypes in diploid and polyploid organisms. It introduces the observation that differential allele-specific expression can link genetic variants from the same physical chromosome, thus even enabling using reads that cover only individual variants. We demonstrate HapTree-X’s feasibility on in-house sequenced Genome in a Bottle RNA-seq and various whole exome, genome, and 10X Genomics datasets. HapTree-X produces more complete phases (up to 25%), even in clinically important genes, and phases more variants than other methods while maintaining similar or higher accuracy and being up to 10× faster than other tools. The advantage of HapTree-X’s ability to use multiple lines of evidence, as well as to phase polyploid genomes in a single integrative framework, substantially grows as the amount of diverse data increases.

Download Full-text

Transcriptome Profiling Provides Insight into the Genes in Carotenoid Biosynthesis during the Mesocarp and Seed Developmental Stages of Avocado (Persea americana)

International Journal of Molecular Sciences ◽

10.3390/ijms20174117 ◽

2019 ◽

Vol 20 (17) ◽

pp. 4117 ◽

Cited By ~ 8

Author(s):

Yu Ge ◽

Zhihao Cheng ◽

Xiongyuan Si ◽

Weihong Ma ◽

Lin Tan ◽

...

Keyword(s):

Single Molecule ◽

Developmental Stages ◽

Gene Dosage ◽

Average Length ◽

Beta Carotene ◽

Transcriptome Profiling ◽

Carotenoid Biosynthesis ◽

Persea Americana ◽

Sequencing Data ◽

Long Read

Avocado (Persea americana Mill.) is an economically important crop because of its high nutritional value. However, the absence of a sequenced avocado reference genome has hindered investigations of secondary metabolism. For next-generation high-throughput transcriptome sequencing, we obtained 365,615,152 and 348,623,402 clean reads as well as 109.13 and 104.10 Gb of sequencing data for avocado mesocarp and seed, respectively, during five developmental stages. High-quality reads were assembled into 100,837 unigenes with an average length of 847.40 bp (N50 = 1725 bp). Additionally, 16,903 differentially expressed genes (DEGs) were detected, 17 of which were related to carotenoid biosynthesis. The expression levels of most of these 17 DEGs were higher in the mesocarp than in the seed during five developmental stages. In this study, the avocado mesocarp and seed transcriptome were also sequenced using single-molecule long-read sequencing to acquired 25.79 and 17.67 Gb clean data, respectively. We identified 233,014 and 238,219 consensus isoforms in avocado mesocarp and seed, respectively. Furthermore, 104 and 59 isoforms were found to correspond to the putative 11 carotenoid biosynthetic-related genes in the avocado mesocarp and seed, respectively. The isoform numbers of 10 out of the putative 11 genes involved in the carotenoid biosynthetic pathway were higher in the mesocarp than those in the seed. Besides, alpha- and beta-carotene contents in the avocado mesocarp and seed during five developmental stages were also measured, and they were higher in the mesocarp than in the seed, which validated the results of transcriptome profiling. Gene expression changes and the associated variations in gene dosage could influence carotenoid biosynthesis. These results will help to further elucidate carotenoid biosynthesis in avocado.

Download Full-text

Kohdista: an efficient method to index and query possible Rmap alignments

Algorithms for Molecular Biology ◽

10.1186/s13015-019-0160-9 ◽

2019 ◽

Vol 14 (1) ◽

Cited By ~ 1

Author(s):

Martin D. Muggli ◽

Simon J. Puglisi ◽

Christina Boucher

Keyword(s):

Single Molecule ◽

Restriction Enzymes ◽

Consensus Approach ◽

E Coli ◽

Genome Wide ◽

Alignment Problem ◽

Optical Map ◽

Optical Maps ◽

Map Data ◽

Genomic Regions

Abstract Background Genome-wide optical maps are ordered high-resolution restriction maps that give the position of occurrence of restriction cut sites corresponding to one or more restriction enzymes. These genome-wide optical maps are assembled using an overlap-layout-consensus approach using raw optical map data, which are referred to as Rmaps. Due to the high error-rate of Rmap data, finding the overlap between Rmaps remains challenging. Results We present Kohdista, which is an index-based algorithm for finding pairwise alignments between single molecule maps (Rmaps). The novelty of our approach is the formulation of the alignment problem as automaton path matching, and the application of modern index-based data structures. In particular, we combine the use of the Generalized Compressed Suffix Array (GCSA) index with the wavelet tree in order to build Kohdista. We validate Kohdista on simulated E. coli data, showing the approach successfully finds alignments between Rmaps simulated from overlapping genomic regions. Conclusion we demonstrate Kohdista is the only method that is capable of finding a significant number of high quality pairwise Rmap alignments for large eukaryote organisms in reasonable time.

Download Full-text

SvABA: Genome-wide detection of structural variants and indels by local assembly

10.1101/105080 ◽

2017 ◽

Cited By ~ 9

Author(s):

Jeremiah Wala ◽

Pratiti Bandopadhayay ◽

Noah Greenwald ◽

Ryan O’Rourke ◽

Ted Sharpe ◽

...

Keyword(s):

Variant Calling ◽

Accurate Method ◽

Structural Variants ◽

Sequencing Data ◽

Cancer Driver ◽

Insertion And Deletion ◽

Genome Wide ◽

Cancer Genomes ◽

Local Assembly ◽

Genomic Regions

AbstractStructural variants (SVs), including small insertion and deletion variants (indels), are challenging to detect through standard alignment-based variant calling methods. Sequence assembly offers a powerful approach to identifying SVs, but is difficult to apply at-scale genome-wide for SV detection due to its computational complexity and the difficulty of extracting SVs from assembly contigs. We describe SvABA, an efficient and accurate method for detecting SVs from short-read sequencing data using genome-wide local assembly with low memory and computing requirements. We evaluated SvABA’s performance on the NA12878 human genome and in simulated and real cancer genomes. SvABA demonstrates superior sensitivity and specificity across a large spectrum of SVs, and substantially improved detection performance for variants in the 20-300 bp range, compared with existing methods. SvABA also identifies complex somatic rearrangements with chains of short (< 1,000 bp) templated-sequence insertions copied from distant genomic regions. We applied SvABA to 344 cancer genomes from 11 cancer types, and found that templated-sequence insertions occur in ~4% of all somatic rearrangements. Finally, we demonstrate that SvABA can identify sites of viral integration and cancer driver alterations containing medium-sized SVs.

Download Full-text

LRSDAY: Long-read Sequencing Data Analysis for Yeasts

10.1101/184572 ◽

2017 ◽

Author(s):

Jia-Xing Yue ◽

Gianni Liti

Keyword(s):

Genome Assembly ◽

Model Organism ◽

Sequencing Data ◽

Protein Coding ◽

Sequencing Technologies ◽

Long Reads ◽

Long Read ◽

Downstream Analysis ◽

Eukaryotic Organisms ◽

Genomic Regions

AbstractLong-read sequencing technologies have become increasingly popular in genome projects due to their strengths in resolving complex genomic regions. As a leading model organism with small genome size and great biotechnological importance, the budding yeast, Saccharomyces cerevisiae, has many isolates currently being sequenced with long reads. However, analyzing long-read sequencing data to produce high-quality genome assembly and annotation remains challenging. Here we present LRSDAY, the first one-stop solution to streamline this process. LRSDAY can produce chromosome-level end-to-end genome assembly and comprehensive annotations for various genomic features (including centromeres, protein-coding genes, tRNAs, transposable elements and telomere-associated elements) that are ready for downstream analysis. Although tailored for S. cerevisiae, we designed LRSDAY to be highly modular and customizable, making it adaptable for virtually any eukaryotic organisms. Applying LRSDAY to a S. cerevisiae strain takes ∼43 hrs to generate a complete and well-annotated genome from ∼100X Pacific Biosciences (PacBio) reads using four threads.

Download Full-text

NanoCaller for accurate detection of SNPs and indels in difficult-to-map regions from long-read sequencing by haplotype-aware deep neural networks

10.1101/2019.12.29.890418 ◽

2019 ◽

Cited By ~ 1

Author(s):

Umair Ahsan ◽

Qian Liu ◽

Li Fang ◽

Kai Wang

Keyword(s):

Deep Neural Network ◽

Deep Neural Networks ◽

Variant Calling ◽

Sequencing Data ◽

Long Reads ◽

Novel Variants ◽

Long Read ◽

Variant Detection ◽

Genomic Regions ◽

Haplotype Information

AbstractVariant (SNPs/indels) detection from high-throughput sequencing data remains an important yet unresolved problem. Long-read sequencing enables variant detection in difficult-to-map genomic regions that short-read sequencing cannot reliably examine (for example, only ~80% of genomic regions are marked as “high-confidence region” to have SNP/indel calls in the Genome In A Bottle project); however, the high per-base error rate poses unique challenges in variant detection. Existing methods on long-read data typically rely on analyzing pileup information from neighboring bases surrounding a candidate variant, similar to short-read variant callers, yet the benefits of much longer read length are not fully exploited. Here we present a deep neural network called NanoCaller, which detects SNPs by examining pileup information solely from other nonadjacent candidate SNPs that share the same long reads using long-range haplotype information. With called SNPs by NanoCaller, NanoCaller phases long reads and performs local realignment on two sets of phased reads to call indels by another deep neural network. Extensive evaluation on 5 human genomes (sequenced by Nanopore and PacBio long-read techniques) demonstrated that NanoCaller greatly improved performance in difficult-to-map regions, compared to other long-read variant callers. We experimentally validated 41 novel variants in difficult-to-map regions in a widely-used benchmarking genome, which cannot be reliably detected previously. We extensively evaluated the run-time characteristics and the sensitivity of parameter settings of NanoCaller to different characteristics of sequencing data. Finally, we achieved the best performance in Nanopore-based variant calling from MHC regions in the PrecisionFDA Variant Calling Challenge on Difficult-to-Map Regions by ensemble calling. In summary, by incorporating haplotype information in deep neural networks, NanoCaller facilitates the discovery of novel variants in complex genomic regions from long-read sequencing data.

Download Full-text