Megabase-scale methylation phasing using nanopore long reads and NanoMethPhase

Abstract: The ability of nanopore sequencing to simultaneously detect modified nucleotides while producing long reads makes it ideal for detecting and phasing allele-specific methylation. However, there is currently no complete software for detecting SNPs, phasing haplotypes, and mapping methylation to these from nanopore sequence data. Here, we present NanoMethPhase, a software tool to phase 5-methylcytosine from nanopore sequencing. We also present SNVoter, which can post-process nanopore SNV calls to improve accuracy in low coverage regions. Together, these tools can accurately detect allele-specific methylation genome-wide using nanopore sequence data with low coverage of about ten-fold redundancy.
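
The general idea of mapping methylation calls to haplotypes can be illustrated with a toy sketch. This is not NanoMethPhase's implementation: the data layout, the function names `assign_haplotype` and `per_haplotype_methylation`, and the simple majority vote are all assumptions made for illustration. Each read is assigned to a haplotype according to the phased SNV alleles it carries, and the per-read CpG calls are then aggregated separately for each haplotype.

```python
from collections import defaultdict

def assign_haplotype(read_alleles, phased_snvs):
    """Majority vote over phased SNVs; returns 1, 2, or None (tie or no overlap)."""
    votes = {1: 0, 2: 0}
    for pos, base in read_alleles.items():
        if pos not in phased_snvs:
            continue
        hap1_base, hap2_base = phased_snvs[pos]
        if base == hap1_base:
            votes[1] += 1
        elif base == hap2_base:
            votes[2] += 1
    if votes[1] == votes[2]:
        return None
    return 1 if votes[1] > votes[2] else 2

def per_haplotype_methylation(reads, phased_snvs):
    """Return {(haplotype, cpg_position): fraction of methylated calls}."""
    counts = defaultdict(lambda: [0, 0])  # [methylated calls, total calls]
    for read in reads:
        hap = assign_haplotype(read["alleles"], phased_snvs)
        if hap is None:
            continue  # unphasable reads are skipped in this toy version
        for cpg, is_methylated in read["cpgs"].items():
            counts[(hap, cpg)][0] += int(is_methylated)
            counts[(hap, cpg)][1] += 1
    return {key: meth / total for key, (meth, total) in counts.items()}

# Hypothetical toy data: position -> (haplotype 1 allele, haplotype 2 allele)
phased_snvs = {100: ("A", "G"), 250: ("C", "T")}
reads = [
    {"alleles": {100: "A", 250: "C"}, "cpgs": {150: True, 200: True}},
    {"alleles": {100: "G"}, "cpgs": {150: False}},
    {"alleles": {250: "T"}, "cpgs": {150: False, 200: False}},
    {"alleles": {100: "A", 250: "T"}, "cpgs": {150: True}},  # tie, skipped
]
result = per_haplotype_methylation(reads, phased_snvs)
print(result)
```

In this toy data set the CpGs at positions 150 and 200 come out fully methylated on haplotype 1 and unmethylated on haplotype 2, which is the pattern an allele-specific methylation call would be looking for.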

Genome-Wide Detection of Imprinted Differentially Methylated Regions Using Nanopore Sequencing

10.1101/2021.07.17.452734, 2021

Author(s): Vahid Akbari, Jean-Michel Garant, Kieran O'Neill, Pawan Pandoh, Richard Moore, ...

Keyword(s): B Lymphocyte, Sequence Data, Nanopore Sequencing, Differentially Methylated Regions, Methylation Array, Current State, Genome Wide, Long Reads, Genome Bisulfite Sequencing, Potential Applications

Imprinting is a critical part of normal embryonic development in mammals, controlled by defined parent-of-origin (PofO) differentially methylated regions (DMRs) known as imprinting control regions. As we and others have shown, direct nanopore sequencing of DNA provides a means to detect allelic methylation and to overcome the drawbacks of methylation array and short-read technologies. Here we leverage publicly available nanopore sequence data for 12 standard B-lymphocyte cell lines to present the first genome-wide mapping of imprinted intervals in humans using this technology. We were able to phase 95% of the human methylome and detect 94% of the well-characterized imprinted DMRs. In addition, we found 28 novel imprinted DMRs (12 germline and 16 somatic), which we confirmed using whole-genome bisulfite sequencing (WGBS) data. Analysis of WGBS data in mouse, rhesus, and chimp suggested that 12 of these are conserved. We also detected subtle parental methylation bias spanning several kilobases at seven known imprinted clusters. These results expand the current state of knowledge of imprinting, with potential applications in the clinic. We have also demonstrated that nanopore long reads can reveal imprinting using only parent-offspring trios, as opposed to the large multi-generational pedigrees that have previously been required.

A Statistical Method for Observing Personal Diploid Methylomes and Transcriptomes with Single-Molecule Real-Time Sequencing

Genes, 10.3390/genes9090460, 2018, Vol 9 (9), pp. 460, Cited By ~ 1

Author(s): Yuta Suzuki, Yunhao Wang, Kin Au, Shinichi Morishita

Keyword(s): Statistical Model, Real Time, Single Molecule, Error Rate, Methylation Pattern, Specific Expression, Long Reads, Allele Specific, Complex Locus, Allele Specific Methylation

We address the problem of observing personal diploid methylomes, that is, CpG methylome pairs of homologous chromosomes that are distinguishable with respect to phased heterozygous variants (PHVs). This is challenging because PHVs are scarce in personal genomes. Single molecule real-time (SMRT) sequencing is promising as it outputs long reads with CpG methylation information, but a serious concern is whether reliable PHVs are available in erroneous SMRT reads with an error rate of ∼15%. To overcome the issue, we propose a statistical model that reduces the error rate of phasing CpG sites to 1%, thereby calling CpG hypomethylation in each haplotype with >90% precision and sensitivity. Using our statistical model, we examined the GNAS complex locus, known for a combination of maternally, paternally, or biallelically expressed isoforms, and observed allele-specific methylation patterns almost perfectly reflecting their respective allele-specific expression status, demonstrating the merit of elucidating comprehensive personal diploid methylomes and transcriptomes.

Whole-genome analysis of Malawian Plasmodium falciparum isolates identifies potential targets of allele-specific immunity to clinical malaria

10.1101/2020.09.16.20196253, 2020

Author(s): Zalak Shah, Myo T Naung, Kara A Moser, Matthew Adams, Andrea G Buchwald, ...

Keyword(s): Plasmodium Falciparum, Sequence Data, Clinical Malaria, Whole Genome Sequence, Whole Genome, Whole Genome Analysis, Multiple Alleles, Vaccine Candidates, Genome Wide, Allele Specific

Individuals acquire immunity to clinical malaria after repeated Plasmodium falciparum infections. This immunity to disease is thought to reflect the acquisition of a repertoire of responses to multiple alleles in diverse parasite antigens. In previous studies, we identified polymorphic sites within individual antigens that are associated with parasite immune evasion by examining antigen allele dynamics in individuals followed longitudinally. Here we expand this approach by analyzing genome-wide polymorphisms using whole genome sequence data from 140 parasite isolates representing malaria cases from a longitudinal study in Malawi and identify 25 genes that encode likely targets of naturally acquired immunity and that should be further characterized for their potential as vaccine candidates.

A sorghum Practical Haplotype Graph facilitates genome-wide imputation and cost-effective genomic prediction

10.1101/775221, 2019

Author(s): Sarah E. Jensen, Jean Rigaud Charles, Kebede Muleta, Peter Bradbury, Terry Casstevens, ...

Keyword(s): Genomic Selection, Genomic Prediction, Sequence Data, Input Sequence, Genotyping By Sequencing, Cost Effective, Genome Wide, Variant Information, Sequencing Platforms, Low Coverage

Abstract: Successful management and utilization of increasingly large genomic datasets is essential for breeding programs to increase genetic gain and accelerate cultivar development. To help with data management and storage, we developed a sorghum Practical Haplotype Graph (PHG) pangenome database that stores all identified haplotypes and variant information for a given set of individuals. We developed two PHGs in sorghum, one with 24 individuals and another with 398 individuals, that reflect the diversity across genic regions of the sorghum genome. Twenty-four founders of the Chibas sorghum breeding program were sequenced at low coverage (0.01x) and processed through the PHG to identify genome-wide variants. The PHG called SNPs with only 5.9% error at 0.01x coverage - only 3% lower than its accuracy when calling SNPs from 8x coverage sequence. Additionally, 207 progeny from the Chibas genomic selection (GS) training population were sequenced and processed through the PHG. Missing genotypes in the progeny were imputed from the parental haplotypes available in the PHG and used for genomic prediction. Mean prediction accuracies with PHG SNP calls range from 0.57 to 0.73 for different traits, and are similar to prediction accuracies obtained with genotyping-by-sequencing (GBS) or markers from sequencing targeted amplicons (rhAmpSeq). This study provides a proof of concept for using a sorghum PHG to call and impute SNPs from low-coverage sequence data and also shows that the PHG can unify genotype calls from different sequencing platforms. By reducing the amount of input sequence needed, the PHG has the potential to decrease the cost of genotyping for genomic selection, making GS more feasible and facilitating larger breeding populations that can capture maximum recombination.
Our results demonstrate that the PHG is a useful research and breeding tool that can maintain variant information from a diverse group of taxa, store sequence data in a condensed but readily accessible format, unify genotypes from different genotyping methods, and provide a cost-effective option for genomic selection for any species.

GAPPadder: A Sensitive Approach for Closing Gaps on Draft Genomes with Short Sequence Reads

10.1101/125534, 2017

Author(s): Chong Chu, Xin Li, Yufeng Wu

Keyword(s): Sequence Data, Bacterial Genome, Software Tool, Sea Bass, Short Sequence, Asian Sea Bass, Long Reads, Local Assembly, Genomic Repeats, Gap Closing

Abstract. Background: Closing gaps in draft genomes is an important post-processing step in genome assembly. It leads to more complete genomes, which benefits downstream genome analysis such as annotation and genotyping. Several tools have been developed for gap closing. However, these tools don’t fully utilize the information contained in the sequence data. For example, while it is known that many gaps are caused by genomic repeats, existing tools often ignore many sequence reads that originate from a repeat-related gap. Results: In this paper, we propose a new approach called GAPPadder for gap closing. The main advantage of GAPPadder is that it uses more information in sequence data for gap closing. In particular, GAPPadder finds and uses reads that originate from repeat-related gaps. We show that these repeat-associated reads are useful for gap closing, even though they are ignored by all existing tools. Other main features of GAPPadder include utilizing the information in sequence reads with different insert sizes and performing two-stage local assembly of gap sequences. We compare GAPPadder with GapCloser, GapFiller and Sealer on one bacterial genome, human chromosome 14 and the human whole genome with paired-end and mate-paired reads with both short and long insert sizes. Empirical results show that GAPPadder can close more gaps than these existing tools. Besides closing gaps on draft genomes assembled only from short sequence reads, GAPPadder can also be used to close gaps for draft genomes assembled with long reads. We show GAPPadder can close gaps on the bed bug genome and the Asian sea bass genome, which are assembled partially and fully with long reads, respectively. We also show GAPPadder is efficient in both time and memory usage. The software tool, GAPPadder, is available for download at https://github.com/Reedwarbler/GAPPadder.

ASHIC: hierarchical Bayesian modeling of diploid chromatin contacts and structures

Nucleic Acids Research, 10.1093/nar/gkaa872, 2020, Vol 48 (21), pp. e123-e123

Author(s): Tiantian Ye, Wenxiu Ma

Keyword(s): Bayesian Framework, Hierarchical Bayesian, 3D Structures, Chromatin Interactions, Genome Wide, Specific Contact, Rigorous Framework, Allele Specific, Fine Resolution, Low Coverage

Abstract: The recently developed Hi-C technique has been widely applied to map genome-wide chromatin interactions. However, current methods for analyzing diploid Hi-C data cannot fully distinguish between homologous chromosomes. Consequently, the existing diploid Hi-C analyses are based on sparse and inaccurate allele-specific contact matrices, which might lead to incorrect modeling of diploid genome architecture. Here we present ASHIC, a hierarchical Bayesian framework to model allele-specific chromatin organizations in diploid genomes. We developed two models under the Bayesian framework: the Poisson-multinomial (ASHIC-PM) model and the zero-inflated Poisson-multinomial (ASHIC-ZIPM) model. The proposed ASHIC methods impute allele-specific contact maps from diploid Hi-C data and simultaneously infer allelic 3D structures. Through simulation studies, we demonstrated that ASHIC methods outperformed existing approaches, especially under low coverage and low SNP density conditions. Additionally, in the analyses of diploid Hi-C datasets in mouse and human, our ASHIC-ZIPM method produced fine-resolution diploid chromatin maps and 3D structures and provided insights into the allelic chromatin organizations and functions. To summarize, our work provides a statistically rigorous framework for investigating fine-scale allele-specific chromatin conformations. The ASHIC software is publicly available at https://github.com/wmalab/ASHIC.

Genome-wide detection of cytosine methylation by single molecule real-time sequencing

Proceedings of the National Academy of Sciences, 10.1073/pnas.2019768118, 2021, Vol 118 (5), pp. e2019768118

Author(s): O. Y. Olivia Tse, Peiyong Jiang, Suk Hang Cheng, Wenlei Peng, Huimin Shang, ...

Keyword(s): Real Time, Single Molecule, Cytosine Methylation, Epigenetic Modification, Methylation Status, Area Under The Curve, Buffy Coat, Genome Wide, Allele Specific, Allele Specific Methylation

5-Methylcytosine (5mC) is an important type of epigenetic modification. Bisulfite sequencing (BS-seq) has limitations, such as severe DNA degradation. Using single molecule real-time sequencing, we developed a methodology to directly examine 5mC. This approach holistically examined kinetic signals of a DNA polymerase (including interpulse duration and pulse width) and sequence context for every nucleotide within a measurement window, termed the holistic kinetic (HK) model. The measurement window of each analyzed double-stranded DNA molecule comprised 21 nucleotides with a cytosine in a CpG site in the center. We used amplified DNA (unmethylated) and M.SssI-treated DNA (methylated) (M.SssI being a CpG methyltransferase) to train a convolutional neural network. The area under the curve for differentiating methylation states using such samples was up to 0.97. The sensitivity and specificity for genome-wide 5mC detection at single-base resolution reached 90% and 94%, respectively. The HK model was then tested on human–mouse hybrid fragments in which each member of the hybrid had a different methylation status. The model was also tested on human genomic DNA molecules extracted from various biological samples, such as buffy coat, placental, and tumoral tissues. The overall methylation levels deduced by the HK model were well correlated with those by BS-seq (r = 0.99; P < 0.0001) and allowed the measurement of allele-specific methylation patterns in imprinted genes. Taken together, this methodology has provided a system for simultaneous genome-wide genetic and epigenetic analyses.
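
The 21-nucleotide measurement window centered on a CpG cytosine can be illustrated with a minimal sketch. It covers only the sequence-context part on a single strand; the kinetic signals and the double-stranded treatment described in the abstract are not modeled, and the function name `cpg_windows` and the `flank` parameter are hypothetical choices for this example rather than anything taken from the HK model's code.

```python
def cpg_windows(seq, flank=10):
    """Return (position, window) pairs for CpG sites with full flanks.

    The window spans `flank` bases on each side of the CpG cytosine,
    so flank=10 gives 21 nucleotides with the cytosine at index 10.
    CpG sites too close to either end of the sequence are skipped.
    """
    seq = seq.upper()
    windows = []
    start = seq.find("CG")
    while start != -1:
        lo, hi = start - flank, start + flank + 1
        if lo >= 0 and hi <= len(seq):
            windows.append((start, seq[lo:hi]))
        start = seq.find("CG", start + 1)
    return windows

# Toy sequence: one CpG with full flanks, one too close to the 3' end.
example = "A" * 10 + "CG" + "T" * 10 + "CG"
windows = cpg_windows(example)
for pos, window in windows:
    print(pos, window, len(window))
```

Only the first CpG in the toy sequence yields a window; the second lies too close to the end to have ten bases on its right side.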

TGS-GapCloser: A fast and accurate gap closer for large genomes with low coverage of error-prone long reads

GigaScience, 10.1093/gigascience/giaa094, 2020, Vol 9 (9), Cited By ~ 3

Author(s): Mengyang Xu, Lidong Guo, Shengqiang Gu, Ou Wang, Rui Zhang, ...

Keyword(s): Single Molecule, Sequence Data, Software Tool, Fold Increase, Substantial Improvement, Large Genome, Long Reads, Close Sequence, Genome Assemblies, Large Genomes

Abstract. Background: Analyses that use genome assemblies are critically affected by the contiguity, completeness, and accuracy of those assemblies. In recent years single-molecule sequencing techniques generating long-read information have become available and enabled substantial improvement in contig length and genome completeness, especially for large genomes (>100 Mb), although bioinformatic tools for these applications are still limited. Findings: We developed a software tool to close sequence gaps in genome assemblies, TGS-GapCloser, that uses low-depth (∼10×) long single-molecule reads. The algorithm extracts reads that bridge gap regions between 2 contigs within a scaffold, error corrects only the candidate reads, and assigns the best sequence data to each gap. As a demonstration, we used TGS-GapCloser to improve the scaftig NG50 value of 3 human genome assemblies by 24-fold on average with only ∼10× coverage of Oxford Nanopore or Pacific Biosciences reads, covering up to 94.8% of gaps with sequence data at 97.7% positive predictive value. These improved assemblies achieve 99.998% (Q46) single-base accuracy with final inserted sequences having 99.97% (Q35) accuracy, despite the high raw error rate of single-molecule reads, enabling high-quality downstream analyses, including up to a 31-fold increase in the scaftig NGA50 and up to 13.1% more complete BUSCO genes. Additionally, we show that even in ultra-large genome assemblies, such as the ginkgo (∼12 Gb), TGS-GapCloser can cover 71.6% of gaps with sequence data. Conclusions: TGS-GapCloser can close gaps in large genome assemblies using raw long reads quickly and cost-effectively. The final assemblies generated by TGS-GapCloser have improved contiguity and completeness while maintaining high accuracy. The software is available at https://github.com/BGI-Qingdao/TGS-GapCloser.
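
The Q values quoted alongside the accuracy percentages follow the standard Phred convention, Q = -10 · log10(1 - accuracy). The short sketch below applies that formula to the rounded percentages from the abstract; the function name `accuracy_to_phred` is an illustrative choice, and because the inputs are rounded, the results need not match the quoted Q values exactly.

```python
import math

def accuracy_to_phred(accuracy):
    """Convert a per-base accuracy (0 < accuracy < 1) to a Phred-scaled Q value."""
    error_rate = 1.0 - accuracy
    return -10.0 * math.log10(error_rate)

q_inserted = accuracy_to_phred(0.9997)   # inserted sequences, 99.97%
q_assembly = accuracy_to_phred(0.99998)  # whole assembly, 99.998%
print(f"99.97%  -> Q{q_inserted:.1f}")
print(f"99.998% -> Q{q_assembly:.1f}")
```

With these rounded inputs the formula gives roughly Q35.2 and Q47.0, in line with the quoted Q35 and within about one unit of the quoted Q46; the small gap presumably reflects rounding of the reported accuracy.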

Accurate, ultra-low coverage genome reconstruction and association studies in Hybrid Swarm mapping populations

G3 Genes|Genome|Genetics, 10.1093/g3journal/jkab062, 2021, Vol 11 (4)

Author(s): Cory A Weller, Susanne Tilk, Subhash Rajpurohit, Alan O Bergland

Keyword(s): Genetic Variation, Sequence Data, Association Studies, Recombinant Inbred Lines, Natural Populations, Inbred Lines, Genome Wide Association Studies, Hybrid Swarm, Genome Wide, Low Coverage

Abstract: Genetic association studies seek to uncover the link between genotype and phenotype, and often utilize inbred reference panels as a replicable source of genetic variation. However, inbred reference panels can differ substantially from wild populations in their genotypic distribution, patterns of linkage-disequilibrium, and nucleotide diversity. As a result, associations discovered using inbred reference panels may not reflect the genetic basis of phenotypic variation in natural populations. To address this problem, we evaluated a mapping population design where dozens to hundreds of inbred lines are outbred for a few generations, which we call the Hybrid Swarm. The Hybrid Swarm approach has likely remained underutilized relative to pre-sequenced inbred lines due to the costs of genome-wide genotyping. To reduce sequencing costs and make the Hybrid Swarm approach feasible, we developed a computational pipeline that reconstructs accurate whole genomes from ultra-low-coverage (0.05X) sequence data in Hybrid Swarm populations derived from ancestors with phased haplotypes. We evaluate reconstructions using genetic variation from the Drosophila Genetic Reference Panel as well as variation from neutral simulations. We compared the power and precision of Genome-Wide Association Studies using the Hybrid Swarm, inbred lines, recombinant inbred lines (RILs), and highly outbred populations across a range of allele frequencies, effect sizes, and genetic architectures. Our simulations show that these different mapping panels vary in their power and precision, largely depending on the architecture of the trait. The Hybrid Swarm and RILs outperform inbred lines for quantitative traits, but not for monogenic ones. Taken together, our results demonstrate the feasibility of the Hybrid Swarm as a cost-effective method of fine-scale genetic mapping.

Nanopore sequencing and assembly of a human genome with ultra-long reads

10.1101/128835, 2017, Cited By ~ 51

Author(s): Miten Jain, S Koren, J Quick, AC Rand, TA Sasani, ...

Keyword(s): Human Genome, Cancer Progression, De Novo, Sequence Data, Point Of Care, Genetic Diseases, Nanopore Sequencing, Repeat Structure, Long Reads, Amazon Web Services

Abstract: Nanopore sequencing is a promising technique for genome sequencing due to its portability and its ability to sequence long reads from single molecules and to simultaneously assay DNA methylation. However, until recently nanopore sequencing has been mainly applied to small genomes, due to the limited output attainable. We present nanopore sequencing and assembly of the GM12878 Utah/Ceph human reference genome generated using the Oxford Nanopore MinION and R9.4 version chemistry. We generated 91.2 Gb of sequence data (∼30× theoretical coverage) from 39 flowcells. De novo assembly yielded a highly complete and contiguous assembly (NG50 ∼3Mb). We observed considerable variability in homopolymeric tract resolution between different basecallers. The data permitted sensitive detection of both large structural variants and epigenetic modifications. Further, we developed a new approach exploiting the long-read capability of this system and found that adding 5× coverage of ‘ultra-long’ reads (read N50 of 99.7kb) more than doubled the assembly contiguity. Modelling the repeat structure of the human genome predicts that extraordinarily contiguous assemblies may be possible using nanopore reads alone. Portable de novo sequencing of human genomes may be important for rapid point-of-care diagnosis of rare genetic diseases and cancer, and monitoring of cancer progression. The complete dataset including raw signal is available as an Amazon Web Services Open Dataset at: https://github.com/nanopore-wgs-consortium/NA12878.
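
The "theoretical coverage" figure is simply total sequenced bases divided by genome size. The sketch below reproduces that arithmetic from the numbers in the abstract; the human genome size of 3.1 Gb is an assumption made here (the abstract does not state which value was used), and the function name is illustrative.

```python
def theoretical_coverage(total_bases, genome_size):
    """Average depth if the sequenced bases were spread evenly over the genome."""
    return total_bases / genome_size

HUMAN_GENOME_BP = 3.1e9   # assumed genome size, not given in the abstract
TOTAL_BASES = 91.2e9      # 91.2 Gb of sequence data, from the abstract
FLOWCELLS = 39            # from the abstract

coverage = theoretical_coverage(TOTAL_BASES, HUMAN_GENOME_BP)
yield_per_flowcell_gb = TOTAL_BASES / FLOWCELLS / 1e9

print(f"coverage: {coverage:.1f}x")
print(f"mean yield per flowcell: {yield_per_flowcell_gb:.2f} Gb")
```

Under this assumed genome size the result is about 29.4×, consistent with the ∼30× quoted, and the mean yield works out to roughly 2.34 Gb per flowcell.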